
The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

EXPLOITING SPARSITY, STRUCTURE, AND GEOMETRY FOR

KNOWLEDGE DISCOVERY

A Dissertation in

Computer Science and Engineering

by

Anirban Chatterjee

© 2011 Anirban Chatterjee

Submitted in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

December 2011


The dissertation of Anirban Chatterjee was reviewed and approved* by the following:

Padma Raghavan
Professor of Computer Science and Engineering
Dissertation Adviser
Chair of Committee

Mahmut Taylan Kandemir
Professor of Computer Science and Engineering

Suzanne M. Shontz
Assistant Professor of Computer Science and Engineering

Kateryna Makova
Associate Professor of Biology

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

*Signatures are on file in the Graduate School.


Abstract

Data-driven discovery seeks to obtain a computational model of the underlying process

using observed data on a large number of variables. Observations can be viewed as points in

a high-dimensional space with coordinates given by values of the variables. It is common for

observations to have nonzero values in only a few dimensions, i.e., the data are sparse. We

seek to exploit the sparsity of the data by interpreting the observations as a sparse graph and

manipulating its geometry (or embedding) in a high-dimensional space. Our goal is to obtain al-

gorithms that demonstrate high accuracy for key problems in data analysis and parallel scientific

computing.

The first part of this dissertation focuses on combining geometry and sparse graph struc-

ture to yield more accurate classification algorithms for data mining. We have developed a

feature subspace transformation (FST) scheme that transforms the data iteratively and provides

a better selection of features to improve unsupervised classification, also known as clustering.

FST utilizes the combinatorial structure of the entity relationship graph and high-dimensional

geometry of the sparse data to iteratively bring related entities closer, which enhances cluster-

ing. Our approach improves clustering quality relative to established schemes such as K-Means

and multilevel K-Means (GraClus). Next we consider transformations to enhance supervised

classification with prelabeled data. We obtain similarity graph neighborhoods (SGN) in the

high-dimensional feature subspace of the training data and transform it by determining displace-

ments for each entity. Our SGN classifier is a supervised learning scheme that is trained on these

transformed data. The goal of our SGN transform is to increase the separation between means


of different classes, such that the classifier learns a better boundary. Our results indicate that

a linear discriminant and support vector machine classification on these SGN transformed data

improves accuracy by 5.0% and 4.52%, respectively.

The second part of this dissertation focuses on utilizing geometry and the structure of

sparse graphs to enhance the quality and performance of algorithms for parallel scientific com-

puting. We develop a parallel scheme, ScalaPart, for partitioning a large sparse graph into k

subgraphs such that the number of cross edges is reduced. Our scheme combines a parallel

graph embedding with a parallel geometric partitioning scheme to yield a scalable approach suit-

able for multicore-multiprocessors. Our analysis of ScalaPart demonstrates its scalability, and

our empirical evaluation indicates that the performance of ScalaPart and the quality of its cuts compare well with established schemes such as ParMetis and PT-Scotch. Next, we consider scalable

sparse linear system solution which plays a key role in many applications including data mining

with support vector machines and partial differential equation-based modeling and simulation.

Our graph partitioning algorithm can be used to yield a nested dissection fill-reducing ordering

and a tree for structuring numeric computations for the solution of sparse linear systems. We

develop a hybrid linear solver that couples tree-structured direct and iterative solves for repeated

right-hand side solutions. Our results indicate that our hybrid solver is 1.87 times faster than

preconditioned conjugate gradient (PCG)-based methods for achieving the same levels of accuracy.

In conclusion, this dissertation demonstrates that by combining geometric and combina-

torial properties of sparse graphs and matrices, we can enhance algorithms that are commonly

used in knowledge discovery through modeling and simulation.


Table of Contents

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

I Combining High-dimensional Geometry with Sparsity for Improving Accuracy of Data Mining 6

Chapter 2. Combining Geometry and Combinatorics for Enhanced Unsupervised and Su-

pervised Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Background on Unsupervised Classification using K-Means . . . . . . . . 8

2.3 Background on Supervised Classification . . . . . . . . . . . . . . . . . . 9

2.3.1 Linear Discriminant Classifier (LD) . . . . . . . . . . . . . . . . . 9

2.3.2 Support Vector Machine (SVM) . . . . . . . . . . . . . . . . . . . 10

2.4 Metrics of Classification Quality . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Chapter 3. Feature Subspace Transformation for Enhancing K-Means Clustering . . . 14

3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


3.2 FST-K-Means: Feature Subspace Transformations for Enhanced Classifi-

cation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.4 Toward optimal classification . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Chapter 4. Similarity Graph Neighborhoods for Enhanced Supervised Classification . 41

4.1 Exploiting Similarity Graph Neighborhoods for Enhancing SVM Accuracy 42

4.1.1 Determining γ-Neighborhoods in Similarity Graph G(B,A) . . . . 43

4.1.2 Transforming Training Data through Entity Displacement Vectors. . 45

4.1.3 Training an LDA or SVM on Transformed Data. . . . . . . . . . . 48

4.1.3.1 An Example of SGN Transformation using the Radial

Basis Function . . . . . . . . . . . . . . . . . . . . . . . 49

4.1.4 Classifying Test Data . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.1 Experimental Setup and Metrics for Evaluation. . . . . . . . . . . . 50

4.2.2 Artificial Dataset Results . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.3 Empirical Results on Benchmark Datasets. . . . . . . . . . . . . . 52

4.2.4 Why does classification improve with our SGN transformation? . . 53

4.2.5 Sensitivity of SGN-SVM Classification Accuracy . . . . . . . . . . 55

4.2.5.1 Effect of Growing or Shrinking the Neighborhood on Clas-

sification Accuracy . . . . . . . . . . . . . . . . . . . . 55


4.2.5.2 Effect of Changing Sparsity of G(B,A) on Classification

Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3 Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

II Scalable Geometric Embedding and Partitioning for Parallel Scientific Computing 67

Chapter 5. Background on Sparse Graph Partitioning and Sparse Linear System Solution 68

5.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.2 Graph Embedding and Graph Partitioning . . . . . . . . . . . . . . . . . . 69

5.2.1 Graph Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.2.2 Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.2.3 The Charm++ Parallel Framework . . . . . . . . . . . . . . . . . . 70

5.3 Sparse Linear Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3.1 Sparse Direct Solvers . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3.2 Preconditioned Conjugate Gradient . . . . . . . . . . . . . . . . . 71

5.3.3 Incomplete Cholesky Preconditioning . . . . . . . . . . . . . . . . 72

5.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

Chapter 6. Parallel Geometric Partitioning through Sparse Graph Embedding . . . . . 75

6.1 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . 77

6.2 ScalaPart: A Parallel Graph Embedding enabled Scalable Geometric Parti-

tioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


6.2.1 Structure of our Charm++ Parallel Graph Embedding . . . . . . . . 79

6.2.2 Implementation of a Data Parallel Geometric Partitioning using Charm++ 83

6.2.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.3 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 85

6.3.1 Experimental Setup and Evaluation Metrics . . . . . . . . . . . . . 86

6.3.2 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.3.3 Discussion on observed quality and performance . . . . . . . . . . 88

6.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Chapter 7. A Multilevel Cholesky Conjugate Gradients Hybrid Solver for Linear Systems

with Multiple Right-hand Sides . . . . . . . . . . . . . . . . . . . . . . 99

7.1 A New Multilevel Sparse Cholesky-PCG Hybrid Solver . . . . . . . . . . . 100

7.1.1 A One-level Hybrid . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.1.1.1 Obtaining a Tree-structured Aggregate of the Coefficient

Matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.1.1.2 Constructing a Hybrid Solution Scheme using the Tree-

structure . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7.1.2 A Multilevel Tree-based Hybrid Solver . . . . . . . . . . . . . . . 106

7.1.3 Computational Costs of our Hybrid Solver . . . . . . . . . . . . . 107

7.2 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.2.2 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . 109

7.3 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112


Chapter 8. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124


List of Tables

3.1 Test suite of datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2 Accuracy of classification of K-Means, GraClus and FST-K-Means. . . . . . . 30

3.3 Accuracy of classification for PCA (top three principal components) and FST-

K-Means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 Improvement or degradation (negative values) percentage of accuracy of FST-

K-Means relative to K-Means and GraClus. . . . . . . . . . . . . . . . . . . . 31

3.5 Cluster cohesiveness of K-Means, GraClus, PCA (top three principal compo-

nents), and FST. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6 Improvement (positive values) or degradation (negative values) percentage of

cohesiveness of FST-K-Means relative to K-Means and GraClus. . . . . . . . . 33

3.7 The lower bound, range and upper bound of cohesiveness across 100 runs of

FST-K-Means (top half) and K-Means in the original feature space. Observe

that FST-K-Means consistently satisfies the optimality bounds while K-Means

fails to do so for most datasets. The highlighted numbers in the lower table

indicate datasets for which the minimum value of cohesiveness exceeds the

upper bound on optimality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.1 Details of the fourclass dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2 Classification accuracy (percentage) of LDA, SGN-LDA, SVM, and SGN-SVM

for the fourclass dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.3 Description of UCI datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


4.4 Classification accuracy (as a percentage) and F1-Score for LDA and SGN-LDA

on benchmark datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5 Classification accuracy (as a percentage) and F1-Score for SVM and SGN-

SVM on benchmark datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.1 Details of benchmark graphs indicating the number of nodes (|V |) and edges

(|E|) in the graph with a brief description of the application domain. . . . . . . 92

7.1 Benchmark matrices from the University of Florida Sparse Matrix Collection [1]. 109

7.2 Operation counts (in millions) for 10 repeated right-hand side vectors and PCG

with IC Level-of-fill (IC0) and drop-threshold (ICT) preconditioner using (a)

natural ordering (NAT), (b) RCM ordering, and (c) nested dissection (ND) or-

dering, and (d) minimum degree (MMD) ordering. Hybrid-DI represents our

tree-based hybrid solver framework. Values in bold represent the best perfor-

mance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.3 Relative error for 10 repeated right-hand side vectors and PCG with IC Level-

of-fill (IC0) and drop-threshold (ICT) preconditioner using (a) natural ordering

(NAT), (b) RCM ordering, and (c) nested dissection (ND) ordering, and (d)

minimum degree (MMD) ordering. Hybrid-DI represents our tree-based hybrid

solver framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


List of Figures

1.1 Raw data (for example, blogs, newsgroup posts, gene sequences, census data)

are collected and subsequently processed (for example, using tokenizers) to

obtain a dataset, where rows represent samples and columns indicate variables

(or features). A cross symbol (×) indicates a nonzero feature in the observation. 3

1.2 (a) A matrix indicating relationships among different observations. (b) An em-

bedding of the relationship graph in two dimensions providing a geometry to

the graph structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3.1 Illustration of the three main steps of FST-K-Means. (1) A is a sparse matrix representing a dataset with 6 entities and 3 features. B ≈ AA^T is the adjacency matrix of the weighted graph G(B, A) with 6 vertices and 7 edges. (2a) FST is applied on G(B, A) to transform the coordinates of the vertices. Observe that the final embedded graph G(B, Â) has the same sparsity structure as G(B, A). (2b) The sparse matrix  represents the dataset with the transformed feature space. (3) K-Means is applied to the dataset  to produce a high quality clustering. . . . . . . . . 18


3.2 Plots of classification accuracy and the 1-norm of feature variance vector across

FST iterations. FST iterations are continued until feature variance decreases

relative to the previous iteration. . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3 Layout of entities in splice in the original (top) and transformed (bottom) fea-

ture space, projected onto the first three principal components. Observe that

two clusters are more distinct after FST. . . . . . . . . . . . . . . . . . . . . . 38

3.4 Layout of entities in two dimensional synthetic dataset in the original (top) and

transformed (bottom) feature space. . . . . . . . . . . . . . . . . . . . . . . . 39

3.5 Sensitivity of classification accuracy (P) of K-Means to number of principal

components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.1 Forming the sparse γ-neighborhood similarity graph G(B,A) from A. F (A)

represents the transformation described in Section 4.1.1. G(B,A) is a weighted

graph representation of matrix B. . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Entities a_p, a_q, a_u, a_j, and a_k represent the immediate neighbors of a_i. ∆_ij denotes the difference between two entity vectors a_i and a_j. . . . . . . . . . . . . 59

4.3 (a) Transforming the training data A using G(B, A) to a new graph G(B, Â). (b) We obtain new entity coordinates  from the graph G(B, Â). . . . . . . . . . . 60

4.4 Training an SVM on the transformed data matrix A to obtain separating hyper-

planes (shown in white). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.5 Percentage improvements in accuracy and F1-score with SGN-LDA over LDA. 61

4.6 Percentage improvements in accuracy and F1-score with SGN-SVM over SVM. 62


4.7 Illustration of the fourclass dataset after performing our SGN transformation.

Observe that the two classes separate out while elements maintain their relative

position within their class. A better separating boundary is obtained on this

transformed data using LDA (as shown above). . . . . . . . . . . . . . . . . . 63

4.8 Support vectors obtained (a) using SVM, and (b) using SGN-SVM training for

the fourclass dataset in the training data indicating two classes with support

vectors shown in black. Illustration of the (c) SVM separating plane, and (d)

the SGN-SVM separating plane on testing data. . . . . . . . . . . . . . . . . . 64

4.9 Illustration of two four-dimensional Gaussian processes (a) before transforma-

tion and (b) after SGN transformation that have been projected to two PCA

dimensions for the purpose of visualization. In both cases, the boundary is

obtained using LDA. Along side are the correlation matrices before and after

the transformation. The dashed boxes indicate the feature pairs that showed a

significant change in the correlation value after the SGN transformation. . . . . 65

4.10 (a) Effect of increasing the entity neighborhood parameter γ in G(B,A) (con-

sidering larger neighborhoods) on classification accuracy. When the sparse

neighborhood is a good approximation indicating similarity, adding elements

to the neighborhood does not alter classification accuracy. Effect of varying

number of common features q to create the similarity graph B on (b) sparsity of

B relative to A and (c) impact on overall classification accuracy. . . . . . . . . 66


6.1 Average layout time per iteration for benchmark graphs using 8, 16 and 32

cores. We include results for 8 cores to demonstrate the effect of doubling the

number of cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.2 Normalized edge cuts for 16 partitions with ParMetis as base set to 1. . . . . . 94

6.3 Normalized edge cuts for 32 partitions with ParMetis as base set to 1. . . . . . 94

6.4 Normalized time for (a) 16 partitions and (b) 32 partitions on 1 core with

ParMetis as base set to 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.5 Normalized time for (a) 16 partitions on 16 cores and (b) 32 partitions on 32

cores with ParMetis as base set to 1. . . . . . . . . . . . . . . . . . . . . . . . 96

6.6 Normalized time for 32 partitions on 32 cores with ParMetis as base set to 1.

The above normalized times represent the total time, including time for a coars-

ening phase, embedding of the coarsened graph, partitioning of the coarsened

graph, and subsequent refinement of the cut. . . . . . . . . . . . . . . . . . . . 97

6.7 A two partition illustration of a graph from a heat exchanger flow problem using

(a) ParMetis, (b) PT-Scotch, and (c) our framework. This example demonstrates

that a geometric partitioning scheme clearly achieves a competitive edgecut

with a good underlying embedding of the graph. . . . . . . . . . . . . . . . . . 98

7.1 (a) Matrix bcsstk11 with natural ordering; (b) a one-level nested dissection or-

dering of bcsstk11; (c) a supernodal tree representation of the one-level ordering. 113

7.2 (a) Matrix bcsstk11 with natural ordering; (b) a two-level nested dissection ordering of bcsstk11; (c) a supernodal tree representation of the two-level ordering. 114


7.3 Speedup obtained by our method over (a) best PCG and ordering combination,

and (b) average PCG performance across different variants for 10 right-hand

sides. A speedup greater than 1 indicates improvement. . . . . . . . . . . . . . 116

7.4 Impact on performance with increase in tree levels and number of right-hand

sides for the bcsstk10 matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.5 Effect of increasing (a) tree levels, and (b) right-hand sides on operations count

(in millions) of our Hybrid solver compared to the best PCG performance for

10 right-hand sides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118


Acknowledgments

I would like to acknowledge the National Science Foundation (NSF) for funding my graduate

research.

I am indebted to my thesis advisor, Professor Padma Raghavan, for her guidance, support

and constant encouragement. I am grateful to my committee members Professor Mahmut Kan-

demir, Dr. Suzanne Shontz, and Dr. Kateryna Makova for their invaluable contributions towards

my research. I would also like to thank my friends and colleagues Dr. Sanjukta Bhowmick

(Assistant Professor, Department of Computer Science, University of Nebraska), Manu Shan-

tharam (Graduate Student, The Pennsylvania State University), Michael Frasca (Graduate Stu-

dent, The Pennsylvania State University), Joshua Booth (Graduate Student, The Pennsylvania

State University), and Shad Kirmani (Graduate Student, The Pennsylvania State University) for

their insightful contributions towards our joint research activities.

Most importantly, I would like to thank my family for instilling into me the importance of

sincerity in all my endeavors, and my wife, Kolika, for being my support and inspiration during

my graduate student life at Penn State.


Chapter 1

Introduction

Large-scale simulation and modeling of processes in many application areas, such as

text mining [2], computational biology [3], astrophysics [4], and materials processing [5], rely

on data-driven techniques involving a large number of feature variables (or dimensions). Obser-

vations can be viewed as points embedded in a d-dimensional (d ≥ 2) subspace, which provides

a geometry to the observations. In particular, for a high-dimensional space in which d is equal

to the number of variables, the coordinates are specified by values of the feature variables. It

is common for observations to have nonzero values in only a few dimensions, which makes the

data extremely sparse. A sparse graph representation of the observations provides a structure,

indicating relationships in the data. Our goal is to obtain algorithms that demonstrate improved

accuracy for key problems in data analysis and parallel scientific computing by combining geo-

metric and structural properties of the data.

We elaborate further on sparsity, geometry, and structure of high-dimensional data us-

ing an example of text data analysis. Text data is typically obtained from web crawls of social,

economic, and scientific web media such as newsgroups, blogs, journals, and similar webpages. The crawled data files are processed and each document is represented as a vector comprising the frequency of occurrence of words from a predefined dictionary. Only a few words from the dictionary occur in each document, which accounts for the sparsity of the data. Figure 1.1 illus-

trates this using a simple example comprising twelve documents and nine features (or words).


These twelve documents can be viewed as data points embedded in a nine-dimensional geo-

metric space of words. Figure 1.2(a) presents a sparse graph representation of the relationship

between these documents. In our example, a document i is related to another document j if they

share at least one word. Figure 1.2(a) indicates a matrix representation of the sparse graph struc-

ture. The relationships between documents are understood better when structural information

with a two-dimensional spatial representation of the same sparse matrix is augmented as shown

in Figure 1.2(b). This example clearly demonstrates that a better understanding of processes can

be obtained by combining geometric and structural properties of the underlying sparse represen-

tations.
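To make the text-to-matrix step above concrete, the following minimal Python sketch (not taken from the dissertation; the toy documents, the dictionary construction, and the use of scipy.sparse are illustrative assumptions) builds a small sparse document-by-word frequency matrix of the kind depicted in Figure 1.1.

from collections import Counter
import scipy.sparse as sp

# Toy documents (stand-ins for crawled blog or newsgroup text).
documents = [
    "sparse graphs and sparse matrices",
    "partitioning sparse graphs for parallel computing",
    "clustering documents with k means",
]

# Predefined dictionary of words: one feature (column) per word.
dictionary = sorted({word for doc in documents for word in doc.split()})
column = {word: j for j, word in enumerate(dictionary)}

rows, cols, vals = [], [], []
for i, doc in enumerate(documents):
    for word, freq in Counter(doc.split()).items():
        rows.append(i)
        cols.append(column[word])
        vals.append(freq)

# A is N x R: row i holds the word-frequency vector of document i;
# only the few words that occur in a document are stored (sparsity).
A = sp.csr_matrix((vals, (rows, cols)), shape=(len(documents), len(dictionary)))
print(A.toarray())

In practice the crawled corpora discussed above are far larger, but the resulting matrix has the same structure: one row per document, one column per dictionary word, and nonzeros only where a word actually occurs.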

This dissertation is organized into two parts. Part I manipulates high-dimensional geometry

using sparse graph structure to improve classification problems in data mining. Part II utilizes

geometric and structural properties of sparse graphs to develop improved algorithms in parallel

scientific computing.

In the first part of this dissertation, comprising Chapters 2, 3, and 4, we focus on en-

hancing classification by combining high-dimensional geometry with sparse graph structure. In

Chapter 2, we provide background material on unsupervised classification, supervised classifi-

cation, and force-directed graph embedding. In Chapter 3, we consider improving unsupervised

classification using our Feature Subspace Transformation (FST) scheme [6]. Traditional data

clustering methods use either geometry based (K-Means [7]) or combinatorial (multilevel K-

Means [8–10]) measures to cluster data in a high-dimensional feature space. Our FST scheme


Fig. 1.1. Raw data (for example, blogs, newsgroup posts, gene sequences, census data) are collected and subsequently processed (for example, using tokenizers) to obtain a dataset, where rows represent samples and columns indicate variables (or features). A cross symbol (×) indicates a nonzero feature in the observation.

Fig. 1.2. (a) A matrix indicating relationships among different observations. (b) An embedding of the relationship graph in two dimensions providing a geometry to the graph structure.


brings related entities iteratively closer in the feature space by combining geometry with combi-

natorial structure using a force based method [11]. We apply traditional K-Means to the trans-

formed data to obtain FST-K-Means. Our results indicate that, on average, FST-K-Means im-

proves the internal quality metric (cluster cohesiveness) by 20.2% relative to K-Means and by

6.6% relative to Multilevel K-Means (GraClus). More significantly, FST-K-Means improves the

external quality (accuracy) by 14.9% relative to K-Means and by 23.6% relative to GraClus.

In Chapter 4, we develop our Similarity Graph Neighborhoods (SGN) based data transforma-

tion scheme for supervised classification [12]. Unlike FST, SGN is a one-step transform that is

applied to the data before the training phase of a supervised classifier. Our SGN approach trans-

forms the geometry of entities in the training data by considering similarity graph neighborhoods

in the high-dimensional feature space. An SGN classifier is a supervised classification scheme,

such as the support vector machine (SVM) [13, 14] or linear discriminant (LD) [12] classifier,

that has been trained on this transformed data. We demonstrate the accuracy of our classifier

on a suite of benchmark data. Our results indicate that SGN-LD improves accuracy by 5.0%

compared to LD, and SGN-SVM improves it by 4.62% compared to SVM.

In the second part of this dissertation, comprising Chapters 5, 6 and 7, we focus on com-

bining geometry with sparse graph structure to improve two key problems in parallel scientific

computing, namely, parallel graph partitioning and sparse linear system solution. In Chapter 5,

we present background material on graph partitioning, graph layout algorithms, and sparse linear

systems solutions. Sparse graph partitioning is an important component of many parallel scien-

tific and data mining applications, especially domain decomposition. However, scaling existing

graph partitioning algorithms in distributed multicore environments remains an open challenge.


In Chapter 6, we develop a tree-structured, parallel geometry embedding enabled, scalable ge-

ometric partitioning using the Charm++ [15] parallel programming system. Our parallel graph

embedding and geometric partitioning scheme, ScalaPart, first generates a geometric layout for a

sparse graph by using structural properties of the graph. This layout is subsequently partitioned

using a scalable parallel geometric partitioning scheme. Parallel geometric partitioning [16, 17]

is inherently more scalable than adjacency graph-based schemes due to a significantly reduced

need for communication. However, the partition quality can depend largely on the quality of

the geometric coordinates. Our analysis and empirical evaluation demonstrates that partitioning

quality of ScalaPart compares well with leading schemes, such as ParMetis or PT-Scotch and it is

highly scalable. In Chapter 7, we develop a multilevel tree-structured hybrid solver for symmet-

ric positive definite systems of the form Ax = b with multiple right-hand sides [18]. A hybrid

solution technique that interleaves phases of direct and iterative solutions serves as an effective

alternative to preconditioned conjugate gradients (PCG) [19]. The initial tree-structured domain

decomposition [20] takes advantage of the sparsity structure of the matrix A to find small dense

sub-blocks at the leaves of the tree that can be solved using a sparse direct Cholesky [21] solver.

Subsequently, these partial solutions can be combined at the separator blocks by traversing the

tree recursively. We provide an empirical evaluation on a suite of 12 benchmark matrices from

the University of Florida sparse matrix collection [22]. Our results indicate that our tree-based

hybrid solver is 1.87 times faster than PCG using different combinations of ordering, level-of-fill,

or threshold methods with an incomplete Cholesky preconditioner.

In Chapter 8, we conclude this dissertation with a discussion of our results, open prob-

lems, and possible future research directions.


Part I

Combining High-dimensional Geometry with Sparsity for

Improving Accuracy of Data Mining



Chapter 2

Combining Geometry and Combinatoricsfor Enhanced Unsupervised and Supervised Classification

Many data analysis applications typically obtain a geometry for the data by representing

sample points in a high-dimensional feature space. Additionally, a combinatorial representation

of the data is constructed in the form of a similarity graph [23] where edges indicate relation-

ships amongst data points and edge-weights indicate strength of these relationships. In this part

of the dissertation, our key focus is to utilize the geometry of the data in conjunction with its

combinatorial representation to improve the quality of unsupervised and supervised classifica-

tion in data mining [24]. In Chapter 3, we present a feature subspace transformation scheme

that combines geometric and combinatorial measures to improve quality of unsupervised clas-

sification. In Chapter 4, we develop a high-dimensional transformation scheme that improves

quality of supervised classification by utilizing local neighborhoods in the similarity graph of

the training data. This chapter provides the necessary background that is required to understand

key concepts in Chapters 3 and 4.

The rest of this chapter is organized as follows. Section 2.1 introduces our notation for

Chapters 3 and 4. Section 2.2 provides a brief overview of K-Means clustering. Section 2.3

presents a brief overview of supervised classification using Support Vector Machine (SVM)

and Linear Discriminant (LD) classifiers. Section 2.4 presents key metrics used to evaluate the

quality of unsupervised and supervised classifiers.


2.1 Notation

We represent matrices using bold upper-case alphabets, e.g., A. A lower-case alphabet

with an arrow denotes a vector, e.g., x. The notation x_ij refers to the j-th component of the i-th vector in a set of n vectors {x_i}_{i=1}^{n}. We use ∥ · ∥_2 to represent the two-norm of a vector. R denotes the set of real numbers.

2.2 Background on Unsupervised Classification using K-Means

The most common implementations of the K-Means algorithm are by Steinhaus [25],

Lloyd [26], Ball and Hall [27], and MacQueen [28]. In the first step, the four independent

implementations of K-Means initialize the positions of the centroids (the number of centroids is

equal to the number of pre-determined classes). Then the entities are assigned to their closest

centroid to form the initial clusters. The centroids are then recalculated based on the center of

mass of these clusters, and the entities are reassigned to clusters according to the new centroids.

These steps are repeated until convergence is achieved.
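For reference, the following short Python sketch (an illustrative reimplementation, not code from this dissertation or from the implementations cited above) follows the Lloyd-style iteration just described: assign entities to their closest centroid, recompute each centroid as the center of mass of its cluster, and repeat until the assignments stop changing.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Basic Lloyd-style K-Means on an N x R data matrix X.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign each entity to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Recompute each centroid as the center of mass of its cluster.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Example: two well-separated point clouds in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
labels, centroids = kmeans(X, k=2)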

Graph approaches to classification have also become popular, especially multilevel K-Means (GraClus) [8], which performs unsupervised classification through graph clustering. The

dataset is represented as a weighted graph, where each entity is a vertex and vertices are con-

nected by weighted edges. High edge weights denote greater similarity between corresponding

entities. Similar entities are identified by clustering of vertices with heavy edge weights. Gra-

Clus uses multilevel graph coarsening algorithms. The initial partition of the graph is generated

using a spectral algorithm [29], and the subsequent coarsening stages use a weighted kernel

K-Means.


2.3 Background on Supervised Classification

Supervised classification techniques use prelabeled data to learn models that can subse-

quently be applied to classify unlabeled data. In this section, we provide a brief overview of

two popular supervised learning schemes: (i) Linear Discriminant Analysis, and (ii) Support

Vector Machine. LDA and SVM represent two distinct types of classifiers; the former relies

on constructing a covariance based discriminant function, and the latter learns coefficients of a

separating plane in high-dimensions to classify future data.

2.3.1 Linear Discriminant Classifier (LD)

The goal of Linear Discriminant Classifier (LD) is to express a predictor variable as a

linear combination of various features in the data. Consider a set of n samples x_i, i = 1, . . . , n, and their corresponding class vector y, where y_i ∈ {0, 1} for binary classification. LD analysis is based on the assumption that the class conditional probability distributions are normally distributed with mean and covariance (µ_0, Σ_0) and (µ_1, Σ_1). Additionally, LD analysis imposes the condition that all classes of data in the training set have the same covariance parameter, that is, Σ_0 is equal to Σ_1. Classification using LD is obtained by computing the log-likelihood ratios for the two classes and assigning a sample to the first class if this ratio is greater than a threshold τ, or to the second class if this ratio is less than τ. Equation 2.1 states the LD classification criterion, where Σ_0 = Σ_1:

(x − µ_0)^T Σ_0^{-1} (x − µ_0) − (x − µ_1)^T Σ_1^{-1} (x − µ_1) < τ        (2.1)
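A minimal sketch of this decision rule is given below (illustrative only; it estimates a single pooled covariance to enforce Σ_0 = Σ_1, applies Equation 2.1 literally with the class labeled 0 taken as the first class, and is not the LD implementation evaluated in later chapters).

import numpy as np

def ld_train(X, y):
    # Class means and a pooled covariance (enforces Sigma_0 = Sigma_1).
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    centered = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    sigma_inv = np.linalg.inv(np.cov(centered, rowvar=False))
    return mu0, mu1, sigma_inv

def ld_classify(x, mu0, mu1, sigma_inv, tau=0.0):
    # Equation 2.1: compare the two squared Mahalanobis distances against tau.
    d0 = (x - mu0) @ sigma_inv @ (x - mu0)
    d1 = (x - mu1) @ sigma_inv @ (x - mu1)
    return 0 if d0 - d1 < tau else 1

# Example with two Gaussian classes sharing a covariance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.0, (100, 3))])
y = np.hstack([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
mu0, mu1, sigma_inv = ld_train(X, y)
pred = np.array([ld_classify(x, mu0, mu1, sigma_inv) for x in X])
print("training accuracy:", (pred == y).mean())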


2.3.2 Support Vector Machine (SVM)

Supervised learning using support vector machines was first proposed by Vladimir Vap-

nik as a support vector regression [13, 14] method. Subsequently, SVM gained popularity as a

classifier that requires only a subset of the training points to build maximally separated hyper-

planes to classify the data.

Consider a training set of data (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n) with x_i ∈ R^n and y_i ∈ {−1, +1}. Define ξ_i as the training error associated with the i-th data point and C as a constant defining a bound on the training error. We formulate and solve an optimization problem that seeks to maximize the separation between the separating hyperplanes. The goal is to find w, the vector of weights that defines the separating plane w^T x = 0. In a binary SVM classification problem, a simple classification rule h(x) = sign(w^T x + b) is used to classify the given test data, where b can be easily modeled. Equation (2.2) represents the primal form of the problem.

min_{w, ξ_i ≥ 0}   (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i        (2.2)

s.t.   y_i (w^T x_i) ≥ 1 − ξ_i,   ∀ i ∈ {1, · · · , n}

Alternatively, recent implementations of SVM [30] solve the equivalent dual formulation of the primal problem (stated in Equation 2.2). Equation 2.3 presents the dual formulation of the primal SVM optimization problem. The variable α_i indicates the i-th Lagrange multiplier, and c_i ∈ {0, 1} indicates whether the corresponding constraint is active (c_i = 1) or inactive (c_i = 0).

max_{α ≥ 0}   (1/n) Σ_i ∥c_i∥_1 α_i − (1/2) Σ_i Σ_j α_i α_j x_i^T x_j

s.t.   Σ_i α_i ≤ C        (2.3)

There are multiple implementations of the SVM classifier available [30–34]. For our research, we use Joachims' SVM-Perf [30] package, which solves the SVM dual optimization problem using the active set method [57].
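For readers who wish to reproduce the general setup, the sketch below trains a linear SVM using scikit-learn's LinearSVC as a generic stand-in for SVM-Perf (the package actually used in this dissertation); the dataset and the value of C are illustrative assumptions.

import numpy as np
from sklearn.svm import LinearSVC

# Toy binary problem: two Gaussian clouds with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(3.0, 1.0, (100, 5))])
y = np.hstack([-np.ones(100), np.ones(100)])

# C bounds the total training error, as in the primal of Equation (2.2).
clf = LinearSVC(C=1.0)
clf.fit(X, y)

# The learned weights w define the separating plane; prediction follows
# the rule h(x) = sign(w^T x + b).
w, b = clf.coef_.ravel(), clf.intercept_[0]
predictions = np.sign(X @ w + b)
print("training accuracy:", (predictions == y).mean())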

2.4 Metrics of Classification Quality

The quality of unsupervised classification is typically evaluated using external and in-

ternal metrics. In addition to external quality, supervised classification measures precision and

recall to evaluate quality.

The external metric evaluates the accuracy of classification (the higher the value, the

better the classification) as the ratio of correctly classified entities to the total number of entities.

This metric is dependent on a subjective pre-determined labeling. This measure, though an

indicator of correctness, cannot be incorporated into the design of classification algorithms. We

henceforth refer to this metric as “accuracy” denoted by P defined as,

P =Number of correctly classified entities

Total number of entities. (2.4)

The internal metric measures the cohesiveness of the clusters, i.e., the sum of the square

of the distance of the clustered points from their centroids. We henceforth refer to this metric as

“cohesiveness”, denoted by J. If M_1, . . . , M_k are the k clusters and µ_i represents the centroid of cluster M_i, the cohesiveness J is defined as:

J = Σ_{h=1}^{k} Σ_{x ∈ M_h} ∥x − µ_h∥²        (2.5)

If similar entities are clustered closely, then they are generally easier to classify and

so a lower value of J is preferred. The cohesiveness is an internal metric because it can be

incorporated into the design of the classification algorithm; several classification methods seek

to minimize the objective function given by J .

We evaluate the classification quality in further detail using two additional metrics, (a)

precision, and (b) recall. In order to understand precision and recall, we first define four cate-

gories: true positive, true negative, false positive, and false negative. True positive (tp) indicates

a scenario where the class was positive and the prediction was positive. True negative (tn) in-

dicates a scenario where the class was negative and the prediction was negative. False positive

(fp) indicates a scenario where the class was negative and the prediction was positive. False negative (fn) indicates a scenario where the class was positive and the prediction was negative. Precision is the ratio of true positives to the sum of true positives and false positives, as shown in Equation 2.6. Recall is the ratio of true positives to the sum of true positives and false negatives, as shown in Equation 2.7.

precision = tp / (tp + fp)        (2.6)

recall = tp / (tp + fn)        (2.7)
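These metrics are straightforward to compute; the helper functions below are an illustrative sketch (not the evaluation code used in this dissertation) of accuracy (Equation 2.4), cohesiveness (Equation 2.5), and precision and recall (Equations 2.6 and 2.7).

import numpy as np

def accuracy(y_true, y_pred):
    # External metric P of Equation (2.4).
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def precision_recall(y_true, y_pred, positive=1):
    # Precision and recall of Equations (2.6) and (2.7).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp / (tp + fp), tp / (tp + fn)

def cohesiveness(X, labels, centroids):
    # Internal metric J of Equation (2.5): sum of squared distances of
    # clustered points from their centroids (lower is better).
    return sum(np.sum((X[labels == h] - centroids[h]) ** 2)
               for h in range(len(centroids)))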


2.5 Chapter Summary

In this chapter, we briefly discussed necessary background on data analysis using clas-

sification techniques. In Section 2.2, we discussed the basic K-Means clustering algorithm and

its popular combinatorial variant GraClus. Subsequently, we discussed the algorithmic steps of

two important supervised classification techniques, namely, support vector machines and linear

discriminant classifiers. Finally, we discussed metrics that are used to evaluate unsupervised and

supervised classification techniques.


Chapter 3

Feature Subspace Transformationfor Enhancing K-Means Clustering

Unsupervised classification is used to identify similar entities in a dataset and is exten-

sively used in many application domains such as spam filtering [35], medical diagnosis [36],

demographic research [37], etc. Unsupervised classification using K-Means generally clusters

data based on (i) distance-based attributes of the dataset [25–28], or (ii) combinatorial prop-

erties of a weighted graph representation of the dataset [8].

Classification schemes, such as K-Means [7], that use distance-based attributes view

entities of the dataset as existing in an n-dimensional feature space. The value of the i-th feature

of an entity determines its coordinate in the i-th dimension of the feature space. The distance

between the entities is used as a classification metric. The entities that lie close to each other are

assigned to the same cluster.

Combinatorial techniques for clustering, such as multilevel K-Means (GraClus) [8], rep-

resent the dataset as a weighted graph, where the entities are represented by vertices. The edge

weights of the graph indicate the degree of similarity between the entities. A highly weighted

subgraph forms a class, and its vertices (entities) are given the same label.

In this chapter, we present a feature subspace transformation (FST) scheme to transform

the dataset before the application of K-Means (or other distance-based clustering schemes). A

unique attribute of our FST-K-Means method is that it utilizes both distance-based and com-

binatorial attributes of the original dataset to seek improvements in the internal and external


quality metrics of unsupervised classification. FST-K-Means starts by forming a weighted graph

with the entities as vertices that are connected by weighted edges indicating a measure of hav-

ing shared features. The vertices of the graph are initially viewed as being embedded in the

high-dimensional feature subspace, i.e., the coordinates of each vertex (entity) are given by the

values of its feature vector in the original dataset. This initial layout of the weighted graph is

transformed by a special form of a force-directed graph embedding algorithm that attracts similar

entities. The nodal coordinates of the embedded graph provide a feature subspace transformation

of the original dataset; K-Means is then applied to the transformed dataset.

The remainder of this chapter is organized as follows. In Section 3.1, we provide a brief

review of related classification schemes and a graph layout algorithm that we adapt for use in

our FST-K-Means. In Section 3.2, we develop FST-K-Means including its feature subspace

transformation scheme and efficient implementation. In Section 3.3, we provide an empirical

evaluation of the effectiveness of FST-K-Means using a test suite of datasets from a variety

of domains. In Section 3.4, we attempt to quantitatively characterize the performance of FST-

K-Means by demonstrating that the clustering obtained from our method satisfies optimality

constraints for the internal quality. We present brief concluding remarks in Section 3.5. We

would like to note that content in this chapter is mainly obtained from our research that appears

in [6].

3.1 Preliminaries

We now provide a brief overview of the Fruchterman-Reingold (FR) [11] graph layout

method that we later adapt in Section 3.2 for use as a component in our FST-K-Means.


Graph layouts with the Fruchterman and Reingold scheme. Graph layout methods are de-

signed to produce aesthetically pleasing graphs embedded in a two-(or three)-dimensional space.

A popular technique is the Fruchterman and Reingold (FR) [11] algorithm that considers a graph

as a collection of objects (vertices) connected together by springs (edges). Initially, the vertices

in the graph are placed at randomly assigned coordinates. The FR model assumes that there are

two kinds of forces acting on the vertices: (i) attraction only between connected vertices due

to the springs, and (ii) repulsion between all vertices due to mutually charged objects. If the

Euclidean distance between two vertices u and v is d_uv, and k is a constant proportional to the square root of the ratio of the embedding area to the number of vertices, then the attractive force F_A^FR and the repulsive force F_R^FR are calculated as:

F_A^FR(u, v) = −k² / d_uv   and   F_R^FR(u, v) = d_uv² / k.

At each iteration of FR, the vertices are moved in proportion to the calculated attractive or re-

pulsive forces until the desired layout is achieved.

In Section 3.2, we adapt the FR algorithm as a component in our feature subspace trans-

formation in high dimensions. Through this transformation, we essentially seek embeddings that

enhance cluster cohesiveness while improving the accuracy of classification.

3.2 FST-K-Means: Feature Subspace Transformations for Enhanced Classifica-

tion

We now develop our feature subspace transformation scheme which seeks to produce

a transformed dataset to which K-Means can be applied to yield high quality classifications.


Our FST-K-Means algorithm seeks to utilize both distance-based and combinatorial measures

through a geometric interpretation of the entities of the dataset in its high-dimensional feature

space.

Consider a dataset of N entities and R features represented by an N × R sparse matrix A. The i-th row of the matrix A, a_i,∗, represents the feature vector of the i-th entity in the dataset. Thus, we can view entity i as being embedded in the high-dimensional feature space (dimension ≤ R) with coordinates given by the feature vector a_i,∗.

The N × N matrix B ≈ AA^T represents the entity-to-entity relationship. B forms the adjacency matrix of the undirected graph G(B, A), where b_i,j is the edge weight between vertices i and j in the graph, and a_i,k is the coordinate of vertex i in the k-th dimension.

FST is then applied to G(B, A) to transform the coordinates of the vertices, producing the graph G(B, Â). It should be noted that the structure of the adjacency matrix, i.e., the set of edges in the graph, remains unchanged. The transformed coordinates of the vertices are now represented by the N × R sparse matrix Â. The transformed dataset, or equivalently its matrix representation Â, has exactly the same sparsity structure as the original, i.e., â_i,j = 0 if and only if a_i,j = 0. However, typically â_i,j ≠ a_i,j as a result of the feature subspace transformation, thus changing the embedding of entity i. Now the i-th entity is represented in a transformed feature subspace (dimension ≤ R) by the feature vector â_i,∗. K-Means is applied to this transformed dataset, represented by Â, to yield FST-K-Means.

FST-K-Means comprises three main steps: (i) forming an entity-to-entity sparse, weighted embedded graph G(B, A); (ii) feature subspace transformation (FST) of G(B, A) to yield the transformed embedded graph G(B, Â); and (iii) applying K-Means to  for classification.


Fig. 3.1. Illustration of the three main steps of FST-K-Means.
(1) A is a sparse matrix representing a dataset with 6 entities and 3 features. B ≈ AA^T is the adjacency matrix of the weighted graph G(B, A) with 6 vertices and 7 edges.
(2a) FST is applied on G(B, A) to transform the coordinates of the vertices. Observe that the final embedded graph G(B, Â) has the same sparsity structure as G(B, A).
(2b) The sparse matrix  represents the dataset with the transformed feature space.
(3) K-Means is applied to the dataset  to produce a high quality clustering.


Algorithm 1 procedure FST(A)
   n ← number of entities
   B ← AA^T
   for i = 1 to MAX_ITER do
      Initialize displacements: ∆ ← (0_1^T, . . . , 0_n^T)
      Compute displacements:
      for all edges (u, v) in G(B, A) do
         F_A^{uv} ← −(k² × w_uv) / (d_uv × i)
         dist ← ∥A_v − A_u∥
         ∆_u ← ∆_u + ((A_v − A_u) / dist) × F_A^{uv}
         ∆_v ← ∆_v − ((A_v − A_u) / dist) × F_A^{uv}
      end for
      Update entity positions:
      for j = 1 to n do
         A_j ← A_j + ∆_j
      end for
   end for


Figure 3.1 illustrates these steps using a simple example.

Forming an entity-to-entity, weighted embedded graph G(B, A). Consider the dataset represented by the N × R sparse matrix A. Form B ≈ AA^T. Although B could be computed exactly as AA^T, approximations that compute only a subset of representative values may also be used. Observe that b_i,j, which represents the relationship between entities i and j, is given by a_i,∗ · a_j,∗, i.e., the dot product of the feature vectors of the i-th and j-th entities; thus, b_i,j is proportional to the cosine similarity of the two entities in the feature space. Next, view the matrix B as representing the adjacency matrix of the undirected weighted graph G(B), where vertices (entities) u and v are connected by edge (u, v) if b_u,v is nonzero; the weight of the edge (u, v) is set to b_u,v. Finally, consider the weighted graph G(B) of entities as being located in the high-dimensional feature space of A, i.e., vertex v has coordinates a_v,∗. Thus, G(B, A) captures the combinatorial information of the entity-to-entity relationship, similar to graph clustering methods like GraClus. However, G(B, A) also uses the distance attributes in A to add geometric information in the form of the coordinates of the vertices (entities).
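A minimal sketch of this construction is shown below (illustrative only; it assumes B is computed exactly as AA^T with self-loops discarded, whereas the text above also allows approximate or sampled versions of B).

import numpy as np
import scipy.sparse as sp

# A: N x R sparse entity-by-feature matrix (toy values for illustration).
A = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 1.0],
                            [4.0, 0.0, 1.0],
                            [0.0, 2.0, 0.0]]))

# Entity-to-entity weights: b_uv = a_u . a_v (dot product of feature vectors).
B = (A @ A.T).tolil()
B.setdiag(0)              # discard self-loops; keep edges between distinct entities
B = B.tocsr()
B.eliminate_zeros()

# Edges of G(B, A) with their weights; vertex coordinates are the rows of A.
for u, v in zip(*B.nonzero()):
    if u < v:
        print(f"edge ({u}, {v}) with weight {B[u, v]}")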

Feature Subspace Transformation of G(B, A) to G(B, Â). We develop FST as a variant of the FR graph layout algorithm [11], which is described in Section 3.1, to obtain G(B, Â) from G(B, A).

Although FST is motivated from the FR-graph layout algorithm, it is significantly differ-

ent in the following aspects.

(i) FST operates in the high-dimensional feature subspace unlike the FR scheme which seeks


layouts in two or three dimensions. Thus, unlike FR which begins by randomly assigning coor-

dinates in 2 or 3 dimensional space to the vertices of a graph, we begin with an embedding of

the graph in the feature subspace.

(ii) The original FR scheme assumes that the vertices can be moved freely in any dimension of

the embedding space. However, for our purposes, it is important to restrict their movement to

prevent entities from developing spurious relationships to features, i.e., relationships that were

not present in the original A. We therefore allow vertices to move only in the dimensions where

their original feature values are nonzero.

(iii) The goal of FST is to bring highly connected vertices closer in the feature space, in contrast

to FR objectives that aim to obtain a visually pleasing layout. Therefore at each iteration of

FST, we move the vertices based only on the attractive force (between connected vertices) and

eliminate the balancing effect of the repulsive force. Furthermore, we scale the attractive force

by the edge weights to reflect higher attraction from greater similarity between the correspond-

ing entities. Together, these modifications can cause heavily connected vertices to converge to

nearly the same position. While this effect might be desirable for classification, it hampers the

computation of attractive forces for the next iteration of FST. In particular, very small distances

between vertices cause an overflow error due to division by zero. We mitigate this problem by

scaling the force by the number of iterations to ensure that at higher iterations, the effect of the

displacement is less pronounced. In summary, at FST iteration i, the attractive force between

two vertices u and v with edge weight w_u,v and Euclidean distance d_u,v is given by:

F_i^FST = − (k² · w_u,v) / (d_u,v · i).


In the expression above, k is a constant proportional to the square root of the ratio of the embed-

ding area by the number of vertices as in the original FR scheme.
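A compact Python rendering of one run of this transformation is sketched below. It is an illustrative reimplementation of Algorithm 1, not the code evaluated in Section 3.3: the attraction magnitude follows the expression above, the sign convention is chosen so that connected vertices move toward each other (the stated goal of FST), and movement is masked to the dimensions in which an entity's original feature value is nonzero, as required by modification (ii).

import numpy as np

def fst(A, B, k, max_iter=10):
    # A: N x R array of entity coordinates (feature vectors), dense for clarity.
    # B: N x N array of edge weights (B ~= A A^T with zero diagonal).
    # k: constant from the FR/FST force model.
    A_hat = A.astype(float).copy()
    move_mask = (A != 0)          # modification (ii): move only along original nonzero features
    eps = 1e-12
    for i in range(1, max_iter + 1):
        delta = np.zeros_like(A_hat)
        for u, v in zip(*np.nonzero(np.triu(B, 1))):   # visit each edge once
            diff = A_hat[v] - A_hat[u]
            dist = np.linalg.norm(diff) + eps
            # Attraction magnitude, scaled by edge weight and damped by iteration i.
            force = (k ** 2) * B[u, v] / (dist * i)
            step = (diff / dist) * force
            delta[u] += step      # u moves toward v
            delta[v] -= step      # v moves toward u
        A_hat += delta * move_mask
        # (the variance-based stopping test described later in this chapter could break here)
    return A_hat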

Applying K-Means to  for Classification.  forms the matrix representation of the dataset in the transformed feature space, where â_i,∗ represents the new coordinates of vertex (entity) i. The sparsity structure of  is identical to that of A, the original dataset. K-Means is then applied to Â, the transformed dataset.

Computational Costs and Stopping Criteria of FST. In FST-K-Means, the cost per

iteration of K-Means stays unaffected because it operates on Â, which has exactly the same nonzero pattern as the original A, i.e., â_i,j = 0 if and only if a_i,j = 0. Additionally, although FST may change the number of K-Means iterations, it should not affect the worst-case complexity, which is superpolynomial with a lower bound of 2^{Ω(√N)} iterations for a dataset of N entries [38].

Thus the main overheads in FST-K-Means are those for FST.

The first step, that of forming G(B,A), is similar to the graph setup step in GraClus

and other graph clustering methods. Even if B is computed exactly, its costs are no more than

Σi=1:n

nnz2

iwhere nnz

iis the number of nonzero feature values of the i-th entity. The cost of

FST is given by the number of iterations multiplied by the cost per iteration, which is propor-

tional to the number of edges in the graph of B.

Depending on the degree of similarity between the entities, G(B,A) can be dense, thereby

increasing the cost per iteration of FST. Consequently, an implementation of FST could benefit

from using sparse approximations to the graph through sampling, though this could adversely


affect the classification results.

Stopping Criteria for FST. The number of iterations required for an ideal embedding using FST varies according to the feature values in the dataset. In this subsection, we describe how we identify a convergence criterion that promotes improved classification.

An ideal embedding would simultaneously satisfy the following two properties: (i) similar entities are close together, and (ii) dissimilar entities are far apart. The first property is incorporated into the design of FST. We therefore seek to identify a near-ideal embedding based on the second property, i.e., the inter-cluster distance [39], which is estimated by the distance of the entities from their global mean.

We next show in Lemma 3.1 below how the distance of the entities from their global

mean is related to feature variance. We use this relation to determine our convergence test for

terminating FST iterations.

Lemma 3.1. Consider a dataset of $N$ entities and $R$ features represented as a sparse matrix $A$; entity $i$ can be viewed as embedded in the feature space at the coordinates given by the $i$-th feature vector, $a_{i,*} = [a_{i,1}, \cdots, a_{i,R}]$. Let $f_q$ be the feature variance vector, and let $d_q$ be the vector of distances of each entity from the global mean, i.e., the centroid of all entities, at iteration $q$ of FST. Then the following relation is satisfied by the $f$ and $d$ vectors:

$$\|f_q\|_1 = \frac{1}{N}\,\|d_q\|_2^2.$$


Proof. Let $a_{i,j}$ denote feature $j$ of entity $i$; $a_{i,j}$ is also the $j$-th coordinate of entity $i$. Let the mean of the feature vector $a_{*,j}$ be $\tau_j = \frac{1}{N}\sum_{k=1}^{N} a_{k,j}$ and the variance of the feature vector $a_{*,j}$ be $\phi_j = \frac{1}{N}\sum_{k=1}^{N} (a_{k,j} - \tau_j)^2$.

Let $f_q = [\phi_1, \phi_2, \cdots, \phi_R]$ be the vector of feature variances. Now $\|f_q\|_1 = \sum_{j=1}^{R} |\phi_j| = \sum_{j=1}^{R} \phi_j$, because $\phi_j \ge 0$.

Let the global mean of the entities be represented by $\mu$. The $i$-th coordinate of $\mu$ is $\mu_i = \frac{1}{N}\sum_{k=1}^{N} a_{k,i} = \tau_i$.

Let $\delta_i$ be the distance of entity $i$ from $\mu$. Therefore $\delta_i = \sqrt{\sum_{k=1}^{R} (a_{i,k} - \mu_k)^2} = \sqrt{\sum_{k=1}^{R} (a_{i,k} - \tau_k)^2}$. Let $d_q = [\delta_1, \delta_2, \cdots, \delta_N]$ be the vector of distances of the entities from the global mean $\mu$. The squared 2-norm of $d_q$ is given by:

$$\|d_q\|_2^2 = \sum_{i=1}^{N} \delta_i^2 = \sum_{i=1}^{N} \sum_{k=1}^{R} (a_{i,k} - \tau_k)^2 = \sum_{k=1}^{R} \sum_{i=1}^{N} (a_{i,k} - \tau_k)^2 = \sum_{k=1}^{R} N\phi_k = N\,\|f_q\|_1.$$

A high value of $\|f_q\|_1$ implies high variance among the features of the entities in the transformed space. High variance among features indicates easily distinguishable entities. Lemma 3.1 shows that $\|f_q\|_1 = \frac{1}{N}\|d_q\|_2^2$, i.e., high feature variance is


proportional to the mean squared distance of the entities from their global mean. The value of $\|d_q\|_2$ can be computed cheaply, and we base our stopping criterion on this value.

Our heuristic for terminating FST is as follows: continue layout iterations while $\|d_{i+1}\|_2 > \|d_i\|_2$, and stop as soon as $\|d_{i+1}\|_2 \le \|d_i\|_2$. That is, we terminate FST when successive iterations fail to increase the overall distance of the entities from their global mean. As illustrated in Figure 3.2, this heuristic produces an embedding that can be classified with high accuracy for a sample dataset.
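A minimal sketch of this stopping test is given below, assuming the coordinates are held in a NumPy array and that `step` is any function performing one FST iteration (for example, the sketch shown earlier with its extra arguments bound). The function names are illustrative assumptions.

```python
import numpy as np

def distance_norm(X):
    """||d||_2 for the current embedding X (N x R): distances of the entities
    from their global mean (the centroid of all entities)."""
    mu = X.mean(axis=0)
    return np.linalg.norm(X - mu)

def run_fst(X, step, max_iters=100):
    """Run FST-style iterations until ||d|| stops increasing."""
    prev = distance_norm(X)
    for it in range(1, max_iters + 1):
        X_next = step(X, it)
        cur = distance_norm(X_next)
        if cur <= prev:          # successive iteration failed to increase ||d||
            return X             # keep the previous embedding
        X, prev = X_next, cur
    return X
```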

Impact of FST on clustering. We now consider the impact of FST on a sample dataset,

namely splice [40], with two clusters. Figure 3.3 shows the layout of the entities of splice, in both

the original and the transformed feature space. For ease of viewing, the entities are projected to

the first three principal components. Observe that the clusters are not apparent in the original

dataset. However, FST repositions the entities in the feature space into two distinct clusters,

thereby potentially facilitating classification.

3.3 Evaluation and Discussion

We now provide an empirical evaluation of the quality of classification using FST-K-

Means. We define external and internal quality metrics for evaluating classification results. We

use these metrics for a comparative evaluation of the performance of FST-K-Means, K-Means,

and GraClus on a test suite of eight datasets from a variety of applications.

Experimental setup. Our experiments compare the accuracy of our feature subspace

transformation method coupled with K-Means clustering versus a commercial implementation

of K-Means. We first apply K-Means clustering to the unmodified collection of feature values as


obtained from the dataset. We refer to the use of the algorithm on this dataset as K-Means. We

then apply K-Means clustering to the dataset transformed using FST as described in Section 3.2.

We refer to this scheme as FST-K-Means in our experimental evaluation.

In both sets of experiments, we use K-Means based on Lloyd's algorithm (as implemented in the MATLAB [41] Statistics Toolbox), where the centroids are initialized randomly and K-Means is limited to a maximum of 25 iterations. The quality of clustering using K-Means varies

depending on the initial choice of centroids. To mitigate this effect, we execute 100 runs of K-

Means, each time with a different set of initial centroids. To ensure a fair comparison, the initial

centroids for each execution of FST-K-Means are selected to be exactly the same as those for the

corresponding K-Means execution.

The value of the cohesiveness metric is dependent on the feature vector of the entities.

Therefore, for the set FST-K-Means where the values of the features have been modified, we

compute cohesiveness by first identifying the elements in the cluster and then transforming them

back to the original space. We thereby ensure that the modified datasets are used exclusively

for cluster identification and that the values of cohesiveness are not affected by the transformed

subspace.

The GraClus scheme is executed on the weighted graph representation. The results of

GraClus do not significantly change across runs. Consequently, one execution of this scheme is

sufficient for our empirical evaluation.

The computational cost of FST is given by the number of nonzeros in matrix $B$, which is less than $O(N^2)$ per iteration, where $N$ is the number of entities in the dataset.


Evaluation of FST-K-Means on a synthetic dataset. We first evaluate our model using arti-

ficially generated data. Our goal is to easily demonstrate and explain the steps of our FST-K-

Means algorithm using this data before we test our method on benchmark datasets from different

repositories.

Figure 3.4 represents our test case in which three Gaussian clusters are formed in two

dimensions using three different cluster means and covariance matrices. Figure 3.4(a) represents

the original synthetic data. Figure 3.4(b) represents the same data points after the FST transfor-

mation has been applied to it. Even before formally clustering the two datasets (original and transformed), a visual inspection of the layout suggests that clusters obtained using

the transformed dataset are likely to be distinct. Finally, on clustering the two datasets using

K-Means, we observe that K-Means on the transformed dataset is 3% more accurate than K-

Means on the original data. Matrices $Z_{K\text{-}Means}$ and $Z_{FST\text{-}K\text{-}Means}$ in Equations 3.1 and 3.2, respectively, are the confusion matrices obtained using the two datasets. In our example, the cluster label of each entity is known a priori; therefore, after clustering the original or transformed data, we compare the result to the true cluster labels. The diagonal element $Z^{(ii)}_{K\text{-}Means}$ of matrix $Z_{K\text{-}Means}$ indicates the number of entities that were assigned correctly to cluster $i$, and the off-diagonal element $Z^{(ij)}_{K\text{-}Means}$, where $i \neq j$, indicates the number of elements of cluster $i$ that were mapped to cluster $j$ ($Z_{FST\text{-}K\text{-}Means}$ is formed in the same way as $Z_{K\text{-}Means}$). We observe that our FST transformation scheme clearly obtains higher precision and accuracy.

$$Z_{K\text{-}Means} = \begin{bmatrix} 59 & 16 & 25 \\ 6 & 78 & 16 \\ 6 & 3 & 91 \end{bmatrix} \qquad (3.1)$$


$$Z_{FST\text{-}K\text{-}Means} = \begin{bmatrix} 68 & 11 & 21 \\ 10 & 82 & 8 \\ 7 & 8 & 85 \end{bmatrix} \qquad (3.2)$$
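As a check on how such a confusion matrix is read, the short sketch below computes overall accuracy as the ratio of the diagonal sum (correctly assigned entities) to the total number of entities. It is a generic helper written for this illustration, not code from the dissertation.

```python
import numpy as np

def accuracy_from_confusion(Z):
    """Overall accuracy: correctly assigned entities (diagonal) over all entities."""
    Z = np.asarray(Z)
    return np.trace(Z) / Z.sum()

# The confusion matrices from Equations 3.1 and 3.2:
Z_kmeans = [[59, 16, 25], [6, 78, 16], [6, 3, 91]]
Z_fst    = [[68, 11, 21], [10, 82, 8], [7, 8, 85]]
print(accuracy_from_confusion(Z_kmeans), accuracy_from_confusion(Z_fst))
```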

Evaluation of FST-K-Means on benchmark datasets. We test the K-Means and FST-K-Means classification algorithms on datasets from the UCI [42], Delve [40], Statlog [43], SMART [44]

and Yahoo 20Newsgroup [45] repositories. The parameters of each dataset, along with their

application domain, are shown in Table 3.1. Datasets dna, 180txt, and 300txt have three classes.

The rest of the datasets represent binary classification problems. We select two classes, comp.graphics

and alt.atheism, from the Yahoo 20Newsgroup dataset and form the 20news binary classification

set.

Name            Samples   Features   Source (Type)
adult a2a       2,265     123        UCI (Census)
australian      690       14         UCI (Credit Card)
breast-cancer   683       10         UCI (Census)
dna             2,000     180        Statlog (Medical)
splice          1,000     60         Delve (Medical)
180txt          180       19,698     SMART (Text)
300txt          300       53,914     SMART (Text)
20news          1,061     16,127     Yahoo 20Newsgroup (Text)

Table 3.1. Test suite of datasets.


Comparative evaluation of K-Means, GraClus, and FST-K-Means. We now compare the quality of classification using the accuracy and cohesiveness metrics for the datasets in Table 3.1 using K-Means, GraClus, and FST-K-Means.

Accuracy. Table 3.2 reports the accuracy of classification of the three schemes. The

values for K-Means and FST-K-Means are the mode (most frequently occurring value) over

100 executions. These results indicate that with the exception of the breast-cancer dataset, the

accuracy of FST-K-Means is either the highest value (5 out of 8 datasets) or is comparable to

the highest value. In general, GraClus has lower accuracy than either FST-K-Means or K-Means

with comparable values for only two of the datasets, namely, dna and 180txt.

Table 3.4 shows the improvement in accuracy of FST-K-Means relative to K-Means and GraClus. The improvement in accuracy, as a percentage, of method A relative to method B is defined as $\frac{(P(A) - P(B)) \times 100}{P(B)}$, where $P(A)$ and $P(B)$ denote the accuracies of methods A and B, respectively; positive values represent improvements, while negative values indicate degradation.

We see that the improvement in accuracy for FST-K-Means can be as high as 57.6% relative to

K-Means (for 20news) and as high as 47.7% relative to GraClus (for 300txt). On average, the

accuracy metric obtained from FST-K-Means shows an improvement of 14.9% relative to K-

Means and 23.6% relative to GraClus.
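As a worked instance of this formula using the values in Table 3.2: for 300txt, FST-K-Means attains $P = 95.00$ versus $64.33$ for GraClus, giving $\frac{(95.00 - 64.33)\times 100}{64.33} \approx 47.7\%$, the corresponding entry reported in Table 3.4.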

Comparison of FST-K-Means with K-Means after PCA dimension reduction. Table 3.3 compares the accuracy obtained by K-Means using the top three principal components of the dataset (PCA-K-Means) with FST-K-Means. We observe that in seven out of eight datasets FST-K-Means performs better than PCA-K-Means, and on average it achieves 24.27% better accuracy


Classification accuracy (P)
Dataset         K-Means   GraClus   FST-K-Means
adult a2a       70.60     52.49     74.17
australian      85.51     74.20     85.36
breast-cancer   93.70     69.69     83.16
dna             72.68     70.75     70.75
splice          55.80     53.20     69.90
180txt          73.33     91.67     91.67
300txt          78.67     64.33     95.00
20news          46.74     54.85     73.70

Table 3.2. Classification accuracy of K-Means, GraClus, and FST-K-Means.

relative to PCA-K-Means. This provides empirical evidence that K-Means benefits from the high-dimensional embedding of FST. We also observe that the relative improvement in accuracy between PCA-K-Means and FST-K-Means is particularly high for the text datasets (180txt, 300txt, and 20news). Typically, feature selection in noisy text data is a difficult problem, and the right set of principal components that can provide high accuracy is often unknown [46]. Our feature subspace transformation brings related entities closer; in effect, common features gain more importance than unrelated features.

Additionally, in Figure 3.5 we present a study of the sensitivity of K-Means accuracy to the use of up to 50 principal components. Table 3.3 reports results for the top three principal components, while Figure 3.5 presents the accuracy of PCA-K-Means for up to 50 principal components. It is clear that FST-K-Means performs better than PCA-K-Means.

Cohesiveness. Table 3.5 compares cluster cohesiveness, the internal quality metric across

all three schemes. Once again, the values for K-Means and FST-K-Means are the mode (most


Classification accuracy (P)
Dataset         PCA       FST-K-Means   Relative Improvement (%)
adult a2a       65.81     74.17         +12.70
australian      71.17     85.36         +19.93
breast-cancer   96.05     83.16         -13.42
dna             60.41     70.75         +17.12
splice          65.10     69.90         +7.37
180txt          61.78     91.67         +48.38
300txt          62.00     95.00         +53.22
20news          49.50     73.70         +48.49

Table 3.3. Classification accuracy for PCA (top three principal components) and FST-K-Means.

Relative improvement in accuracy (percentage)
Dataset         vs. K-Means   vs. GraClus
adult a2a       5             41.3
australian      -1            15
breast-cancer   -11           19
dna             -2            0
splice          25.2          31.4
180txt          25.0          0
300txt          20.75         47.7
20news          57.6          34.3
Average         14.9          23.6

Table 3.4. Improvement (positive values) or degradation (negative values), as a percentage, of the accuracy of FST-K-Means relative to K-Means and GraClus.


frequently occurring value) over 100 executions. Recall that a lower value of cohesiveness is

better than a higher value. The cohesiveness measure of FST-K-Means is the lowest in four

out of eight datasets, and it is comparable to the lowest values (obtained by GraClus) for the

remaining 4 datasets.

Table 3.6 shows the improvement in the cohesiveness metric of FST-K-Means relative to K-Means and GraClus. The improvement in cohesiveness, as a percentage, of method A relative to method B is defined as $\frac{(J(B) - J(A)) \times 100}{J(B)}$, where $J(A)$ and $J(B)$ denote the cohesiveness of methods A and B, respectively; positive values represent improvements, while negative values

indicate degradation. FST-K-Means achieves as high as 44.8% improvement in cohesiveness

relative to K-Means and a 37.9% improvement relative to GraClus. On average, FST-K-Means realizes an improvement in cohesiveness of 20.2% relative to K-Means and 6.6% relative to GraClus.

Cluster cohesiveness (J)
Dataset         K-Means     GraClus     PCA         FST
adult a2a       24,013      16,665      15,522      16,721
australian      4,266       3,034       2,791       2,638
breast-cancer   2,475       2,203       980         1,366
dna             84,063      65,035      65,029      65,545
splice          31,883      31,618      30,830      31,205
180txt          25,681      23,776      23,806      24,131
300txt          47,235      44,667      44,539      45,052
20news          3,851,900   3,483,591   3,029,614   3,341,400

Table 3.5. Cluster cohesiveness of K-Means, GraClus, PCA (top three principal components), and FST.


Relative improvement in cohesiveness (percentage)
Dataset         vs. K-Means   vs. GraClus
adult a2a       30.3          -0.33
australian      38.1          13.0
breast-cancer   44.8          37.9
dna             22.0          -0.78
splice          2.1           1.3
180txt          6.0           -1.4
300txt          4.6           -0.86
20news          13.2          4.0
Average         20.2          6.6

Table 3.6. Improvement (positive values) or degradation (negative values), as a percentage, of the cohesiveness of FST-K-Means relative to K-Means and GraClus.

In summary, the results in Tables 3.2 through 3.6 clearly demonstrate that FST-K-Means

is successful in improving accuracy beyond K-Means at cohesiveness measures that are signif-

icantly better than K-Means and comparable to GraClus. The superior accuracy of K-Means

is related to the effectiveness of the distance-based measure for clustering. Likewise, the su-

perior cohesiveness of GraClus derives from using the combinatorial connectivity measure. We conjecture that FST-K-Means derives its superior accuracy and cohesiveness from a successful combination of both distance and combinatorial measures through FST.

3.4 Toward optimal classification

Unsupervised classification ideally seeks 100% accuracy in labeling entities. However,

as discussed in Section 3.3, accuracy is a subjective measure based on external user specifica-

tions. It is therefore difficult to analyze algorithms based on this metric. On the other hand, the


internal metric, i.e., the cluster cohesiveness, provides an objective function that depends only on the coordinates (features) of the entities and centroids. Thus, this metric is more amenable for use in the analysis of clustering algorithms.

Let the dataset be represented by the $N \times R$ sparse matrix $A$, with $N$ entities and $R$ features. Let the $i$-th row vector be denoted by $a_{i,*} = [a_{i,1}, a_{i,2}, \cdots, a_{i,R}]$. Now the global mean of the entities (i.e., the centroid over all entities) is given by the $R$-dimensional vector $\bar{a} = \frac{1}{N}\sum_{i=1}^{N} a_{i,*}$, where the $j$-th component of $\bar{a}$ is $\bar{a}_j = \frac{1}{N}\sum_{i=1}^{N} a_{i,j}$ for $j = 1, \cdots, R$. Let $Y$ be the centered data matrix, with $y_{i,*} = (a_{i,*} - \bar{a})^T$, or equivalently $y_{i,j} = a_{i,j} - \bar{a}_j$ for $i = 1, \cdots, N$ and $j = 1, \cdots, R$.

It has been shown in [47] that for a classification with $k$ clusters, the optimal value of the cohesiveness $J$ can be bounded as shown below, where $N\bar{y}^2$ is the trace of $Y^T Y$ and $\lambda_i$ is the $i$-th principal eigenvalue of $Y^T Y$:

$$N\bar{y}^2 - \sum_{i=1}^{k-1} \lambda_i \;\le\; J \;\le\; N\bar{y}^2.$$
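A small sketch of how these bounds can be computed is shown below, assuming a dense NumPy array for the data; the function name and the dense formulation are illustrative assumptions (the dissertation excludes 20news from this computation because of the cost of the eigenvalue calculation at that scale).

```python
import numpy as np

def cohesiveness_bounds(A, k):
    """Lower and upper bounds on the optimal k-cluster cohesiveness J:
    N*ybar^2 - sum_{i=1}^{k-1} lambda_i <= J <= N*ybar^2,
    where N*ybar^2 = trace(Y^T Y) for the centered data Y."""
    Y = A - A.mean(axis=0)                    # center each feature (column)
    total = np.trace(Y.T @ Y)                 # N * ybar^2 (upper bound)
    eigvals = np.linalg.eigvalsh(Y.T @ Y)[::-1]   # eigenvalues, largest first
    lower = total - eigvals[:k - 1].sum()     # subtract the k-1 principal eigenvalues
    return lower, total
```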

Table 3.7 shows the lower and upper bounds for the cohesiveness measure in the original

feature space along with the range observed for this measure (minimum and maximum values)

over 100 trials of FST-K-Means and K-Means. The bounds for the 20news dataset were excluded due to the computational cost of calculating eigenvalues for data of this size. These results indicate that, unlike K-Means, FST-K-Means always achieves near-optimal values for J and consistently satisfies the bounds for the optimal cohesiveness. Thus FST-K-Means moves

K-Means toward cohesiveness optimality and consequently toward enhanced classification ac-

curacy.


Cluster cohesiveness: FST-K-Means
Dataset         Lower Bound   Min      Max      Upper Bound
adult a2a       15,250        15,866   17,208   17,380
australian      2,442         2,638    3,008    3,493
breast-cancer   747           1,366    1,366    5,169
dna             63,890        65,525   65,865   67,190
splice          30,389        31,205   31,205   31,942
180txt          23,188        23,765   24,178   25,290
300txt          43,708        44,800   45,194   46,512

Cluster cohesiveness: K-Means
Dataset         Lower Bound   Min      Max      Upper Bound
adult a2a       15,250        24,013   24,409   17,380
australian      2,442         4,266    4,458    3,493
breast-cancer   747           2,475    2,475    5,169
dna             63,890        84,062   84,123   67,190
splice          30,389        31,882   31,884   31,942
180txt          23,188        25,651   25,730   25,290
300txt          43,708        47,220   47,288   46,512

Table 3.7. The lower bound, observed range (minimum and maximum), and upper bound of cohesiveness across 100 runs of FST-K-Means (top) and K-Means (bottom) in the original feature space. Observe that FST-K-Means consistently satisfies the optimality bounds, while K-Means fails to do so for most datasets. The highlighted entries in the lower table indicate datasets for which the minimum value of cohesiveness exceeds the upper bound on optimality.


3.5 Chapter Summary

We have developed and evaluated a feature subspace transformation scheme to combine

the advantages of distance-based and graph-based clustering methods to enhance K-Means clus-

tering. However, our transformation is general, and as part of our future work we propose to adapt FST to improve other unsupervised clustering methods.

We empirically demonstrate that our method, FST-K-Means, improves both the accuracy

and cohesiveness relative to popular classification methods like K-Means and GraClus.

We also show that the values of the cohesiveness metric for FST-K-Means consistently

satisfy the optimality criterion (i.e., lie within the theoretical bounds for the optimal value) for

datasets in our test suite. Thus, FST-K-Means can be viewed as moving K-Means toward cohe-

siveness optimality to achieve improved classification.


Fig. 3.2. Plots of classification accuracy and the 1-norm of the feature variance vector across FST iterations. FST iterations are continued until the feature variance decreases relative to the previous iteration.


Fig. 3.3. Layout of entities in splice in the original (top) and transformed (bottom) feature space, projected onto the first three principal components. Observe that the two clusters are more distinct after FST.


Fig. 3.4. Layout of entities in the two-dimensional synthetic dataset in the original (a, top) and transformed (b, bottom) feature space.


Fig. 3.5. Sensitivity of the classification accuracy (P) of K-Means to the number of principal components.


Chapter 4

Similarity Graph Neighborhoods for Enhanced Supervised Classification

Supervised classification using Linear Discriminant Analysis (LDA) [48–50] or support vector machines (SVMs) [30, 31, 33, 34] is an effective machine learning approach to data classification. Supervised learning algorithms such as LDA or SVM utilize similarity in the data as a criterion to build effective classification rules [12]. Such schemes typically use a measure of distance in the high-dimensional space to quantify this similarity [51]. However, when the number of features is large, distance measures can be ambiguous and represent false relationships between entities [52]. In such scenarios, we need to understand the strength of a similarity to be able to classify entities correctly. Usually, the neighborhood of an entity plays a critical role in deciding the strength of a similarity [53]. Similarity between entities is typically represented by a similarity graph whose vertices represent entities and whose edge weights indicate the strength of the similarity between entities. In forming the similarity graph, we face a twofold problem: (i) the similarity graph is dense [23], and (ii) computing the similarity graph is quadratic in the number of observations (N), which is expensive when N is large. Therefore, our objective is to determine a sparse neighborhood of an entity that represents the similarity relationships most relevant for achieving high classification accuracy.

In this chapter, we propose a data transformation scheme to enhance the accuracy of two popular supervised classification schemes, LDA and SVM. We determine a sparse similarity graph G from the labeled training data A to reveal connectivity in the data based on the features.


Subsequently, we utilize local neighborhoods of an entity in this sparse similarity graph neighborhood (SGN) of the training data to transform the high-dimensional feature space and obtain the transformed data $\bar{A}$. The position of each entity is transformed through a resultant displacement computed from its local neighborhood. We train a supervised classifier, namely LDA or SVM, on the transformed data to obtain the classification boundary, which is subsequently used for classification of unlabeled data in the testing phase.

or SVM classifier as SGN-LDA or SGN-SVM, respectively. We evaluate the accuracy of our schemes SGN-LDA and SGN-SVM on a suite of seven well-known UCI datasets [54] and compare them with popular implementations of LDA [55] and SVM [30]. Furthermore, we empirically characterize the sensitivity of classification accuracy to choices of parameter values in SGN-SVM and present measures related to the sparsity of the similarity graph neighborhoods. These

measures indicate that the computational costs of our SGN transformations remain proportional

to the size of the original data set.

The remainder of this chapter is organized as follows. In Section 4.1, we describe the

main steps of our SGN transformation and its application to LDA or SVM. We present and

discuss our empirical results in Section 4.2. Section 4.3 presents related research, and Section 4.4

includes a summary of our findings. We note that the contents of this chapter are motivated by our research in [56].

4.1 Exploiting Similarity Graph Neighborhoods for Enhancing SVM Accuracy

In this section, we develop our main contribution, namely the SGN-LDA and SGN-SVM methods. The key steps in SGN-LDA and SGN-SVM are: (i) determining a parametrizable γ


entity neighborhood in the high-dimensional feature space of the training dataset A through specification of a sparse graph G(B,A), which is our similarity graph; (ii) a training data transformation scheme to obtain $\bar{A}$ from the original A through displacements of entities using γ entity neighborhoods in the similarity graph G(B,A); (iii) training an LDA or SVM classifier on the transformed data matrix $\bar{A}$; and (iv) classifying the test data. These steps are presented in detail in the remainder of this section.

4.1.1 Determining γ-Neighborhoods in Similarity Graph G(B,A)

Consider a training dataset $A \in \mathbb{R}^{n \times r}$ with $n$ entities and $r$ features, represented by an $n \times r$ sparse matrix, whose entities have preassigned labels $t_i \in \{-1, 1\}$, $\forall i = 1, \cdots, n$, where $-1$ and $1$ each represent a class. The $i$-th row $a_i$ of the matrix $A$ represents the feature vector of the $i$-th entity in the dataset. Thus, we can view entity $i$ as being embedded in the high-dimensional feature space (of dimension $\le r$) with coordinates given by the feature vector $a_i$.

We consider the training data $A$ and form a matrix $B \in \mathbb{R}^{n \times n}$ whose nonzero entries express the similarity strength between two entities $a_i$ and $a_j$ in $A$. Each element $b_{ij}$ of $B$ is the similarity strength of the pair $(a_i, a_j)$, computed as shown in Equation 4.1, where $F(a_i, a_j)$ is a similarity function of $a_i$ and $a_j$. As a simple example, $F(a_i, a_j)$ could be based on the Euclidean distance $\|a_i - a_j\|_2$.

$$b_{ij} = F(a_i, a_j) \qquad (4.1)$$


We view the similarity graph $G(B,A)$ as the adjacency graph representation of $B$ obtained from dataset $A$. Computing the similarity function for all pairs $(a_i, a_j)$ results in a dense graph; hence, we maintain the sparsity of $B$ by preserving only those nonzero elements $b_{ij}$ that satisfy the following two conditions. First, $|b_{ij}|$ is greater than $\phi_i$, where $\phi_i$ is a threshold related to the mean over all elements of row $i$ of $B$. Second, the entities associated with $b_{ij}$, i.e., $a_i$ and $a_j$, have $q$ or more features in common. With appropriate choices of $\phi$ and $q$, the matrix $B$ is relatively sparse. It can then be stored in the compressed sparse row (CSR) format [21], which stores only the nonzero edges, so that the neighborhood list of an entity can be located in constant, $O(1)$, time.
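As an illustration of this storage choice, the sketch below builds a SciPy CSR matrix for a small similarity matrix and reads off a row's neighbors through the index pointers; the toy values are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, already-thresholded similarity matrix B (toy values).
B = csr_matrix(np.array([[0.0, 0.8, 0.0, 0.3],
                         [0.8, 0.0, 0.5, 0.0],
                         [0.0, 0.5, 0.0, 0.0],
                         [0.3, 0.0, 0.0, 0.0]]))

i = 1
start, end = B.indptr[i], B.indptr[i + 1]   # row i is located in O(1) time
neighbors = B.indices[start:end]            # column indices of nonzeros in row i
weights = B.data[start:end]                 # the corresponding similarities b_ij
print(neighbors, weights)                   # e.g., [0 2] [0.8 0.5]
```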

In practice, $F(a_i, a_j)$ can be replaced by an appropriate domain-specific similarity measure that models similarity in the training data better. For example, in text mining, document similarity is typically represented as an inner product [12]. We therefore use the generic term “similarity” in our description of the method, as our SGN transformation is not limited to any particular similarity measure. Figure 4.1 illustrates the process of constructing the neighborhood graph using a simple example.

Next, we define the entity neighborhood of $a_i$ in the similarity graph $G(B,A)$ as the set of entities $a_j$ such that $a_i$ and $a_j$ share an edge in $G(B,A)$. We further generalize the notion of neighborhood by defining a reach set, $\gamma$-neighbor($a_i$), for an entity $a_i$ as follows:

$$\gamma\text{-neighbor}(a_i) = \{\, a_j : \mathrm{reach}(a_i, a_j) \le \gamma,\ \forall j = 1, \ldots, n \,\}. \qquad (4.2)$$


In the equation above, $\mathrm{reach}(a_i, a_j) \le \gamma$ indicates that there is a path from $a_i$ to $a_j$ in $G(B,A)$ whose length is no more than $\gamma$. Increasing the reach with a higher $\gamma$ typically results in larger neighborhoods; $\gamma$ is viewed as a tunable parameter in our method. Figure 4.2 shows a sample $\gamma = 1$ entity neighborhood of an entity $a_i$.

Fig. 4.1. Forming the sparse γ-neighborhood similarity graph G(B,A) from A. F(A) represents the transformation described in Section 4.1.1. G(B,A) is a weighted graph representation of matrix B.
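The sketch below illustrates one way to build such a thresholded similarity matrix B. The per-row mean threshold and the shared-feature count q follow the description above, while the specific RBF-style similarity function and the dense all-pairs construction are illustrative assumptions rather than the dissertation's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_similarity_graph(A, q=1):
    """Sparse similarity matrix B for row-wise entities in A (n x r).

    Keeps b_ij only if (i) b_ij exceeds the mean similarity of row i, and
    (ii) entities i and j share at least q nonzero features.
    Uses exp(-||a_i - a_j||^2) as an example similarity F.
    Note: forms all pairs (O(n^2 r)) for clarity; a real implementation
    would avoid materializing the dense similarity matrix."""
    sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    B = np.exp(-sq_dists)
    np.fill_diagonal(B, 0.0)                  # no self-edges
    nz = (A != 0).astype(int)
    shared = nz @ nz.T                        # number of common nonzero features
    row_mean = B.mean(axis=1, keepdims=True)  # per-row threshold phi_i
    keep = (B > row_mean) & (shared >= q)
    return csr_matrix(np.where(keep, B, 0.0))
```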

4.1.2 Transforming Training Data through Entity Displacement Vectors.

In this step, we consider transforming the training data $A$ to $\bar{A}$ using a vector measure obtained from $\gamma$-neighborhoods in $G(B,A)$. Each entity $a_i \in \mathbb{R}^r$ is viewed as embedded in an $r$-dimensional feature space, where $r \ge 2$. The geometric information for $a_i$ is obtained from $A$, where $a_{ik}$ represents the coordinate of the $i$-th entity in the $k$-th dimension ($k \le r$). Using this information, we first construct the displacement vector $\Delta_{ij}$ between $a_i$ and $a_j$, $\forall a_j \in \gamma$-neighbor($a_i$). The displacement vector $\Delta_{ij}$ is computed as the difference between


the vectors $a_j$ and $a_i$. Equation 4.3 gives the displacement vector for the entity pair $(a_i, a_j)$:

$$\Delta_{ij} = a_j - a_i \qquad (4.3)$$

We multiply the displacement vector by the product $t_i t_j$, where $t_i, t_j \in \{-1, 1\}$ indicate the classes of $a_i$ and $a_j$, respectively. Equation 4.4 represents the modified displacement vector:

$$\Delta_{ij} \leftarrow t_i\, t_j\, \Delta_{ij} \qquad (4.4)$$

When $a_i$ and $a_j$ belong to the same class, the product $t_i t_j$ equals $1$, while $t_i t_j$ equals $-1$ when $a_i$ and $a_j$ belong to different classes. Therefore, entities belonging to the same class are displaced towards each other, and entities belonging to different classes are displaced away from each other.

We compute the resultant displacement vector for entity $a_i$ as the weighted sum of all displacement vectors $\Delta_{ij}$, $j \in \gamma$-neighbor($a_i$). Thus the resultant displacement $\rho_i$ is

$$\rho_i = \sum_{j \in \gamma\text{-neighbor}(a_i)} b_{ij}\, \Delta_{ij}. \qquad (4.5)$$

For example, in Figure 4.2, $\rho_i$ equals $b_{ik}\Delta_{ik} + b_{ij}\Delta_{ij} + b_{iu}\Delta_{iu} + b_{ip}\Delta_{ip} + b_{iq}\Delta_{iq}$ for $\gamma$ equal to 1. For $\gamma > 1$, we would also include in the resultant calculation other entities that are within a path length of $\gamma$ or less from $a_i$.

Subsequently, we seek to update the position of $a_i$ in the direction governed by the unit vector $\hat{\rho}_i = \rho_i / \|\rho_i\|_2$. The resultant direction for $a_i$ typically points towards the entities with


stronger similarity to $a_i$; therefore, we claim that this is the direction in which similarity between related entities increases. We also introduce a normalizing factor $K = \sqrt{N}$ to scale $\rho_i$ so that the step size for updating $a_i$ does not grow arbitrarily. We compute the step size $\beta_i$ by dividing the magnitude of the resultant displacement $\rho_i$ by the normalizing factor $K$, as shown in Equation 4.6:

$$\beta_i = \frac{\|\rho_i\|_2}{K} \qquad (4.6)$$

Therefore, $\beta_i$ is essentially the magnitude of the displacement vector scaled by the constant $K$ to keep the displacement from growing arbitrarily large.

We update the position of entity $a_i$ in the direction of increasing similarity $\hat{\rho}_i$ with a step size of $\beta_i$. Equation 4.7 gives the updated position $\bar{a}_i$ of entity $a_i$:

$$\bar{a}_i = a_i + \beta_i\, \hat{\rho}_i \qquad (4.7)$$

In our formulation, similar entities are moved closer to each other at a high rate; therefore, we expect an enhanced separation amongst unrelated entities. Furthermore, similar entities move close to each other while maintaining their positions relative to each other.

Figure 4.3(a) shows the result of our SGN data transformation scheme when applied to a simple graph. Figure 4.3(b) illustrates obtaining the transformed dataset $\bar{A}$ from the positions in the transformed high-dimensional feature space.
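Putting Equations 4.3 through 4.7 together, a compact sketch of the transformation is given below. It assumes the thresholded similarity matrix B from Section 4.1.1, restricts itself to γ = 1 neighborhoods, and uses dense NumPy arrays, so it is an illustration of the update rule rather than the dissertation's MATLAB implementation.

```python
import numpy as np

def sgn_transform(A, B, t, K=None):
    """SGN displacement update for gamma = 1 neighborhoods (sketch).

    A : n x r training data (rows are entities a_i)
    B : n x n thresholded similarity matrix (entries b_ij), dense or sparse
    t : length-n array of labels in {-1, +1}
    """
    n = A.shape[0]
    K = np.sqrt(n) if K is None else K
    B = np.asarray(B.todense()) if hasattr(B, "todense") else np.asarray(B)
    A_bar = A.astype(float).copy()
    for i in range(n):
        rho = np.zeros(A.shape[1])
        for j in np.nonzero(B[i])[0]:
            delta = (A[j] - A[i]) * (t[i] * t[j])   # Eqs. 4.3 and 4.4
            rho += B[i, j] * delta                   # Eq. 4.5
        norm = np.linalg.norm(rho)
        if norm > 0:
            beta = norm / K                          # Eq. 4.6
            A_bar[i] = A[i] + beta * (rho / norm)    # Eq. 4.7
    return A_bar
```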


4.1.3 Training an LDA or SVM on Transformed Data.

We consider the process of training two supervised classifiers, i.e., LDA and SVM, on the transformed data.

LDA training. We obtain the class mean and covariance parameter estimates from the transformed data. These estimates are required to construct the discriminant function for the LDA classifier.

SVM training. We train an SVM [30] (as shown in Equations 2.2 and 2.3) on $\bar{A}$ to produce separating hyperplanes characterized by a set of support vectors $S$. The set $S$ is a subset of the transformed training data $\bar{A}$ and marks the separating boundary between the two classes. Typically, the size of the set $S$ is small relative to $n$, the size of the training set $A$.

Using matrix notation, we rewrite Equation 2.3 as shown in Equation 4.8, where $Q \in \mathbb{R}^{n \times n}$ is typically referred to as the kernel matrix [12], $c = \{c_i : c_i \in (0, 1),\ \forall i = 1 \ldots n\}$, and $\alpha$ is the vector of Lagrange multipliers [57]. Equation 4.8 maximizes a quadratic objective with respect to the Lagrange multipliers $\alpha$:

$$\max_{\alpha \ge 0} \; \frac{1}{n}\, c^T \alpha - \frac{1}{2}\, \alpha^T Q\, \alpha \qquad (4.8)$$

In Equation 4.8, the $(i,j)$-th entry $q_{ij}$ of $Q$ is often the inner product of entities $a_i$ and $a_j$ ($q_{ij} = a_i \cdot a_j$). However, $q_{ij}$ can be formed using other functions of $a_i$ and $a_j$, such as the radial basis kernel, among many others. This technique, referred to as the kernel trick [12], is useful in many situations where we have prior knowledge about the application that generates the data.

Our selection of the similarity function $F(a_i, a_j)$ (or equivalently $b_{ij}$) depends on the choice of SVM kernel function. This ensures that the transformation uses the same


similarity measure (kernel) as used by the SVM optimization problem to avoid any form of

ambiguity. Therefore, our transformation seeks to transform the entity space such that it directly

impacts the kernel matrix Q in the SVM formulation.

4.1.3.1 An Example of SGN Transformation using the Radial Basis Function

In the context of SVM learning, our SGN data transform modifies $Q$ by changing the positions of entities in the high-dimensional feature space. Our SGN transformation brings related entities closer, thereby decreasing the quantity $\|a_i - a_j\|_2$. To explain the impact of SGN on $Q$, we consider the case where a radial basis function (RBF) kernel is used to form $Q$. Equation 4.9 presents an example in which each element of matrix $Q$ is computed using a simple radial basis function:

$$q_{ij} = e^{-\|a_i - a_j\|_2^2} \qquad (4.9)$$

We observe that as $\|a_i - a_j\|_2 \to 0$ the quantity $q_{ij} \to 1$, and as $\|a_i - a_j\|_2 \to \infty$ the quantity $q_{ij} \to 0$. Modifying the entity positions using SGN therefore impacts the properties of $Q$ and the convex optimization problem defined in Equation 4.8. Figure 4.4 illustrates the SVM training process using our simple example.
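As a small numeric illustration of this effect (the distances here are made up for the example): a pair of entities at distance $\|a_i - a_j\|_2 = 2$ contributes $q_{ij} = e^{-4} \approx 0.018$, whereas moving them to distance $0.5$ raises the entry to $q_{ij} = e^{-0.25} \approx 0.78$, so bringing related entities closer drives their kernel entries toward 1.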

4.1.4 Classifying Test Data

In the case of LDA, the discriminant function that determines the separating boundary is

used to classify test data. In a two-class problem, a test data sample is assigned to the second

class if the log-likelihood ratio (as shown in Equation 2.1) is below a certain threshold τ .


In the case of SVM, the support vectors obtained from the transformed training data $\bar{A}$ are directly mapped back to the corresponding entities in the original training data $A$. These mapped entities form the new set of support vectors for the training data $A$. For the testing phase, consider a test set $X \in \mathbb{R}^{n \times r}$. The similarity $F(x_i, s_j)$ of a test sample $x_i$ is computed with each of the $k$ modified support vectors $S = \{s_j\},\ j = 1 \ldots k$. Since the set of support vectors is small and sparse, computing these similarities is inexpensive. We determine the support vector with maximum similarity to $x_i$ and assign $x_i$ the same class as that support vector.

4.2 Evaluation and Discussion

In this section, we first describe our experimental setup, followed by an evaluation of SGN-LDA and SGN-SVM on an artificial dataset and then on a set of benchmark datasets. We also report on the sensitivity of our method to the input parameter γ (the number of levels in G(B,A) used to determine the neighborhood), the sparsity of B, and the dimensions used to form the dataset.

4.2.1 Experimental Setup and Metrics for Evaluation.

We implemented our SGN scheme using Matlab [55], and a sample implementation is

available for academic purposes. For our experiments, we use radial basis similarity for building

the neighborhood similarity graph. The threshold $\phi_i$ is set for each entity $a_i$ as the mean of all neighbor similarities, and any radial basis similarity less than this value is considered to be zero. We use the LDA implementation provided in the Matlab Statistics Toolbox [55] and SVM-Perf [30], a state-of-the-art two-class support vector machine, for obtaining the SVM hyperplane.

The base case, that is, training LDA and SVM-Perf on the original training data and recording


the accuracy of classification is referred to as LDA and SVM, respectively. The training error

bound parameter is set to 20. Finally, we use classification accuracy and F1-Score as metrics for

comparing the potential benefits of SGN-SVM over SVM for six benchmark datasets described

in Table 4.3. The classification accuracy is computed as the percentage of correctly classified

test data samples out of the total testing data. The F1-Score is computed as the harmonic mean

of the precision and recall.

4.2.2 Artificial Dataset Results

We now discuss the impact of SGN-LDA and SGN-SVM using the fourclass dataset [58] as an example. This dataset, created by Kleinberg et al. [58], consists of artificially created data in two dimensions, with each attribute linearly scaled to [−1, 1]. The dataset, which originally had four classes, was transformed into two classes that are not easily separable, as they are irregularly spread over the space. Table 4.1 provides details of the dataset, and Table 4.2 presents the accuracy obtained by LDA, SGN-LDA, SVM, and SGN-SVM on the fourclass dataset.

Dataset      Observations   Features
fourclass    862            2

Table 4.1. Details of the fourclass dataset.


Dataset      LDA     SGN-LDA   SVM     SGN-SVM
fourclass    73.20   77.91     75.87   94.47

Table 4.2. Classification accuracy (percentage) of LDA, SGN-LDA, SVM, and SGN-SVM for the fourclass dataset.

4.2.3 Empirical Results on Benchmark Datasets.

We use a suite of six benchmark datasets chosen from the UCI repository [54] for our

experiments. Table 4.3 presents the datasets with the numbers of observations and features in

each dataset. Each dataset is randomly split to form data for training (60%) and for testing

(40%).

Dataset      Observations   Features
australian   690            14
heart        270            13
liver        345            6
sonar        208            60
splice       1,000          60
german       1,000          24

Table 4.3. Description of the UCI datasets.

Tables 4.4 and 4.5 present classification accuracy and F1-Score results for our six bench-

mark datasets. Figures 4.5 and 4.6 present percentage improvements in classification accuracy

obtained using SGN-LDA and SGN-SVM on the artificial and benchmark datasets. SGN-LDA

improves LDA classification accuracy on average by 5% and the F1-Score by 3.44%. SGN-SVM performs on par with or better than SVM in five out of six datasets, with an average improvement in


             Accuracy             F1-Score
Dataset      LDA     SGN-LDA     LDA     SGN-LDA
australian   86.23   89.86       86.74   89.91
heart        84.63   87.96       84.27   88.13
liver        64.93   67.25       64.84   66.40
sonar        68.67   74.70       70.58   74.80
splice       78.25   80.50       78.33   80.50
german       71.75   75.50       70.67   71.30

Table 4.4. Classification accuracy (as a percentage) and F1-Score for LDA and SGN-LDA on the benchmark datasets.

accuracy of 4.52% and an improvement in F1-Score of 8.09%. In the case of the australian dataset, SVM outperforms our SGN-SVM. This dataset has a heavily unequal distribution of labeled data, with one class much more heavily represented than the other. This adversely affects the support vectors that SGN-SVM learns, as the transformation becomes biased towards a particular class of data.

4.2.4 Why does classification improve with our SGN transformation?

Based on our empirical evaluation of SGN-LDA and SGN-SVM presented in Section 4.2,

we observe that our SGN transformation improves classification accuracy. We intuitively believe

that our SGN transformation could potentially help separate clusters that appear very close in a high-dimensional feature space. In this regard, the question we pose is: does increasing the separation between the class centers improve classification accuracy?

Earlier, Hopcroft et al. [59] showed that two points generated by the same Gaussian process in $d$ dimensions would be a distance $\sqrt{2d}$ apart, and the distance between two points belonging to different Gaussian processes would be $\sqrt{2d + \delta^2}$, where $\delta$ is the separation between


             Accuracy             F1-Score
Dataset      SVM     SGN-SVM     SVM     SGN-SVM
australian   85.87   82.25       85.88   82.15
heart        77.78   79.63       77.51   78.94
liver        57.25   63.62       50.26   62.82
sonar        79.52   86.27       81.75   86.63
splice       73.50   73.75       69.54   74.62
german       68.75   69.50       61.51   63.86

Table 4.5. Classification accuracy (as a percentage) and F1-Score for SVM and SGN-SVM on the benchmark datasets.

the two centers. We conjecture that our SGN transformation increases the separation between the means of the different classes, thus increasing the initial separation $\delta$ to a larger value.

We first demonstrate this using the artificially generated fourclass dataset with respect

to SGN-LDA and SGN-SVM. Figure 4.7 illustrates the separation produced in the training data

by our SGN transformation and the corresponding change in LDA decision boundary. Observe

that elements within a class approximately maintain the same relative positions. This is evidence that our method constructively modifies the training data to obtain better separation and hence better support vectors. Figures 4.8(a) and (c) (top left and bottom left) illustrate the original training data, the original testing data, and the support vectors obtained by SVM. The two classes are represented by circles and crosses, with (green, blue) representing training data and (magenta, cyan) representing testing data. The black markers represent the support vectors. Figures 4.8(b) and (d) (top right and bottom right) illustrate the original training data, the original testing data, and the new

support vectors obtained by SGN-SVM. We observe that our SGN-SVM captures the separating

boundary better than SVM, and our SGN-SVM finds support vectors in critical regions of the


dataset where testing data are prone to be misclassified. Additionally, SGN-SVM also identi-

fies regions where the data are primarily from one class and few support vectors are needed in

this region of the data. In the case of SGN-SVM, the separation of the two classes of data affects the similarity values between different pairs of samples, consequently changing the decision boundary. LDA, in contrast, relies on the sample covariance and mean of the training data. To explore this in greater detail, we construct training and testing sets from two Gaussian processes in four dimensions. Figure 4.9 visualizes the data in two principal component directions. Additionally, we show the heatmap of the correlation matrix of the four features. We observe that for certain pairs of features (indicated by dashed boxes), the correlation between the features changes during the SGN transformation, indicating that some features become more influential than others, which explains the change in decision boundary.

4.2.5 Sensitivity of SGN-SVM Classification Accuracy

We study the effect of varying the value of SGN input parameters on the classification

accuracy of SVM.

4.2.5.1 Effect of Growing or Shrinking the Neighborhood on Classification Accuracy

We study the effect of increasing the neighborhood levels γ on the classification accuracy.

Figure 4.10(a) presents, for three datasets, the accuracy of classification as the γ parameter

increases. We observe that accuracy improvements are obtained with a relatively small number

of levels γ in the range 1 to 3 in G(B,A). This observation indicates that the local structure in a

dataset can lead to improved training and thus improved classification.


4.2.5.2 Effect of Changing Sparsity of G(B,A) on Classification Accuracy

We next study the effect that changing the sparsity of B, i.e., making B sparser, has on the classification accuracy for three datasets from our benchmark suite.

In our similarity graph G(B,A), an edge exists between entities $a_i$ and $a_j$ only if $q$ or

more common features are shared between them. As expected, in Figure 4.10(b) we observe that

the number of nonzeros in B, relative to those in A, decreases as q increases. That is, B becomes

sparser for larger values of q. Plots in Figure 4.10(c) indicate that even with higher values of q,

and thus a sparser B, the accuracy of SGN-SVM is not adversely affected.

These observations indicate that our SGN-SVM is not overly sensitive to relatively small

parameter changes.

4.3 Related Research

In this section, we briefly present an overview of related research that uses the influence of node neighborhoods in graphs to improve classification. Chakrabarti et al. [60, 61] propose a greedy graph labeling algorithm that iteratively corrects the labeling by considering the

neighborhood around nodes. In an initial phase, it computes class probabilities for a node us-

ing a Naive Bayes classifier and then re-evaluates the probabilities after the probabilities of the

neighboring nodes are known. Angelova et al. [62] extend and generalize this method using the

theory of Markov random fields to propose a relaxation-based graph labeling technique. This

technique uses different levels of trust in different neighbors by assigning node weights based

on the similarity between neighbors.


Oh et al. [63] propose a method that assigns labels to each node of a graph by considering

the popularity of the node among all of its immediate neighbors.

In a regression model proposed by Lu and Getoor [64], the text features of a document

are used in combination with the labels of immediate neighbors to improve classification. The

feature vector for a node is modified using feature information from either all neighboring nodes

or the most popular node amongst the neighbors.

Castillo et al. [65] address the problem of web spam detection in which they improve

classification accuracy using dependencies among labels of neighboring hosts in the web graph.

Our SGN data transformation derives motivation from the earlier work above and proposes a novel entity feature update technique based on resultant displacements computed using neighborhood connectivity. Additionally, we unify our approach with the learning phase of an LDA or

SVM classifier to improve classification.

4.4 Chapter Summary

In this chapter, our main contribution is a γ-neighborhood based training data transforma-

tion scheme (SGN) using similarity graphs to displace entities in high-dimensional feature space.

We unify our SGN data transformation with the learning phase of either a linear discriminant

analysis classifier or a support vector machine to form SGN-LDA or SGN-SVM, respectively.

We show that, on average, our SGN-LDA obtains 5.00% better accuracy than traditional LDA, and SGN-SVM obtains 4.52% better accuracy than traditional SVM on a set of seven datasets.

Additionally, we present analysis on two aspects of our algorithm: (i) the effects of growing

or shrinking the neighborhood size γ, and (ii) the effects of sparsity of the γ-neighborhoods on


classification accuracy. Our analysis shows that our SGN-SVM algorithm improves accuracy

even with relatively sparse similarity graphs and small γ-neighborhoods.


Fig. 4.2. Entities $a_p$, $a_q$, $a_u$, $a_j$, and $a_k$ represent the immediate neighbors of $a_i$. $\Delta_{ij}$ denotes the difference between the two entity vectors $a_i$ and $a_j$.


Fig. 4.3. (a) Transforming the training data A using G(B,A) to a new graph $G(B, \bar{A})$. (b) We obtain the new entity coordinates $\bar{A}$ from the graph $G(B, \bar{A})$.

Fig. 4.4. Training an SVM on the transformed data matrix $\bar{A}$ to obtain separating hyperplanes (shown in white).


Fig. 4.5. Percentage improvements in accuracy and F1-score with SGN-LDA over LDA.


Fig. 4.6. Percentage improvements in accuracy and F1-score with SGN-SVM over SVM.


Fig. 4.7. Illustration of the fourclass dataset after performing our SGN transformation. Observe that the two classes separate out while elements maintain their relative positions within their class. A better separating boundary is obtained on this transformed data using LDA (as shown above).


Fig. 4.8. Support vectors obtained (a) using SVM and (b) using SGN-SVM training for the fourclass dataset; the training data show the two classes, with support vectors in black. Illustration of (c) the SVM separating plane and (d) the SGN-SVM separating plane on the testing data.


Fig. 4.9. Illustration of two four-dimensional Gaussian processes (a) before and (b) after the SGN transformation, projected onto two PCA dimensions for visualization. In both cases, the boundary is obtained using LDA. Alongside are the correlation matrices before and after the transformation. The dashed boxes indicate the feature pairs that showed a significant change in correlation value after the SGN transformation.


Fig. 4.10. (a) Effect of increasing the entity neighborhood parameter γ in G(B,A) (considering larger neighborhoods) on classification accuracy. When the sparse neighborhood is a good approximation of similarity, adding elements to the neighborhood does not alter classification accuracy. Effect of varying the number of common features q used to create the similarity graph B on (b) the sparsity of B relative to A and (c) the overall classification accuracy.

Part II

Scalable Geometric Embedding and Partitioning for Parallel

Scientific Computing



Chapter 5

Background on Sparse Graph Partitioning and Sparse Linear System Solution

In this part of the dissertation, we utilize geometric and combinatorial properties of sparse graphs to improve the quality and performance of parallel scientific computing applications. In particular, we consider the key problem of parallel sparse graph partitioning and its application in

designing a hybrid solver for sparse linear systems with multiple right-hand sides. In Chapter 6

we develop a scalable parallel geometric partitioner for sparse graphs using a tree-structured

geometric embedding and parallel partitioning approach. In Chapter 7 we show that our tree-

structured geometric embedding and partitioning approach can be used to obtain a fill-reducing

ordering for sparse linear systems. Additionally, in Chapter 7 we show that such a tree-structured

reordering enables developing a direct-iterative hybrid solution approach that improves solver

performance.

The rest of this chapter presents background material required to understand Chapters 6

and 7. Section 5.1 presents the notation used in Chapters 6 and 7. Section 5.2 presents background on graph embedding and partitioning. Section 5.3 presents a brief overview of sparse direct and

iterative solvers.


5.1 Notation

We represent a sparse graph with vertex set V and edge set E as G(V,E). The sizes of the vertex set and edge set are denoted |V| and |E|, respectively. A vector quantity is represented as v, a scalar quantity as f, and the two-norm of a vector v as ||v||.

5.2 Graph Embedding and Graph Partitioning

We first provide a brief background on graph embedding and graph partitioning for Chap-

ter 6. Subsequently, we introduce the Charm++ parallel framework that is used to develop our

parallel geometric graph embedding and partitioning scheme.

5.2.1 Graph Embedding

In an early work, Eades [66] provided a heuristic for drawing graphs using the force-directed approach. Kamada and Kawai [67] presented a force-directed approach in which all vertices are connected by springs with lengths proportional to the graph distance between the vertices. Fruchterman and Reingold [68] use two types of forces in their method: edges in the graph are represented by springs and exert attractive forces, while vertices represent charged particles and experience repulsion. Battista et al. [69] provide a detailed overview of graph embedding algorithms in their work. In more recent work, Hu [70] proposed a Barnes-Hut structure for the repulsive force computation in the spring-electric model that reduces the computational cost from $O(|V|^2)$ to $O(|V|\log|V|)$. In an interesting work, Walshaw [71] proposed a multilevel solution to the graph embedding problem. Harel and Koren [72] have also proposed


a graph embedding algorithm for high-dimensional spaces. Godiyal et al. [73] proposed a fast

implementation of the spring-electric model on GPU architectures using multipole expansions.

5.2.2 Graph Partitioning

Graph partitioning schemes have been studied under three main categories, namely, spectral schemes, multilevel combinatorial schemes, and geometric schemes. In an early work, Lipton and Tarjan [74] proposed a separator theorem for partitioning planar graphs. Hendrickson and Leland [75, 76] proposed a multilevel spectral-bisection-based graph partitioning algorithm that recursively divides the graph based on eigenvalues and eigenvectors (the Fiedler vector). Karypis and Kumar [77–79] proposed Metis, a multilevel scheme for partitioning irregular graphs that implements a V-cycle comprising a coarsening of the source graph, followed by an initial separation and then successive refinement [80] and projection. This method was later implemented for distributed systems as ParMetis. Heath and Raghavan [17] proposed a geometric nested-dissection-based partitioning scheme for sparse graphs. Miller et al. [16] proposed a geometric partitioning scheme that projects the data to a higher-dimensional coordinate space, obtains a separation in this space, and projects the separator back to the original space. Additionally, they proved a theorem showing that the separator obtained in the higher-dimensional space has better separation quality.

5.2.3 The Charm++ Parallel Framework

The Charm++ framework for large scale parallel application development provides ab-

stractions that an application programmer can use to decompose the underlying problem into a


set of objects and subsequently model the interactions between them. Objects in a Charm++ system, also called chares, are mapped by the runtime system to physical processors. This allows the creation of a large number of chares, independent of and typically much larger than the number of processors, thus facilitating processor virtualization. Additionally, the Charm++ runtime system provides support for sophisticated runtime prefetching and load balancing.

5.3 Sparse Linear Solvers

In this section, we first provide an overview of sparse direct and iterative schemes for

solving sparse linear systems. Subsequently, we provide an overview of the Incomplete Cholesky

preconditioning scheme.

5.3.1 Sparse Direct Solvers

Sparse direct solvers [81–85] compute robust solutions to linear systems through sparse matrix factorizations. During factorization, fill-in occurs when zeros in the coefficient matrix become nonzeros. The efficiency of a sparse direct solver is primarily determined by its ability to control and manage fill-in. Typically, matrix reorderings are applied prior to factorization to preserve the sparsity of the resulting factor [86–88]. However, the overall memory and computational costs can grow superlinearly with the dimension of the matrix, making direct solution unsuitable for three-dimensional models [85].

5.3.2 Preconditioned Conjugate Gradient

The Conjugate Gradients (CG) [89] method is a popular iterative method used to solve

linear systems where the coefficient matrix A is symmetric positive definite. In practice, CG


works more effectively when used with a preconditioning scheme to improve convergence. The convergence of CG is governed by the conditioning [90–92] of the matrix A; that is, in spite of theoretical guarantees on convergence [93], CG can fail in practice if A is ill-conditioned. Preconditioning schemes [19, 90, 94–96] can improve overall convergence by representing the original linear system Ax = b as M^{-1}Ax = M^{-1}b, where M is the preconditioner matrix. The quality of a preconditioner depends mainly on the ease of its construction, the cost of its application, and its ability to accelerate CG convergence.

5.3.3 Incomplete Cholesky Preconditioning

The Incomplete Cholesky (IC) [19, 97] preconditioner uses an incomplete factor L̃ computed as a sparse approximation to the Cholesky factor L, where A = LL^T. The Cholesky factor L is substantially dense because it incurs fill-in, i.e., entries that are zero in the original matrix A become nonzero during factorization [85]. The incomplete factor L̃ is obtained by limiting fill-in in L using one of two popular methods: (i) Incomplete Cholesky with level-of-fill (IC(k)), and (ii) Incomplete Cholesky with drop threshold (ICT). An IC(k) level-of-fill preconditioner [19] retains those nonzeros in the factor that are within a path length of k + 1 from any node in the adjacency graph representation of A. An ICT preconditioner [19] uses a drop threshold δ to decide which elements of the factor to drop; for example, for δ = 10^{-2}, all nonzero elements of magnitude less than δ are dropped to form L̃. The ICT preconditioner provides greater flexibility in creating an incomplete Cholesky preconditioner. The IC preconditioner is applied within CG through two successive triangular solutions in every CG iteration.
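As a concrete illustration of this application step (a sketch under our own assumptions, using SciPy rather than any solver referenced in this dissertation), the preconditioner solve z = (L̃ L̃^T)^{-1} r amounts to one forward and one backward triangular solve with the incomplete factor; L_ic below is assumed to be a precomputed sparse lower-triangular incomplete Cholesky factor.

from scipy.sparse.linalg import spsolve_triangular

def apply_ic_preconditioner(L_ic, r):
    """Apply z = (L L^T)^{-1} r using the incomplete Cholesky factor L_ic."""
    L = L_ic.tocsr()
    # Forward solve: L y = r
    y = spsolve_triangular(L, r, lower=True)
    # Backward solve: L^T z = y
    z = spsolve_triangular(L.T.tocsr(), y, lower=False)
    return z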


Algorithm 2 procedure PCG(A, b, M, x_0, tol)
  r_0 = b − A x_0
  z_0 = M^{-1} r_0
  p_0 = z_0
  k = 0
  repeat
    α_k = (r_k^T z_k) / (p_k^T A p_k)
    x_{k+1} = x_k + α_k p_k
    r_{k+1} = r_k − α_k A p_k
    if ||r_{k+1}|| < tol then
      exit
    end if
    z_{k+1} = M^{-1} r_{k+1}
    β_k = (z_{k+1}^T r_{k+1}) / (z_k^T r_k)
    p_{k+1} = z_{k+1} + β_k p_k
    k = k + 1
  until ||r_{k+1}|| < tol
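For reference, a minimal NumPy transcription of Algorithm 2 might look as follows; the preconditioner is supplied as a callable apply_M_inv that returns M^{-1} r, which is an assumption of this sketch rather than part of the original pseudocode.

import numpy as np

def pcg(A, b, apply_M_inv, x0, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradients following Algorithm 2."""
    x = x0.copy()
    r = b - A @ x
    z = apply_M_inv(r)
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return x
        z_new = apply_M_inv(r_new)
        beta = (z_new @ r_new) / (z @ r)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x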


5.4 Chapter Summary

In this chapter, we presented background on specific scientific computing algorithms.

Section 5.2.3 provides an overview of the Charm++ system. In Sections 5.3.1, 5.3.2, and 5.3.3

we provided a brief overview of sparse direct methods, the preconditioned conjugate gradient

scheme, and incomplete Cholesky preconditioning.


Chapter 6

Parallel Geometric Partitioning through Sparse Graph Embedding

Many problems in large scale scientific simulations involving partial differential equa-

tions can be modeled using sparse graphs. Typically, solving such problems at scale in a dis-

tributed environment requires a domain decomposition scheme that can reduce communication

across processing elements. Sparse graph partitioning, a key step in domain decomposition,

seeks to divide the sparse representation of a computationally intensive problem into a set of

subproblems with low interdependencies.

Graph partitioning research can be grouped into three broad categories: (i) geometric techniques [16, 98], (ii) combinatorial multilevel techniques [77–79, 99, 100], and (iii) parallel distributed graph partitioning techniques [78, 101] that balance tradeoffs between quality, parallelism, and performance. Parallel combinatorial schemes are effective in reducing edge cuts

between partitions but scaling them in the petascale and exascale regimes with thousands of pro-

cessors can be a challenging problem. Alternatively, geometric schemes are considered highly

data parallel and amenable to scaling, but their partition quality is largely dependent on the initial

coordinate layout of the graph. Although modeling and simulation applications involving finite elements or finite differences possess an underlying vertex coordinate structure, such a structure is not necessarily available for sparse graphs arising from other applications. Thus, existing geometric partitioning methods cannot be applied to applications that do not possess an underlying geometry for the

sparse graph. The goal of our framework is to develop a scalable graph partitioning scheme for a


generic class of graph problems by combining sparse graph embedding with parallel geometric

partitioning.

In this chapter, we develop ScalaPart, a parallel graph embedding enabled geometric par-

titioning scheme using the Charm++ [15] parallel programming system. In particular, ScalaPart

is a tree-structured graph embedding combined with a geometric partitioning scheme designed to

address the challenge of scalable graph partitioning in multicore, multiprocessor environments.

First, ScalaPart obtains a coordinate structure for a given sparse graph through a tree-structured

parallel graph embedding. Subsequently, this embedding is provided as input to a data parallel

geometric partitioning scheme that seeks to partition G into k subdomains, such that the number of edges connecting the subdomains G_1, . . . , G_k is reduced. In our approach, we augment our par-

allel embedding with our Charm++ data parallel implementation of the geometric partitioning

in [98]. We compare the quality and performance of our method to two popular parallel graph

partitioning schemes, namely PT-Scotch [100] and ParMetis [77–79], on 16 and 32 cores for

a suite of benchmark graphs. Our results indicate that, on average, we improve partition quality by 7.4% relative to ParMetis and are within 3.0% of PT-Scotch. Our ScalaPart approach is (i) 92.6% faster than ParMetis and 25.2% faster than PT-Scotch on 16 cores, and (ii) 97.2% faster than ParMetis and 11.4% faster than PT-Scotch on 32 cores. Additionally, our embedding results on 8, 16, and 32 cores show that, on average, our embedding implementation obtains a relative speedup of 1.73 each time the number of cores doubles.

The remainder of this chapter is organized as follows. Section 6.1 provides background

on graph embedding and graph partitioning. Section 6.2 presents our key contribution, a paral-

lel embedding enabled geometric partitioning. We present our experimental evaluation in Sec-

tion 6.3 followed by a brief conclusion in Section 6.4.


6.1 Background and Related Work

In this section, we discuss previous research in the areas of graph embedding and graph

partitioning. In an early paper Eades [66] provides a heuristic for drawing graphs using the

force-directed approach. Kamada and Kawai [67] present a force-directed approach where all

vertices are connected by springs with lengths proportional to the graph distance between ver-

tices. Fruchterman and Reingold [68] use two types of forces in their method. Edges in the

graph are represented by springs and exert attractive forces, while simultaneously vertices repre-

sent charged particles and experience repulsion. Battista et al. [69] provide a detailed overview

of graph embedding algorithms in their paper. In a recent paper, Hu [70] proposed a Barnes-Hut structuring of the repulsive force computation for the spring-electric model that reduces the computational cost from O(|V|^2) to O(|V| log(|V|)). In an interesting paper, Walshaw [71] proposed a

multilevel solution to the graph embedding problem. Harel and Koren [72] have also proposed

a graph embedding algorithm for high-dimensional spaces. Godiyal et al. [73] proposed a fast

implementation of the spring-electric model on GPU architectures using multipole expansions.

Graph partitioning schemes have been studied under three main categories, namely, spec-

tral schemes, multilevel combinatorial schemes, and geometric schemes. Lipton and Tarjan [74]

in an early paper, proposed a separator theorem for partitioning planar graphs. Hendrickson and Leland [75, 76] proposed a multilevel spectral bisection based graph partitioning algorithm that recursively divides the graph based on eigenvalues and eigenvectors (namely, the Fiedler vector). Karypis and Kumar [77–79] proposed Metis, a multilevel scheme to partition an irregular graph that implements a V-cycle comprising a coarsening of the source graph, followed by an initial


separation, and then successive refinement [80] and projection. This method was later imple-

mented for distributed systems and called ParMetis. Heath and Raghavan [17] in their paper,

proposed a geometric nested dissection-based partitioning scheme for sparse graphs. Miller et

al. [16] proposed a geometric partitioning scheme that projects the data to a higher-dimensional

coordinate space, obtains a separation in this space, and projects the separator back to the original space. Additionally, they proved a theorem showing that the separator obtained in the higher-dimensional space achieves better separation quality than existing adjacency-based schemes.

Our scheme is motivated by ChaNGa [102, 103], which develops a massively parallel cosmological simulation using the Charm++ parallel programming libraries. Our framework implements a Barnes-Hut tree-structured spring-electric model for highly distributed environments, thus making large-scale graph embedding feasible. Additionally, our framework provides a so-

phisticated interface that makes it easy to develop distributed tree-structured graph embedding

applications. We show that our embedding can potentially help develop scalable geometric graph

partitioning schemes.

6.2 ScalaPart: A Parallel Graph Embedding enabled Scalable Geometric Parti-

tioning

In this section, we present our contribution in the form of a parallel graph embedding enabled geometric partitioning that first generates a coordinate structure for the vertex set V of a

sparse graph G(V,E) and, subsequently, partitions it through a scalable data-parallel geometric

graph partitioning scheme. Our embedding and the geometric partitioning scheme are designed

using the Charm++ parallel libraries enabling shared and distributed memory parallelism.


Consider a sparse graph G(V,E) with a vertex set V and an edge set E. We define

a |V| × d sparse set X, in which each row x_i of X represents the d-dimensional (d ≥ 2) coordinate of vertex v_i. We develop ScalaPart by combining a scalable graph embedding frame-

work to obtain a d-dimensional geometry for the vertex set V and a data-parallel geometric

partitioning scheme. We first seek to obtain the vertex coordinates through a parallel tree-

structured graph embedding approach. We next seek to partition the vertex set V of G(V,E)

into k subgraphs G_1, . . . , G_k using the embedded coordinate structure, such that the number of edges connecting these subgraphs is reduced. We develop a data-parallel geometric parti-

tioning scheme [98], also using Charm++, to partition the vertex geometry obtained using our

embedding. Section 6.2.1 presents details of this parallel graph embedding framework, and Sec-

tion 6.2.2 describes the parallel implementation of the geometric partitioning scheme.

6.2.1 Structure of our Charm++ Parallel Graph Embedding

We adapt the Fruchterman-Reingold spring-electric graph embedding model [68] to de-

velop our parallel graph embedding approach. In this model, the vertices are associated with charged electric particles and edges represent springs that connect these charged particles. The traditional sequential model, as shown in Algorithm 3, computes two different types of forces on the vertices: (i) attractive spring forces between adjacent neighbors (f_a) and (ii) repulsive Coulombic forces between all pairs of vertices (f_r). Equations 6.1 and 6.2 define the attractive and repulsive forces, where K and C are algorithm parameters.

f_a(i, j) = ||x_i − x_j||^2 / K    (6.1)

f_r(i, j) = −C K^2 / ||x_i − x_j||    (6.2)
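For concreteness, Equations 6.1 and 6.2 translate directly into the following helper functions; the small eps guard against coincident points is our addition and not part of the original formulation.

import numpy as np

def attractive_force(xi, xj, K):
    """Spring force magnitude between adjacent vertices, Eq. (6.1)."""
    return np.linalg.norm(xi - xj) ** 2 / K

def repulsive_force(xi, xj, K, C, eps=1e-12):
    """Coulombic repulsion between any two vertices, Eq. (6.2)."""
    return -C * K ** 2 / (np.linalg.norm(xi - xj) + eps)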

As the number of vertices |V| increases, the repulsive force computation, which has a complexity of O(|V|^2), dominates the performance of every iteration of the classical sequential model. Therefore, to decrease the overall complexity, tree-structured approaches similar to the

Barnes-Hut algorithm [104] have been proposed [70, 73]. A tree-structured approach reduces

the complexity of the all-pairs repulsive force computation from O(|V|^2) to O(|V| log(|V|)).

Additionally, such a tree-structured approach naturally enables a parallel implementation of the

algorithm.

We develop a parallel tree-structured implementation of the Fruchterman-Reingold spring-

electric model for embedding vertices of a graph G(V,E) in d-dimensions (d = 3) using the

Charm++ parallel implementation framework [15]. Our implementation derives motivation from

earlier work on cosmological simulation using Charm++ [102,103,105,106]. We define a chare

as a distributed object in the Charm++ system that is capable of performing force computations

on its local set of vertices. Our implementation of the spring electric model using Charm++,

involves three main steps at each iteration (i) distributing the vertices across multiple Charm++

chares or objects, (ii) building an octree decomposition at each chare using vertex positions, and

(iii) calculating forces on vertices using near neighbor and remote neighbor force computations.

Distributing vertices across Charm++ chares. In the Charm++ programming paradigm,

distributed objects are referred to as chares. Consider a distributed programming environment

with P processing elements and T chares with shared and distributed memory parallelism. The

number of chares T is a parameter that is set to at least the number of processing elements P


(T ≥ P ). In our framework, execution initiates through a main chare that is responsible for

creating and allocating resources to other chares. In the first step, the main chare creates a set of

worker chares that are responsible for a set of vertices. Subsequently, each worker chare loads

the adjacency structure and associates a uniformly random starting position in d dimensions to

each vertex in that chare. Henceforth, we restrict d to 3 dimensions; however, in practice our framework can be extended to higher-dimensional embeddings with relative ease.

Assigning keys to vertices. Once the initial three-dimensional vertex positions are

loaded, each worker assigns a 64-bit key to each vertex. In our model, we define our three-

coordinate directions as the x, y, and z directions. The 64-bit key is generated as a function of a vertex's x, y, and z coordinates. This facilitates the use of coordinate positions to locate

any vertex of G(V,E). These keys are used in the next step to redistribute and load balance

vertices across worker chares.
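The dissertation does not spell out the key function; one common choice for such position-based 64-bit keys is a Morton (Z-order) code that interleaves the bits of the quantized x, y, and z coordinates, as in the hypothetical sketch below (21 bits per dimension, with lo and hi denoting assumed bounding-box corners).

def morton_key(x, y, z, lo, hi, bits=21):
    """Hypothetical Morton (Z-order) key from a 3D position."""
    def quantize(v, lo_c, hi_c):
        # Map v into the integer range [0, 2^bits - 1].
        scaled = int((v - lo_c) / (hi_c - lo_c) * ((1 << bits) - 1))
        return max(0, min((1 << bits) - 1, scaled))

    def spread(v):
        # Insert two zero bits between consecutive bits of v.
        key = 0
        for i in range(bits):
            key |= ((v >> i) & 1) << (3 * i)
        return key

    qx, qy, qz = (quantize(v, l, h) for v, l, h in zip((x, y, z), lo, hi))
    return spread(qx) | (spread(qy) << 1) | (spread(qz) << 2)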

Sorting and redistributing vertices. Before we start building the octree representation

of the three-dimensional space, vertices need to be sorted based on their positions. The sorting

process has a two-fold purpose which ensures that (a) a particular worker chare is responsible for

vertices in a particular region of the geometry, and (b) each worker has approximately the same

number of vertices. Therefore, the sorting procedure iteratively decides a set of splitter keys that

seek to divide the vertex set into T balanced subsets. At each iteration, this procedure evaluates

a set of splitter keys until it finds a set that meets the required termination criteria. Gioachin et

al. [102] implemented the splitter-based technique in Charm++ for sorting a set of particles in


the context of scalable cosmological simulations.

Building an octree decomposition at each chare. Each worker chare constructs an

octree decomposition of the three-dimensional coordinate space of the vertices. In the octree

representation, a node represents a group of vertices that are bounded within a cube. Each chare

can have two main types of nodes in the octree representation, namely, local and nonlocal. A

local node resides on the chare and is immediately accessible, while a nonlocal node indicates

that it is a placeholder for a remote node and requires communication to fetch the remote node

data.

Calculating forces on particles. We compute attractive forces simply using direct neigh-

bor interactions. The repulsive force computation, that requires all-pair computations, involves

traversing the octree in a depth-first manner to reduce computational costs. Each chare begins

traversal of the octree at the root node. Consider a vertex V_i for which we need to compute repulsive forces due to all other vertices. In the octree method, the position of each non-leaf node is represented by the center of mass of its group of vertices. If the distance of V_i from the node position is within a user-defined limit θ, then this node is traversed further. Otherwise, this node stands in for its entire group of vertices, and the repulsive force is computed between this node and V_i.
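A sequential sketch of this traversal is shown below; it is written against a hypothetical octree node object with is_leaf, vertices, children, and center_of_mass attributes, and follows the distance-threshold rule described above rather than any particular implementation in this work.

import numpy as np

def repulsive_force_on(xi, node, K, C, theta):
    """Accumulate the Barnes-Hut approximation of the repulsion acting on xi."""
    force = np.zeros_like(xi)
    if node.is_leaf:
        for xj in node.vertices:
            d = np.linalg.norm(xj - xi)
            if d > 0:
                force += (-C * K ** 2 / d) * (xj - xi) / d
        return force
    d_cm = np.linalg.norm(node.center_of_mass - xi)
    if d_cm < theta:
        # Too close: open the node and recurse into its children.
        for child in node.children:
            force += repulsive_force_on(xi, child, K, C, theta)
    else:
        # Far away: treat the whole node as one body at its center of mass.
        force += (-C * K ** 2 / d_cm) * (node.center_of_mass - xi) / d_cm
    return force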

Update vertex positions. Each worker updates the positions of the vertices it owns using the resultant force information computed in the previous step.


6.2.2 Implementation of a Data Parallel Geometric Partitioning using Charm++

We develop a scalable data-parallel geometric partitioning based on [16] using the Charm++ parallel framework. The vertex coordinates obtained from our graph embedding, discussed in Section 6.2.1, form the input to the geometric partitioning algorithm, and our parallel implementation obtains a separator through the following series of seven steps.

Shift coordinates. The first step involves moving the vertices such that they are centered

around the origin. This operation is highly parallel and is performed by every worker chare.

Project vertex positions up. We next obtain a stereographic projection of the points from R^d to R^{d+1} centered around the origin. This operation can be performed by each worker chare independently.
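One standard realization of this step is the inverse stereographic map that sends a centered point x in R^d to the unit sphere in R^{d+1}; the sketch below illustrates that map and is not necessarily the exact formula used in [98].

import numpy as np

def project_up(points):
    """Map centered points in R^d onto the unit sphere in R^{d+1}.

    points : (n, d) array of shifted (zero-mean) vertex coordinates.
    Returns an (n, d+1) array of points on the unit sphere.
    """
    sq = np.sum(points ** 2, axis=1, keepdims=True)   # ||x||^2 per point
    upper = 2.0 * points / (sq + 1.0)                 # first d coordinates
    last = (sq - 1.0) / (sq + 1.0)                    # (d+1)-th coordinate
    return np.hstack([upper, last])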

Compute centerpoint. We compute a center point for the newly projected points. Each

worker chare first computes its local sum of the vertex positions. Subsequently, a collective op-

eration at the main chare finds a global sum and average using the local values.

Obtain conformal map. Finding a conformal mapping involves two steps. First, we rotate the projected points about the origin such that the centerpoint becomes a point (0, . . . , 0, r) on the (d+1)-th axis. Subsequently, we obtain a stereographic projection of the rotated points down to R^d. We scale all points in R^d by a factor of sqrt((1 − r)/(1 + r)). Finally, we project the scaled points back to R^{d+1}. This part of the computation involves computing a special point


called the Radon point [98]. The Radon point is computed in a distributed setting by first creating a random sample of the points at each worker chare. Each worker chare then computes its own value of a Radon point. These local values are communicated to the main chare, which computes the final value of the Radon point.

Compute great circle. The main chare obtains a random great circle to divide the set of

points in two sets. This great circle is communicated to all worker chares.

Project vertices down. Each worker chare next converts the great circle to a circle in ℜd

by reversing the dilation, rotation, and projection.

Compute separator from circle. Since our goal is to find an edge separator, the circle

obtained in the previous step can now be used to separate the points into two subsets. Points that intersect the circle are assigned proportionally to each subset.

The shifting, scaling, and higher-dimensional projection operations, as well as obtaining the cut, require minimal to no communication between chares. Therefore, our geometric graph partitioning can readily be scaled to large problem sizes. The next section provides an analysis of the algorithmic time complexity.

6.2.3 Complexity Analysis

As discussed in Sections 6.2.1 and 6.2.2, our approach consists of a parallel graph layout

and a parallel geometric partitioning.


The parallel algorithmic complexity of the graph layout (T_layout) is obtained as the summation of the time required to compute (a) the attractive forces and (b) the repulsive forces on P processing elements. The attractive force on each particle, calculated for each neighbor of that particle, requires O(μ ⌈|V|/P⌉) time, where μ is the average number of neighbors per particle. The repulsive force computation performs a Barnes-Hut type of tree-based force calculation in parallel that requires O(⌈|V|/P⌉ log⌈|V|/P⌉) time. Equation 6.3 presents T_layout, where c_1 is a constant.

T_layout = c_1 ⌈|V|/P⌉ log⌈|V|/P⌉ + μ ⌈|V|/P⌉    (6.3)

In our parallel geometric partitioning scheme [16], shifting and scaling the node positions such that they are centered around a zero mean requires O(μ ⌈|V|/P⌉) time on P processing elements. The centerpoint computation in parallel involves a d-ary tree (d = 6) that requires O(log P) time. Subsequently, the conformal mapping and the inertial matrix computation require a total of 2 O(μ ⌈|V|/P⌉) time. Equation 6.4 presents the total parallel time complexity of the partitioning algorithm, where c_2 is a constant.

T_partition = c_2 ⌈|V|/P⌉ + log P    (6.4)
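As an illustrative calculation (not a measured result), consider |V| = 2^20 vertices on P = 32 processing elements: ⌈|V|/P⌉ = 32,768 and log_2(32,768) = 15, so the tree-based repulsive term in Equation 6.3 corresponds to roughly 4.9 × 10^5 units of work per processing element per iteration, compared to about |V|^2/P ≈ 3.4 × 10^10 for the all-pairs formulation.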

6.3 Experiments and Evaluation

In this section, we evaluate the quality of cut and performance of ScalaPart. We first

provide details of our experimental setup and the evaluation metrics used to compare quality and

performance. Subsequently, we present our empirical evaluation of the method and compare it

to two popular graph partitioning schemes.


6.3.1 Experimental Setup and Evaluation Metrics

Our approach is implemented using the Charm++ parallel programming libraries on an

Intel Nehalem cluster. We report quality and performance results on a suite of 26 benchmark

graphs from the University of Florida sparse matrix collection [22]. We compare the quality and

performance of our method with two popular graph partitioning schemes, i.e., ParMetis [78] and

PT-Scotch [101].

Benchmark Matrices. We report our empirical evaluation on a set of 26 benchmark

graphs from the University of Florida sparse matrix collection. Table 6.1 presents the details of these 26 benchmark graphs.

Evaluation Metrics. Consider a graph G(V,E) that is partitioned into k subgraphs G_1, . . . , G_k. Our first evaluation metric, edgecut, determines the quality of the partition by counting the number of cross edges across the k subgraphs. We define outdegree(G_i, G_j) as the number of edges crossing from subgraph G_i to subgraph G_j. Equation 6.5 presents a formal definition of the edgecut Edgecut_{G_k} for k partitions of a graph G.

Edgecut_{G_k} = (1/2) Σ_{i=1}^{k} Σ_{j=1}^{k} outdegree(G_i, G_j)    (6.5)
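Equation 6.5 can be evaluated directly from an edge list and a partition vector; the sketch below counts each cross edge once, which corresponds to the factor of 1/2 that compensates for the symmetric double counting of outdegree(·, ·).

def edgecut(edges, part):
    """Count edges whose endpoints fall in different partitions.

    edges : iterable of (u, v) vertex pairs, each undirected edge listed once.
    part  : part[v] gives the partition index of vertex v.
    """
    return sum(1 for u, v in edges if part[u] != part[v])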

6.3.2 Empirical Results

In this section, we evaluate three main aspects of our framework: (i) the performance of

our graph embedding framework, (ii) the quality of partition obtained, and (iii) the performance


of the parallel partitioning also implemented using Charm++. We consider performance on 16

and 32 cores for 16 and 32 partitions.

Graph embedding performance. Figure 6.1 presents the average times required per

iteration to embed a sparse graph using our framework on 8, 16, and 32 cores. Typically, we

execute 200 iterations to obtain a layout that can be used by our partitioning implementation.

Our results indicate that, using our embedding scheme, we obtain an average relative speedup of 1.73 each time the number of cores doubles. Both the build-tree operation and the force computations have low communication volume; as a result, with an increase in the number of cores, individual processing elements have to perform less work.

Partition quality. In graph partitioning, it is desired that the edgecut obtained after par-

titioning is low. Figures 6.2 and 6.3 report the edgecuts (normalized) obtained using ParMetis,

PT-Scotch, and our scheme for 16 and 32 partitions. In order to normalize the edgecuts we

choose ParMetis as our base method (set to 1) and scale the other two methods accordingly. In

these figures, a lower value indicates a better partition quality. We observe that, on average, our

scheme is 7.4% better than ParMetis and is competitive with PT-Scotch. Partition quality of our

method on average is within 3% of PT-Scotch.

Partitioning performance. We evaluate the performance of ParMetis, PT-Scotch, and our scheme by measuring the execution time of each scheme.

normalized performance of the three schemes on a single core of a multiprocessor with ParMetis

representing the base method that is set to 1. On a single processor and 16 cuts we observe that


ParMetis is 13.5% and 2.9% faster than PT-Scotch and our scheme, respectively. For 32 cuts, we

observe that ParMetis is 17.8% and 3.8% faster than PT-Scotch and our scheme, respectively.

Figures 6.5(a) and (b) indicate the normalized time (in seconds) of the three schemes. ParMetis is again the base method (set to 1), and PT-Scotch and our scheme are scaled accordingly. We

observe that, on average, our scheme is 92.6% faster than ParMetis and is 25.2% faster than PT-

Scotch on 16 cores. When we double the number of cores to 32, our method is 97.2% faster than

ParMetis and, is 11.4% faster than PT-Scotch. Typically, a geometric scheme is more scalable

than a combinatorial scheme, as it is more data parallel and incurs less communication overhead.

6.3.3 Discussion on observed quality and performance

The success of a graph partitioning algorithm is based on its ability to balance the trade-

offs between quality and performance. Earlier partitioning schemes, like ParMetis and PT-

Scotch, take advantage of aggressive coarsening and subsequent refinement strategies to achieve

competitive cuts with reasonable performance. We demonstrate that our scheme could take advantage of such a multilevel partitioning strategy to reduce the overall time. Our multilevel

approach involves four main steps: (a) coarsen the graph using heavy-edge matching, (b) embed

the coarsened graph in three-dimensions, (c) obtain an initial geometric cut on the coarsened

graph using the embedded coordinates, and (d) project the cut to the finest level and then refine

it. In Figure 6.6 we report the total time required to coarsen the graph, partition the coarsened

graph, and subsequently refine the obtained cut. We observe that ScalaPart is 87.14% faster than

ParMetis and 3.45% faster than PT-Scotch.


However, we show with an example that, for a large class of problems with planar ge-

ometry, multilevel schemes, such as ParMetis and PT-Scotch, may not necessarily result in high

quality cuts. Figures 6.7(a) to (c) present surface cuts obtained using ParMetis, PT-Scotch and

our scheme. The mesh in Figure 6.7 originates from a heat exchanger flow problem. A combina-

torial partitioning scheme such as ParMetis, performs aggressive coarsening based on the graph

structure. However, Figure 6.7(a) clearly indicates that adjacency-based schemes can skew the

cut and result in a higher edgecut. In Figure 6.7(b), although the PT-Scotch cut is skewed, it is

marginally better than ParMetis. In both these cases, the geometry is not known to the algorithm.

In Figure 6.7, we focus on our scheme that provides a geometry based on a 3D embedding of the

sparse graph. We observe that geometric partitioning schemes can benefit from an efficient ge-

ometry that can reduce the size of the edge cut. In particular, our geometric partitioning scheme,

that is modeled using Charm++ based on [16], obtains a better cut, as it has a sense of the dis-

tribution of the points in three-dimensions and benefits by selecting a split point based on this

layout.

6.4 Chapter Summary

In this chapter, we developed a tree-structured parallel graph embedding and geometric

partitioning scheme using the Charm++ parallel programming system. We show that, on average, our scheme improves edgecuts by 7.4% relative to ParMetis and is within 3.0% of PT-Scotch. Our

ScalaPart approach is (i) 92.6% faster than ParMetis and is 25.2% faster than PT-Scotch on

16 cores, and is (ii) 97.2% faster than ParMetis and 11.4% faster than PT-Scotch on 32 cores.

Additionally, our embedding results on 8, 16, and 32 cores show that, on average, our embedding implementation obtains a relative speedup of 1.73 each time the number of cores doubles.


Algorithm 3 SequentialSpringElectricModel(G, x, δ)
  converged = FALSE
  step = initial step length
  t = step scaling constant
  E = ∞  /* energy */
  while converged == FALSE do
    x_0 = x
    E_0 = E
    E = 0
    for i ∈ V do
      f = 0
      for j ∈ adjacency(i) do
        f = f + (f_a(i, j) / ||x_j − x_i||) (x_j − x_i)
      end for
      for j ≠ i, j ∈ V do
        f = f + (f_r(i, j) / ||x_j − x_i||) (x_j − x_i)
      end for
      x_i = x_i + step · f / ||f||
      E = E + ||f||^2
    end for
    if E < E_0 then
      counter = counter + 1
      if counter >= 5 then
        counter = 0
        step = step / t
      end if
    else
      counter = 0
      step = step · t
    end if
    if ||x − x_0|| < δ then
      converged = TRUE
    end if
  end while


Algorithm 4 ParallelSpringElectricModel(G, x, δ)
  /* Charm++ implementation of the spring-electric model */
  step = initial step length
  t = step scaling constant
  E = ∞
  load particles (Worker Chare)
  assign keys (Worker Chare)
  for iter = 1, 2, . . . , MAX_ITER do
    E_0 = E
    E = 0
    sort particles (Worker Chare)
    build octree (Worker Chare)
    compute forces (Worker Chare):
    for i ∈ V do
      f = 0
      for j ∈ adjacency(i) do
        f = f + (f_a(i, j) / ||x_j − x_i||) (x_j − x_i)
      end for
      for j ≠ i, j ∈ V do
        if x_j is a near neighbor of x_i then
          f = f + (f_r(i, j) / ||x_j − x_i||) (x_j − x_i)
        else
          x_cm = center of mass of the far-neighbor tree node
          f = f + (f_r(i, cm) / ||x_cm − x_i||) (x_cm − x_i)
        end if
      end for
      x_i = x_i + step · f / ||f||
      E = E + ||f||^2
    end for
    update positions (Worker Chare)
    gather E from Worker Chares (Main Chare)
    update step (Main Chare):
    if E < E_0 then
      counter = counter + 1
      if counter >= 5 then
        counter = 0
        step = step / t
      end if
    else
      counter = 0
      step = step · t
    end if
  end for


Table 6.1. Details of benchmark graphs indicating the number of nodes (|V|) and edges (|E|) in the graph with a brief description of the application domain.

     Graph         |V|       |E|         Description
  1  linverse      11,999    95,977      statistical
  2  crystm02      13,965    322,905     materials
  3  Pres_Poisson  14,822    715,804     CFD
  4  olafu         16,146    1,015,156   structural
  5  gyro          17,361    1,021,159   optimization
  6  msc23052      23,052    1,142,686   structural
  7  aug3d         24,300    69,984      2D/3D
  8  aug2d         29,008    76,832      2D/3D
  9  wathen120     36,441    565,761     random
 10  mario001      38,434    204,912     2D/3D
 11  jnlbrng1      40,000    199,200     optimization
 12  gridgena      48,962    512,084     optimization
 13  oilpan        73,752    2,148,558   structural
 14  finan512      74,752    596,992     economic
 15  cont-201      80,595    438,795     optimization
 16  denormal      89,400    1,156,224   optimization
 17  s3dkq4m2      90,449    4,427,725   structural
 18  s3dkt3m2      90,449    3,686,223   structural
 19  shipsec1      140,874   3,568,176   structural
 20  Dubcova3      146,689   3,636,643   2D/3D
 21  cont-300      180,895   988,195     optimization
 22  d_pretok      182,730   1,641,672   2D/3D
 23  pwtk          217,918   11,524,432  structural
 24  Lin           256,000   1,766,400   eigenvalue
 25  mario002      389,874   2,097,566   2D/3D
 26  helm2d03      392,257   2,741,935   2D/3D


Fig. 6.1. Average layout time per iteration for benchmark graphs using 8, 16, and 32 cores. We include results for 8 cores to demonstrate the effect of doubling the number of cores.


Fig. 6.2. Normalized edge cuts for 16 partitions with ParMetis as base set to 1.

Fig. 6.3. Normalized edge cuts for 32 partitions with ParMetis as base set to 1.


Fig. 6.4. Normalized time for (a) 16 partitions and (b) 32 partitions on 1 core with ParMetis as base set to 1.


Fig. 6.5. Normalized time for (a) 16 partitions on 16 cores and (b) 32 partitions on 32 cores with ParMetis as base set to 1.


Fig. 6.6. Normalized time for 32 partitions on 32 cores with ParMetis as base set to 1. The above normalized times represent the total time, including time for a coarsening phase, embedding of the coarsened graph, partitioning of the coarsened graph, and subsequent refinement of the cut.


Fig. 6.7. A two-partition illustration of a graph from a heat exchanger flow problem using (a) ParMetis, (b) PT-Scotch, and (c) our framework. This example demonstrates that a geometric partitioning scheme clearly achieves a competitive edgecut with a good underlying embedding of the graph.


Chapter 7

A Multilevel Cholesky Conjugate Gradients Hybrid Solver for Linear Systems with

Multiple Right-hand Sides

The computational simulation of partial differential equation-based models using finite

difference or finite element methods involves the solution of large sparse linear systems [107].

The cost of the linear system solution often dominates overall execution time of such applications

and in many instances a linear system with the same coefficient matrix is solved for a sequence

of right-hand side vectors. In this chapter, we specifically consider the multiple right-hand side

case and seek efficient alternatives to preconditioned conjugate gradient-based solutions when

the coefficient matrix is symmetric positive definite and sparse.

Consider a linear system Ax = b, where A is the coefficient matrix. Sparse solvers

for such systems can be grouped into two broad categories, namely, direct [90,108] using sparse

Cholesky, and iterative [19,94] using conjugate gradients and its preconditioned variants. Sparse

direct solvers are designed to be robust but are mainly limited by large arithmetic and memory

costs that grow superlinearly [85]. Conversely, a sparse iterative solver such as conjugate gradients [89], and its preconditioned forms, requires only enough memory to store the coefficient matrix and a few additional vectors, but lacks robustness.

We propose a multilevel hybrid of preconditioned conjugate gradients [89] and Cholesky,

where Cholesky is applied on leaf submatrices and PCG is applied to subsystems corresponding


to higher levels in the tree with partial solutions aggregated and corrected to provide the overall

solution. We expect the setup costs to be higher than traditional PCG; however, these higher

setup costs can be amortized over faster solutions for multiple right-hand sides. In particular,

we seek a hybrid solver that can provide faster solutions than PCG for multiple right-hand sides.

We propose a tree-based substructuring of A in which leaf nodes of the tree represent subma-

trices of A. Nodes at higher levels of the tree represent the recursive coupling between these

submatrices. Our tree-structured partitioning, discussed in Chapter 6, can be used to obtain such

a substructuring of A.

The remainder of this chapter is organized as follows. Section 7.1 develops our key con-

tribution, our tree-structured hybrid solver. Section 7.2 presents our experiments and empirical

evaluation with a brief conclusion in Section 7.3. We note that the contents of this chapter are based on our research in [18].

7.1 A New Multilevel Sparse Cholesky-PCG Hybrid Solver

In this section, we develop a new tree-based sparse hybrid solver. We present the basic

idea by first developing a one-level hybrid solver. Subsequently, we generalize the one-level to a

multilevel hybrid solver with multiple levels of nested solves. In particular, we elaborate on the

solver design for solving a symmetric positive definite sparse linear system.

Consider a sparse linear system Au = b, where A is an n × n sparse symmetric positive definite matrix (A ∈ R^{n×n}), b ∈ R^n is an n × 1 column vector, and u ∈ R^n is the n × 1 solution vector. Direct and indirect schemes can both be used to solve this sparse linear system;

however, the choice depends on the tradeoffs between memory demands and robustness. Our

goal is to develop a tree-structured hybrid solution framework that benefits from both direct and


indirect schemes to obtain efficient solutions with low memory overheads for multiple right-hand side vectors b. Additionally, a tree-based multilevel method extends easily to parallel

execution environments. In developing this framework, we refer to earlier work on domain

decomposition schemes [20] for sparse direct [109] and sparse iterative solvers [110, 111].

7.1.1 A One-level Hybrid

The zero-nonzero structure of the sparse matrix A enables splitting A into a set of two

disjoint subdomains Ω_1 and Ω_2 that are separated by another smaller subdomain Γ. Each subdomain is a supernode comprising multiple nodes, and this partitioning of A can be interpreted as a supernodal tree, where Ω_1 and Ω_2 represent leaf nodes rooted at Γ. The decomposition of

A into a blocked system and its supernodal tree representation form the core of our tree-based

hybrid solver. Section 7.1.1.1 presents an in-depth description of our one-level supernodal tree

construction. Furthermore, in our tree-based representation, smaller subsystems at the leaf nodes

could potentially be solved using a sparse direct solver to obtain a part of the solution vector.

Consequently, the partial solutions could be coupled at the root node of the tree using an iterative

scheme to obtain the complete solution. We provide a detailed description of this scheme for a

one-level tree in Section 7.1.1.2.

7.1.1.1 Obtaining a Tree-structured Aggregate of the Coefficient Matrix A

We split A into two disjoint subdomains connected by a separating subdomain. We

apply the nested dissection [112–115] algorithm to obtain this split; however, a split could be

obtained using any efficient geometric [116] or combinatorial graph partitioning scheme [114].

Earlier such partitioning has been used for developing hybrid preconditioners [117] for sparse


linear systems. We obtain a reordering of A such that (a) rows and columns belonging to each

subdomain are numbered contiguously, and (b) those comprising the separating subdomain are

numbered higher than the disjoint subdomains.

Figures 7.1(a) to (c) represent a one-level nested dissection ordering of a sample matrix

A and the corresponding supernodal tree. In Figure 7.1(c), Σ(0) and Σ(1) represent the disjoint

subdomains. Σ(0 : 1) indicates the subdomain block that separates Σ(0) through Σ(1). Nodes

in Σ(0 : 1) are numbered higher than nodes in both Σ(0) and Σ(1).

Consider a permutation matrix P that represents this tree-structured reordering of A. An

n× n symmetric sparse matrix B is obtained by permuting A using P . Equation 7.1 represents

the permuted matrix B.

B = P A P^T    (7.1)

7.1.1.2 Constructing a Hybrid Solution Scheme using the Tree-structure

In earlier research on domain decomposition solvers, Mansfield [118] showed that for a

sparse linear system Au = b, if matrix A is reordered to an equivalent blocked representation (as

shown in Equation 7.2) using domain decomposition, then it can be solved efficiently by solving

a set of three linear systems.

B = [ B_11   B_21^T ]
    [ B_21   B_22   ]    (7.2)

Consider the blockwise equivalent linear system Bx = f, where x and f are permu-

tations of u and b corresponding to the permutations for deriving B from A. It is possible to

efficiently solve this system blockwise by making a simple assumption. The assumption is that


x_1, i.e., the first block component of x, can be expressed as the sum of two parts. The block

equation under these assumptions can be written as shown in Equation 7.3.

[ B_11   B_21^T ] [ x_1 ]   [ f_1 ]
[ B_21   B_22   ] [ x_2 ] = [ f_2 ],   where x_1 = x_1^S + x_1^D.    (7.3)

This system can be solved for x by first solving for x_1^D as shown in Equation 7.5. Equation 7.4 represents the coefficient matrix S for the subsystem corresponding to x_2.

S = B_22 − B_21 B_11^{-1} B_21^T    (7.4)

Subsequently, using x_1^D, we solve for x_2 and x_1^S as shown in Equations 7.6 and 7.7, respectively.

B_11 x_1^D = f_1,    (7.5)

S x_2 = f_2 − B_21 x_1^D,    (7.6)

B_11 x_1^S = −B_21^T x_2.    (7.7)
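As a small sanity check of this block procedure, with scalar blocks chosen purely for illustration, let B_11 = 4, B_21 = 2, B_22 = 3, and f_1 = f_2 = 1. Equation 7.5 gives x_1^D = 1/4, Equation 7.4 gives S = 3 − 2 · (1/4) · 2 = 2, Equation 7.6 gives x_2 = (1 − 2 · (1/4))/2 = 1/4, and Equation 7.7 gives x_1^S = −(2 · (1/4))/4 = −1/8, so that x_1 = x_1^S + x_1^D = 1/8. Substituting back confirms Bx = f.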

We derive our motivation from this approach and propose a scheme that is based on our

tree-based restructuring of A (discussed in Section 7.1.1.1). Consider a one-level tree with one

separator as the root and two disjoint subdomains as leaf nodes of the tree. In matrix notation,

we represent this tree structured linear system using Equation 7.8, with the assumptions that


B_11 x_1^S = −B_31^T x_3 and B_22 x_2^S = −B_32^T x_3.

[ B_11   0      B_31^T ] [ x_1 ]   [ f_1 ]
[ 0      B_22   B_32^T ] [ x_2 ] = [ f_2 ],   x_1 = x_1^S + x_1^D,  x_2 = x_2^S + x_2^D    (7.8)
[ B_31   B_32   B_33   ] [ x_3 ]   [ f_3 ]

Since B is a symmetric positive definite matrix, a Cholesky decomposition of B can be

represented as B = LL^T, where L is the sparse lower triangular factor. Equation 7.9 indicates

the blockwise Cholesky factorization of B.

[ B_11   0      B_31^T ]   [ L_11   0      0     ] [ L_11^T   0        L_31^T ]
[ 0      B_22   B_32^T ] = [ 0      L_22   0     ] [ 0        L_22^T   L_32^T ]    (7.9)
[ B_31   B_32   B_33   ]   [ L_31   L_32   L_33  ] [ 0        0        L_33^T ]

The Cholesky factor L can be computed blockwise as shown in Equations 7.10a to 7.10e.

B_11 = L_11 L_11^T,    (7.10a)
B_31 = L_31 L_11^T,    (7.10b)
B_22 = L_22 L_22^T,    (7.10c)
B_32 = L_32 L_22^T,    (7.10d)
B_33 = L_31 L_31^T + L_32 L_32^T + L_33 L_33^T.    (7.10e)


In our supernodal representation, L_11 and L_22 are computed at the leaf nodes and are reused to compute L_31 and L_32 through a sequence of linear system solutions as shown in Equations 7.11 and 7.12.

L_11 L_31^T = B_31^T    (7.11)

L_22 L_32^T = B_32^T    (7.12)

The final solution x is obtained using a series of intermediate steps. We first compute x_1^D and x_2^D as shown in Equations 7.13 and 7.14.

L_11 L_11^T x_1^D = f_1    (7.13)

L_22 L_22^T x_2^D = f_2    (7.14)

Equation 7.15a computes the coefficient matrix S for the separator linear system. Rewriting this expression using the Cholesky factors yields a simpler system as shown in Equation 7.15b.

S = B_33 − B_31 B_11^{-1} B_31^T − B_32 B_22^{-1} B_32^T    (7.15a)
  = L_33 L_33^T    (7.15b)

We then solve Equation 7.16 to compute x_3 using the PCG scheme with an incomplete Cholesky preconditioner.

S x_3 = f_3 − B_31 x_1^D − B_32 x_2^D    (7.16)


Subsequently, we compute x_1^S and x_2^S using Equations 7.17 and 7.18.

L_11 L_11^T x_1^S = −B_31^T x_3    (7.17)

L_22 L_22^T x_2^S = −B_32^T x_3    (7.18)

The final solution x is computed by aggregating x_1^D, x_1^S, x_2^D, x_2^S, and x_3.
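A compact serial prototype of this one-level solve can be written with SciPy, as sketched below under our own assumptions (sparse LU factorizations standing in for sparse Cholesky, and unpreconditioned CG applied to the Schur complement through an implicit operator); this is an illustration, not the Matlab implementation evaluated in Section 7.2.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, cg, LinearOperator

def one_level_hybrid_solve(B, f, dom1, dom2, sep):
    """Solve B x = f with direct solves on two subdomains and CG on the separator.

    dom1, dom2, sep : integer index arrays for the two disjoint subdomains
    and the separator of a one-level supernodal tree.
    """
    B = sp.csr_matrix(B)
    B11 = B[dom1][:, dom1].tocsc()
    B22 = B[dom2][:, dom2].tocsc()
    B33 = B[sep][:, sep]
    B31 = B[sep][:, dom1]
    B32 = B[sep][:, dom2]
    lu1, lu2 = splu(B11), splu(B22)          # direct factorizations of the leaf blocks

    # Leaf solves: B11 x1D = f1 and B22 x2D = f2.
    x1D = lu1.solve(f[dom1])
    x2D = lu2.solve(f[dom2])

    # Implicit Schur complement S = B33 - B31 B11^{-1} B31^T - B32 B22^{-1} B32^T.
    def schur_matvec(v):
        return B33 @ v - B31 @ lu1.solve(B31.T @ v) - B32 @ lu2.solve(B32.T @ v)

    S = LinearOperator((len(sep), len(sep)), matvec=schur_matvec, dtype=B.dtype)
    rhs3 = f[sep] - B31 @ x1D - B32 @ x2D
    x3, _ = cg(S, rhs3)                      # separator solve

    # Coupling corrections: B11 x1S = -B31^T x3 and B22 x2S = -B32^T x3.
    x1S = lu1.solve(-(B31.T @ x3))
    x2S = lu2.solve(-(B32.T @ x3))

    # Aggregate the partial solutions.
    x = np.empty(B.shape[0])
    x[dom1], x[dom2], x[sep] = x1D + x1S, x2D + x2S, x3
    return x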

7.1.2 A Multilevel Tree-based Hybrid Solver

We apply our tree-structured reordering procedure recursively to each disjoint subdomain

until the desired number of levels η is reached. The above recursion can be represented in the

form of a supernodal tree where each level in the tree corresponds to one step in the recursion.

Therefore, the disjoint submatrices at the lowest level of the recursion, which form the leaf nodes

of the supernodal tree, are connected by a structured hierarchy of supernodal separators.

In a supernodal tree with multiple levels, we take a bottom-up approach and evaluate

the solution (as discussed above in Section 7.1.1.2) at each subtree. We aggregate the solution

at each subtree and pass it on to one level higher as the tree folds during the solution process.

Figures 7.2(a) to (c) represent a two-level nested dissection ordering of the same sample matrix A

as discussed in Figure 7.1 and the corresponding supernodal tree. In Figure 7.2(c) Σ(0) through

Σ(3) represent the disjoint subdomains. Σ(r : s) indicates the subdomain block that separates

Σ(r) through Σ(s) for r, s ∈ {0, 1, 2, 3}. Nodes in Σ(r : s) are numbered higher than nodes in

both Σ(r) and Σ(s). Therefore, Σ(0 : 3) contains the nodes with the highest numbering.


7.1.3 Computational Costs of our Hybrid Solver

Consider a sparse linear system Bx = f, where B is an n × n sparse matrix. We define μ(B) as the number of nonzeros in the matrix B. The computational costs of our hybrid solver include (a) the cost to set up the Cholesky factors of each subdomain and (b) the cost to solve each subdomain.

Setup cost (Φ_setup). The setup cost is computed as the sum of the squares of the number of nonzeros in each column of the Cholesky factor L of A. L_{*,j} represents the j-th column of L. Equation 7.19 provides an estimate of the setup cost.

Φ_setup = Σ_{j=1}^{n} μ^2(L_{*,j})    (7.19)

Solution cost (Φ_solve). Consider a supernodal tree representation of A with k levels, where level zero is at the root of the tree and level k indicates the leaf nodes. We define n_d, d = 1, . . . , 2^k, as the number of nodes in the d-th subdomain and n_s, s = 1, . . . , (2^k − 1), as the number of nodes in the s-th separator in our supernodal tree. We solve the d-th subdomain at level k using a sparse direct solver in time τ_d. For sparse matrices, τ_d is O(n_d^{1.5}) when the application domain is two-dimensional and O(n_d^2) in three dimensions. We solve the s-th separator block B_s using PCG with incomplete Cholesky preconditioning in O(μ(B_s)) + 2 μ(L_s) time, where L_s is the IC factor of B_s. Equation 7.20 presents the cost for the tree-structured solution of B.

Φ_solve = Σ_{d=1}^{2^k} τ_d + Σ_{s=1}^{2^k − 1} [ O(μ(B_s)) + 2 μ(L_s) ]    (7.20)

Therefore, the total cost Φ_hybrid is the summation of the setup cost and the total solution cost for m right-hand side vectors.

Φ_hybrid = Φ_setup + m Φ_solve
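Under the simplifying assumption (ours, not part of the original analysis) that a standalone PCG solve costs Φ_PCG operations per right-hand side, the hybrid becomes preferable once

Φ_setup + m Φ_solve < m Φ_PCG,   i.e.,   m > Φ_setup / (Φ_PCG − Φ_solve),

provided Φ_PCG > Φ_solve. This makes precise the sense in which the higher setup cost is amortized over multiple right-hand sides.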

7.2 Experiments and Evaluation

In this section, we present our experimental setup, an empirical evaluation of our solver

framework using a suite of benchmark datasets, and discussion of our results.

7.2.1 Experimental Setup

We implemented our hybrid framework in Matlab [41] using Metis [114] to obtain the

nested dissection ordering. We used the sparse Cholesky direct solver to compute the direct

solves and the PCG solver with Incomplete Cholesky preconditioner for iterative solves. We

report results on a suite of benchmark matrices from the University of Florida Sparse Matrix

Collection [1]. We report statistics for a total of 10 different randomly generated right-hand side vectors b per matrix.

Metrics for evaluation. We evaluate solver performance using the number of floating point operations and solution accuracy using the relative error. Additionally, we analyze the sensitivity of our hybrid solver performance to the number of levels in the supernodal tree and the number of right-hand sides.


Benchmark Matrices. We evaluate our solver on a suite of 12 benchmark matrices from the University of Florida Sparse Matrix Collection [1]. Table 7.1 lists the details of these matrices.

These matrices are typically obtained from discretization of partial differential equations using

finite element or finite difference methods.

Table 7.1. Benchmark matrices from the University of Florida Sparse Matrix Collection [1].

  Matrix     N        Nonzeros   Description
  nos7       729      4,617      Poisson's Equation in Unit Cube
  bcsstk09   1,083    18,437     Stiffness Matrix - Square Plate Clamped
  bcsstk10   1,086    22,070     Stiffness Matrix - Buckling of Hot Washer
  bcsstk11   1,473    34,241     Stiffness Matrix - Ore Car (Lumped Masses)
  bcsstk27   1,224    56,126     Stiffness Matrix - Buckling problem (Andy Mera)
  bcsstk14   1,806    63,454     Stiffness Matrix - Roof of Omni Coliseum, Atlanta
  bcsstk18   11,948   149,090    Stiffness Matrix - R.E. Ginna Nuclear Power Station
  bcsstk16   4,884    290,378    Stiffness Matrix - Corp. of Engineers Dam
  crystm01   4,875    105,339    FEM crystal free vibration mass matrix
  crystm02   13,965   322,905    FEM crystal free vibration mass matrix
  s1rmt3m1   5,489    217,651    Matrix from a static analysis of a cylindrical shell
  s1rmq4m1   5,489    262,411    Matrix from a static analysis of a cylindrical shell

7.2.2 Evaluation and Discussion

In Table 7.2 we report the operations count (in millions) required by PCG for our bench-

mark matrices and 10 repeated right-hand side vectors b. A lower count indicates higher performance, and the absence of a value indicates that the solution did not converge to the desired tolerance of 10^{-8}. We report our results for four matrix orderings (Natural, RCM, Nested Dissection, and MMD) and two preconditioners (Incomplete Cholesky with zero level-of-fill (IC0)


and drop-threshold (ICT) of 10^{-2}). In Table 7.2, our hybrid direct-iterative solver (Hybrid-DI)

has the lowest operations count in eight out of twelve matrices. On average, our method per-

forms 1.87 times faster than the best PCG and ordering combination. For the bcsstk10 matrix,

the gain in performance is as high as 7.36 times compared to the best PCG performance. How-

ever, for some matrices, such as crystm02, we observe a degradation in performance. For this

class of matrices, it is often difficult to find a compact node separator. Consequently, this leads

to very small blocks (leaf nodes) that are solved using Cholesky and very large separators that

are solved using PCG. Therefore, the coupling between these is not effective.

Table 7.3 presents the relative error in the final solution. We compute the relative error, defined as ||x* − x||/||x*||, where x* is the true solution of the linear system. We observe that PCG does not converge for a majority of the orderings even with the use of a preconditioner. However, our tree-based hybrid method is able to produce a solution with an accuracy of 10^{-4} or better.

For a sparse matrix A, we define Ops_HybridDI(A) as the operation count for our Hybrid-DI method, Ops_BestPCG(A) as the operation count for the best PCG performance, and Ops_AvgPCG(A) as the average PCG operation count across the different variants of PCG used in Table 7.2. We compute Speedup_best(A) as the ratio of Ops_BestPCG(A) to Ops_HybridDI(A). Additionally, we compute Speedup_avg(A) as the ratio of Ops_AvgPCG(A) to Ops_HybridDI(A).

Figure 7.3(a) presents Speedup_best(A), the speedup obtained by using our hybrid solver compared to the best PCG performance. Figure 7.3(b) presents Speedup_avg(A), the speedup using Hybrid-DI compared to an average PCG performance across different variants. We ob-

serve that certain matrices, such as bcsstk10, can obtain over seven times the performance gain

by using our tree-based hybrid scheme. However, there are also matrices, such as crystm01 and


crystm02, that are not good candidates for our hybrid scheme and may not benefit from this

method. In particular, this suggests that the success of our methodology is related to the un-

derlying structure and properties of the matrix A, which is an attribute of the application. We

anticipate that setup cost will decrease with multiple levels; however, convergence will be slower

for a given right-hand side vector. Consequently, the overall operation count may increase. How-

ever, for a specific number of right-hand sides, this tradeoff could be exploited to obtain the right

balance. For a problem with few right-hand sides, it may be important to consider a hybrid

solver with relatively small number of levels that also incurs low memory overheads. Figure 7.4

illustrates this using the example of the bcsstk10 matrix. For fewer levels in the tree, the memory

overheads due to sparse factorization could increase setup costs.

Effect of increasing tree-levels on solver performance. We perform a sensitivity study

on six of our best performing matrices from our benchmark suite to understand solver perfor-

mance as we increase tree levels. Figure 7.5(a) indicates the solver performance for 10 right-hand

sides and levels increasing from 1 to 5. We observe that the floating point operation count increases with the number of levels. Therefore, the number of levels should be set such that the memory demands of the largest subdomain block are satisfied while keeping setup costs low.

Effect of increasing right-hand sides on solver performance. Figure 7.5(b) presents the sensitivity of the hybrid solver performance to the number of right-hand sides and compares it to the best PCG result for our six best performing matrices. We observe that as the

number of right-hand sides increases, the solution cost dominates the performance. Additionally,


our method sustains higher performance compared to PCG, and speedup increases with a greater

number of right-hand sides.

7.3 Chapter Summary

In this chapter, we developed a multilevel tree-based Cholesky-PCG hybrid solver that

solves a linear system Ax = b with the same coefficient matrix but multiple right-hand sides.

For our test matrices, we show that our hybrid approach for obtaining a sparse linear system

solution is 1.87 times faster than the best performing PCG solution for multiple right-hand sides.

We believe that such a hybrid solution framework could maintain the accuracy of a sparse direct

solution while reducing memory demands. Additionally, a tree-structured approach enables us

to consider the development of a parallel implementation of our hybrid approach.


Fig. 7.1. (a) Matrix bcsstk11 with natural ordering; (b) a one-level nested dissection ordering of bcsstk11; (c) a supernodal tree representation of the one-level ordering.


Fig. 7.2. (a) Matrix bcsstk11 with natural ordering; (b) a two-level nested dissection ordering of bcsstk11; (c) a supernodal tree representation of the two-level ordering.


Table 7.2. Operation counts (in millions) for 10 repeated right-hand side vectors and PCG with IC level-of-fill (IC0) and drop-threshold (ICT) preconditioners using (a) natural ordering (NAT), (b) RCM ordering, (c) nested dissection (ND) ordering, and (d) minimum degree (MMD) ordering. Hybrid-DI represents our tree-based hybrid solver framework. Values in bold in the original table represent the best performance; a dash indicates no convergence.

  Matrix      NAT IC0   NAT ICT   RCM IC0    RCM ICT    ND IC0   ND ICT     MMD IC0   MMD ICT    Hybrid-DI
  nos7        7.73      1.97      7.72       12.21      -        32.37      -         32.56      2.95
  bcsstk09    -         11.60     -          583.60     -        687.51     -         -          10.42
  bcsstk10    -         35.54     -          -          -        -          -         -          4.83
  bcsstk11    -         -         -          -          -        -          -         -          16.08
  bcsstk27    -         11.51     -          579.48     -        -          -         -          10.42
  bcsstk14    -         55.21     -          -          -        -          -         -          24.15
  bcsstk18    -         985.15    -          -          -        -          -         -          247.75
  bcsstk16    986.89    144.01    1,180.21   1,257.05   -        2,128.73   -         2,005.01   275.54
  crystm01    30.51     5.27      30.51      24.13      -        58.63      -         69.66      104.10
  crystm02    93.42     10.82     93.43      73.46      -        193.60     -         190.17     911.31
  s1rmt3m1    -         432.19    -          5,795.08   -        -          -         -          228.93
  s1rmq4m1    -         507.84    -          2,079.73   -        -          -         -          311.53

Table 7.3. Relative error for 10 repeated right-hand side vectors and PCG with IC Level-of-fill (IC0) and drop-theshold (ICT) preconditioner using (a) natural ordering (NAT), (b) RCMordering, and (c) nested dissection (ND) ordering, and (d) minimum degree (MMD) ordering.Hybrid-DI represents our tree-based hybrid solver framework.

Matrix      NAT-IC0   NAT-ICT   RCM-IC0   RCM-ICT   ND-IC0   ND-ICT    MMD-IC0   MMD-ICT   Hybrid-DI

nos7        4.8E-06   1.3E-06   5.2E-06   2.0E-06     -      7.0E-06     -       1.6E-06    1.7E-04
bcsstk09      -       0.0E+00     -       0.0E+00     -         -        -          -       0.0E+00
bcsstk10      -       9.5E-06     -          -        -         -        -          -       0.0E+00
bcsstk11      -          -        -          -        -         -        -          -       2.6E-05
bcsstk27      -       0.0E+00     -       0.0E+00     -         -        -          -       0.0E+00
bcsstk14      -       6.1E-06     -          -        -         -        -          -       1.9E-06
bcsstk16    0.0E+00   0.0E+00   0.0E+00   0.0E+00     -      0.0E+00     -       0.0E+00    0.0E+00
bcsstk18      -       4.2E-05     -          -        -         -        -          -       3.1E-05
crystm01    0.0E+00   0.0E+00   0.0E+00   0.0E+00     -      0.0E+00     -       0.0E+00    0.0E+00
crystm02    0.0E+00   0.0E+00   0.0E+00   0.0E+00     -      0.0E+00     -       0.0E+00    0.0E+00
s1rmt3m1      -       4.1E-06     -       6.2E-04     -         -        -          -       2.8E-06
s1rmq4m1      -       4.1E-06     -       5.0E-06     -         -        -          -       2.4E-06


Fig. 7.3. Speedup obtained by our method over (a) the best PCG and ordering combination, and (b) the average PCG performance across different variants for 10 right-hand sides. A speedup greater than 1 indicates improvement.


Fig. 7.4. Impact on performance with increasing tree levels and number of right-hand sides for the bcsstk10 matrix.


Fig. 7.5. Effect of increasing (a) tree levels and (b) right-hand sides on the operation count (in millions) of our hybrid solver compared to the best PCG performance for 10 right-hand sides.


Chapter 8

Discussion

We conclude this dissertation by discussing our major research findings, associated open

problems, and possible extensions to our work on combining geometry and structure for large-scale data analysis and parallel scientific computing. Our work is motivated by the fact that research in many areas of science and engineering increasingly relies on data-driven computational modeling and analysis. The underlying data in these applications are typically sparse and possess a high-dimensional geometry. Additionally, the sparsity of the data can be represented

as a graph. The key challenge is to develop hybrid scalable approaches that can utilize the spar-

sity of the graph representation of the data to manipulate the high-dimensional geometry. In this

dissertation, we approached this problem in two parts with an application perspective. In the

first part, we considered improvements for the classification problem in data analysis. In the

second part, we focused on scalable domain decomposition for parallel scientific computing on

multicore-multiprocessors.

In the first part of this dissertation, we begin by focusing on unsupervised classification

and then on supervised classification. Typically, applications that do not possess prelabeled ob-

servations must rely on unsupervised classification to uncover groupings in the data. In

Chapter 3, we developed a new feature subspace transformation scheme that iteratively combines

geometry-based distance measures with graph-based measures to enhance K-Means clustering.


Our FST scheme first forms a sparse entity-to-entity weighted graph indicating sparse relation-

ships in the data. We then transform the geometry of the data iteratively by bringing related

entities in the graph close together. This transformed data is subsequently clustered using a

traditional K-Means implementation. The strength of a relationship between two entities is di-

rectly proportional to the edge weight in the entity-to-entity graph and depends inversely on the

geometric separation between related entities in the high-dimensional space.

Since we seek to bring related entities closer, the strength of the relationship is higher

when related entities are far apart and gradually decreases as the algorithm iterates. It is an open

research problem to determine a domain-specific similarity function that best describes the entity

relationships. Our results indicate that on average FST-K-Means improves accuracy by 14.9%

relative to K-Means and by 23.6% relative to multilevel K-Means (GraClus). Furthermore, FST-

K-Means achieves up to 44.8% improvement in cluster cohesiveness relative to K-Means and up to 37.9% improvement relative to GraClus. Additionally, we investigate the cohesiveness of

the obtained clusters and show that our FST-K-Means algorithm consistently satisfies the opti-

mality criterion for cluster cohesiveness (i.e., it lies within theoretical upper and lower bounds).

Another interesting open research problem is whether FST can be combined with other cluster-

ing methods to achieve similar gains. We plan to address open problems concerning this research

in our future work.
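
The overall pattern (build a sparse relationship graph over the entities, nudge each entity toward its graph neighbors, then cluster the transformed geometry) can be sketched in a few lines of Python. The synthetic data, the 10-nearest-neighbor graph, the fixed step size, and the simple neighborhood-mean pull below are illustrative assumptions; they are not the actual FST update rule.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)    # synthetic entities (assumed)

W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = 0.5 * (W + W.T)                       # symmetrized sparse entity-to-entity graph

Xt = X.copy()
for _ in range(5):                        # a few smoothing iterations (assumed count)
    deg = np.asarray(W.sum(axis=1)).ravel()
    neighbor_mean = W.dot(Xt) / deg[:, None]
    Xt = Xt + 0.2 * (neighbor_mean - Xt)  # pull each entity toward its graph neighbors

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xt)

Any off-the-shelf K-Means implementation can then be applied to the transformed points, which is the sense in which the transformation acts as a preprocessing step rather than a new clustering algorithm.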

We next consider transformations for enhancing supervised classification with prelabeled

data. In Chapter 4, we developed a data transformation for prelabeled data that utilizes local

neighborhoods in the similarity neighborhood graph of the training data to manipulate the high-

dimensional geometry. Our Similarity Graph Neighborhood (SGN) approach transforms the

geometry of the training data by applying displacements to entities based on their neighborhood.


This procedure can be viewed as a smoothing of the training data in order to obtain a better

separation. In the presence of nonlinearities in the data, an open problem associated with this

approach is determining appropriate sparse entity neighborhood. In our current scheme, we con-

sider level-based nearest neighbors (a γ-neighborhood) to define a neighborhood for an entity.

However, it would be interesting to define neighborhoods adaptively based on approaches such as the node degree of the entity or the density of nodes in different regions of the high-dimensional

space. In our SGN classifier, the boundary obtained in the transformed space is mapped back to

the original space for subsequently classifying test data. Our SGN transformation, on average,

enhances the quality of a Linear Discriminant (LD) classifier by 5.0% and a Support Vector Ma-

chine (SVM) classifier by 4.52%. We believe that our SGN approach would be more effective

if the test data is projected and tested in the transformed space. However, developing such a

projection for our SGN transformation is an open research problem. We plan to address these

problems related to the SGN transformation in our future work.
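
A minimal sketch of the general idea (displace each prelabeled training point toward a local neighborhood summary, then train a standard classifier on the transformed points while testing in the original space) is given below. The same-class nearest-neighbor displacement, the neighborhood size of 6, the step size of 0.3, and the linear SVM are illustrative assumptions and do not reproduce the exact SGN construction or its boundary mapping.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)   # assumed data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

X_sgn = X_tr.copy()
for c in np.unique(y_tr):                        # smooth each class separately
    idx = np.where(y_tr == c)[0]
    nbrs = NearestNeighbors(n_neighbors=6).fit(X_tr[idx])
    _, nn = nbrs.kneighbors(X_tr[idx])
    # displace each point toward the mean of its same-class neighborhood
    X_sgn[idx] += 0.3 * (X_tr[idx][nn].mean(axis=1) - X_tr[idx])

clf = LinearSVC(dual=False).fit(X_sgn, y_tr)     # boundary learned on the transformed training data
print("test accuracy:", clf.score(X_te, y_te))   # test data remains in the original space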

In the second part of this dissertation, we address the challenge of developing a scalable

graph partitioning scheme by combining geometric and structural properties of sparse graphs.

In Chapter 6, we develop ScalaPart as a parallel graph embedding approach coupled with a

scalable geometric partitioning scheme. In earlier chapters in Part I, the data originally had a

high-dimensional geometry that we could manipulate using the sparse graph structure. How-

ever, ScalaPart begins with a sparse graph structure and a uniformly random layout and itera-

tively converges to a meaningful geometric layout. This layout can subsequently be partitioned

by a scalable geometric partitioning scheme. We developed the embedding enabled partitioning

approach using the Charm++ parallel programming system to achieve scalability. Our embed-

ding results indicate that on average we achieve a 1.78 times speedup as the number of cores


doubles. The partition quality obtained using our implementation of the geometric partitioning

scheme in Charm++ is, on average, 7.4% better than ParMetis and remains within 3.0% of PT-

Scotch. Our geometric partitioning scheme in ScalaPart performs 92.6% faster than ParMetis

and 25.2% faster than PT-Scotch on 16 cores. On 32 cores, our scheme is 97.2% faster than

ParMetis and 11.4% faster than PT-Scotch. We believe that as the number of cores increases, our

scheme will be more scalable than established schemes such as ParMetis or PT-Scotch, since a

geometric partitioning approach is highly data-parallel. However, we would need to verify our

claim through experiments on a larger number of cores, which remains an open question. Since

our framework is developed using the Charm++ libraries, it is also independent of the underly-

ing MPI (or OpenMP) implementation, thus making it easily portable to other architectures. We

plan to address open problems related to ScalaPart in our future work.
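
The embed-then-partition pipeline can be illustrated sequentially in Python: compute a force-directed layout of a sparse graph, then split the vertices by recursive median cuts on the resulting coordinates. NetworkX's spring_layout, the random geometric test graph, and the two-level recursion are illustrative stand-ins; this sketch does not reproduce ScalaPart's parallel Charm++ embedding or its geometric partitioner.

import numpy as np
import networkx as nx

G = nx.random_geometric_graph(200, 0.12, seed=1)     # an example sparse graph (assumed)
pos = nx.spring_layout(G, seed=1)                    # force-directed embedding of the graph
coords = np.array([pos[v] for v in G.nodes()])

def coordinate_bisection(index, coords, depth):
    """Recursively split vertex indices at the median of the widest coordinate."""
    if depth == 0:
        return [index]
    axis = np.argmax(coords[index].max(axis=0) - coords[index].min(axis=0))
    order = index[np.argsort(coords[index, axis])]
    half = len(order) // 2
    return (coordinate_bisection(order[:half], coords, depth - 1) +
            coordinate_bisection(order[half:], coords, depth - 1))

parts = coordinate_bisection(np.arange(len(coords)), coords, depth=2)   # 4 balanced parts
print([len(p) for p in parts])

Because the cuts operate only on vertex coordinates, this style of partitioning is naturally data-parallel, which is the property we rely on for scalability.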

In Chapter 7, we developed a multilevel tree-based Cholesky-PCG hybrid solver that

solves a linear system Ax = b with the same coefficient matrix but multiple right-hand sides.

The first step in our hybrid approach obtains a tree-structured recursive partitioning of A into

groups of blocks and their separators. We believe that our ScalaPart framework could be easily

adapted to obtain this tree-structured domain decomposition using the geometry of A. However,

an important research problem is to determine an optimal depth of this tree that can balance

tradeoffs between computation and communication. We demonstrate, on a suite of benchmark

matrices, that our hybrid approach for obtaining a sparse linear system solution is 1.87 times

faster than the best-performing PCG solution for multiple right-hand sides. We believe that

such a hybrid solution framework could maintain the accuracy of a sparse direct solution while

reducing memory demands on multicore-multiprocessors. We plan to address these open problems in our future work.
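
One common way a direct/iterative hybrid over a block-plus-separator splitting can be organized is sketched below: the subdomain blocks are factored directly, and conjugate gradients is applied to the Schur complement system on the separator unknowns. The 2-D Poisson matrix, the single grid-row separator, and SciPy's LU factorization in place of Cholesky are illustrative assumptions; this is a sketch of the style of method, not necessarily the exact scheme of Chapter 7.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 40
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()   # SPD 2-D Poisson matrix
b = np.random.rand(A.shape[0])

sep = np.arange((n // 2) * n, (n // 2) * n + n)   # one grid row as the separator (assumed)
blk = np.setdiff1d(np.arange(A.shape[0]), sep)    # union of the independent subdomain blocks

A_BB = A[blk][:, blk].tocsc()
A_BS = A[blk][:, sep].tocsr()
A_SS = A[sep][:, sep].tocsr()
factor = spla.splu(A_BB)                          # direct (Cholesky-like) factorization of the blocks

def schur_matvec(x_s):                            # S x = (A_SS - A_SB A_BB^{-1} A_BS) x
    return A_SS @ x_s - A_BS.T @ factor.solve(A_BS @ x_s)

S = spla.LinearOperator((len(sep), len(sep)), matvec=schur_matvec)
rhs_s = b[sep] - A_BS.T @ factor.solve(b[blk])
x_s, _ = spla.cg(S, rhs_s, maxiter=500)           # CG on the separator (Schur complement) system
x_b = factor.solve(b[blk] - A_BS @ x_s)           # back-substitute for the block unknowns

x = np.empty_like(b)
x[blk], x[sep] = x_b, x_s
print(np.linalg.norm(A @ x - b))                  # small residual confirms the combined solve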


Over the next few years, we plan to continue our research on sparsity-aware algorithms for data mining and parallel scientific computing in multicore-multiprocessor environments. We believe that scientific algorithms for high-dimensional data will need to utilize both sparse structure and high-dimensional geometry to achieve precise, scalable models. Some in-

teresting research problems that we would like to investigate include scalable data mining for

heterogeneous architectures, analysis of large time varying sparse graphs, and scalable solvers

for exascale platforms.


Vita

Anirban Chatterjee is a graduate student in the Department of Computer Science and

Engineering at The Pennsylvania State University, working in the Scalable Scientific Computing

Laboratory, under the guidance of Professor Padma Raghavan.

Anirban came to Penn State with Bachelor and Master of Science degrees in Computer Science from the University of Pune. While working on his Master's program, he worked with Dr. Govind Swarup at the Tata Institute of Fundamental Research (TIFR) on computational modeling in radio astronomy, where he developed an algorithm for locating radio frequency interference using the Giant Metrewave Radio Telescope.

While at Penn State, Anirban worked on data mining and computational modeling problems. Between 2006 and 2009 he also worked on software engineering of phase-field codes for the Center for Computational Materials Design (CCMD) with Professor Long-Qing Chen (Department of Materials Science and Engineering). Since 2006, Anirban has been an active student member of IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery), and SIAM (Society for Industrial and Applied Mathematics).

In the summer of 2009, Anirban was awarded a Master of Engineering degree from the Department of Computer Science and Engineering. During Fall 2010, he was a graduate

instructor for CMPSC 450 (Concurrent Scientific Computing).

In his free time, Anirban enjoys photography and traveling besides spending time with

his family and rabbit, Purple.