Real-Time Analytics for Complex Structure Data
Ting Guo
A Thesis submitted for the degree of Doctor of Philosophy
Faculty of Engineering and Information Technology
University of Technology, Sydney
2015
Certificate of Authorship and Originality
I certify that the work in this thesis has not previously been submitted for a degree nor
has it been submitted as part of requirements for a degree except as fully acknowledged
within the text.
I also certify that the thesis has been written by me. Any help that I have received in my
research work and the preparation of the thesis itself has been acknowledged. In addition,
I certify that all information sources and literature used are indicated in the thesis.
Signature of Student:
Date:
Acknowledgments
On having completed this thesis, I am especially thankful to my supervisor Prof. Chengqi
Zhang and co-supervisor Prof. Xingquan Zhu, who led me into a once-unfamiliar area of
academic research, trusted me, and gave me as much freedom as possible to pursue my own
research interests. Prof. Zhu has taught me how to think and study independently and how
to solve difficult scientific problems in flexible but rigorous ways. He has sacrificed much
of his precious time to develop my academic research skills. When I felt lost and anxious
about my future, he always gave me the confidence and motivation to keep going and strive
to do better. Prof. Zhang has also given me great help and support in life.
I am thankful to the group members I met at the University of Technology, Sydney,
including Shirui Pan, Lianhua Chi, Jia Wu, and many others. I learned a lot from these
smart people, and I was always inspired by the interesting and in-depth discussions with
them. I enjoyed the wonderful atmosphere we shared, in both academic research and daily
life.
I am incredibly grateful to my mother and father for their generosity and encourage-
ment. This thesis could not have been completed without their constant support and under-
standing. I am also thankful to my friends who have accompanied me, though not always
at my side, through the arduous journey of these three years.
Contents
1 Introduction 19
1.1 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2 CONTRIBUTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3 PUBLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.4 THESIS STRUCTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Literature Review 31
2.1 PRELIMINARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 GRAPH CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 FREQUENT SUB-GRAPH MINING (FSM) . . . . . . . . . . . . . . . . 34
2.4 SUB-GRAPH FEATURE SELECTION . . . . . . . . . . . . . . . . . . 35
2.5 DATA STREAM MINING . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 REAL-TIME ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7 ROADMAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Understanding the Roles of Sub-graph Features for Graph Classification: An
Empirical Study Perspective 39
3.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 PROBLEM FORMULATION . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Graph and Sub-graph . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Frequent Sub-graph Mining . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 Graph Classification . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 EXPERIMENTAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Benchmark Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Sub-graph Features . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Graph Hashing and Factorization for Fast Graph Stream Classification 59
4.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 PROBLEM DEFINITION . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 GRAPH FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Factorization Model . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 FAST GRAPH STREAM CLASSIFICATION . . . . . . . . . . . . . . . . 68
4.4.1 Overall Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.2 Graph Clique Mining . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Clique Set Matrix and Graph Factorization . . . . . . . . . . . . . 72
Discriminative Frequent Cliques . . . . . . . . . . . . . . . . . . . 72
Feature Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.4 Graph Stream Classification . . . . . . . . . . . . . . . . . . . . . 75
4.5 EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.1 Benchmark Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 79
Graph Stream Classification Accuracy . . . . . . . . . . . . . . . . 79
Graph Stream Classification Efficiency . . . . . . . . . . . . . . . . 83
5 Super-graph based Classification 85
5.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 PROBLEM DEFINITION . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 OVERALL FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 WEIGHTED RANDOM WALK KERNEL . . . . . . . . . . . . . . . . . 89
5.4.1 Kernel on Single-attribute Graphs . . . . . . . . . . . . . . . . . . 91
5.4.2 Kernel on Super-Graphs . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 SUPER-GRAPH CLASSIFICATION . . . . . . . . . . . . . . . . . . . . 96
5.6 THEORETICAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.7 EXPERIMENTS AND ANALYSIS . . . . . . . . . . . . . . . . . . . . . 98
5.7.1 Benchmark Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 Streaming Network Node Classification 105
6.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 PROBLEM DEFINITION AND FRAMEWORK . . . . . . . . . . . . . . 109
6.3 THE PROPOSED METHOD . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.1 Streaming Network Feature Selection . . . . . . . . . . . . . . . . 113
Feature Selection on a Static Network . . . . . . . . . . . . . . . . 113
Feature Selection on Streaming Networks . . . . . . . . . . . . . . 117
6.3.2 Node Classification on Streaming Networks . . . . . . . . . . . . . 121
6.4 EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.2 Performance on Static Networks . . . . . . . . . . . . . . . . . . . 125
6.4.3 Performance on Streaming Networks . . . . . . . . . . . . . . . . 128
6.4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7 Conclusion 133
7.1 SUMMARY OF THIS THESIS . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
List of Figures
2-1 Graph examples collected from different domains. . . . . . . . . . . . . . . 33
2-2 The overall roadmap of this thesis. . . . . . . . . . . . . . . . . . . . . . . 38
3-1 An example of sub-graph pattern representation. The left panel shows two
graphs, G1 and G2, and the right panel gives two indicator vectors indicating
whether a sub-graph exists in each graph. . . . . . . . . . . . . . . . . . . 40
3-2 The runtime of frequent sub-graph pattern mining with respect to the in-
creasing number of edges of sub-graphs. . . . . . . . . . . . . . . . . . . . 42
3-3 A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a). . . . . 44
3-4 An example of graph isomorphism. . . . . . . . . . . . . . . . . . . . . . . 45
3-5 Graph representation for a paper (ID17890) in DBLP. The node in red is the
main paper, nodes in black ellipses are citations, and nodes in black boxes
are keywords. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-6 Classification accuracy on five NCI chemical compound datasets with re-
spect to different sizes of sub-graph features (using Support Vector Ma-
chines: Lib-SVM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3-7 Classification accuracy on D&D protein dataset and DBLP citation dataset
with respect to different sizes of sub-graph features (using Support Vector
Machines: Lib-SVM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3-8 Classification accuracy on one NCI chemical compound dataset, D&D pro-
tein dataset, and DBLP citation dataset with respect to different sizes of
sub-graph features (using Nearest Neighbours: NN). . . . . . . . . . . . . 54
4-1 Coarse-grained vs. fine-grained representation. . . . . . . . . . . . . . . . 60
4-2 An example of graph factorization. . . . . . . . . . . . . . . . . . . . . . . 64
4-3 The framework of FGSC for graph stream classification. . . . . . . . . . . 69
4-4 An example of clique mining in a compressed graph. . . . . . . . . . . . . 71
4-5 An example of “in-memory” Clique-class table Γ. . . . . . . . . . . . . . . 74
4-6 Graph representation for a paper (ID17890) in DBLP. . . . . . . . . . . . . 77
4-7 Accuracy w.r.t. different chunk sizes on DBLP Stream. The number of
features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800;
(c) 600. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-8 Accuracy w.r.t. different numbers of features on DBLP Stream with each
chunk containing 1000 graphs. The number of features selected in each
chunk is: (a) 307; (b) 142; (c) 62. . . . . . . . . . . . . . . . . . . . . . 80
4-9 Accuracy w.r.t. different classification methods on DBLP Stream with each
chunk containing 1000 graphs, and the number of features in each chunk
is 142. The classification methods are: (a) NN; (b) SMO; (c) NaiveBayes. . 80
4-10 Accuracy w.r.t. different chunk sizes on IBM Stream. The number of fea-
tures in each chunk is 75. The batch sizes vary as: (a) 500; (b) 400; (c)
300. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4-11 Accuracy w.r.t. different numbers of features on IBM Stream with each
chunk containing 400 graphs. The number of features selected in each
chunk is: (a) 148; (b) 75; (c) 43. . . . . . . . . . . . . . . . . . . . . . . 81
4-12 Accuracy w.r.t. different classification methods on IBM Stream with each
chunk containing 400 graphs, and the number of features in each chunk
is 75. The classification methods are: (a) NN; (b) SMO; (c) NaiveBayes. . . 81
4-13 Accumulated system runtime using the NN classifier, where |D| = 1000,
|m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively.
(a) Results on DBLP stream; (b) Results on IBM stream. . . . . . . . . . . 83
5-1 (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph. 86
5-2 A conceptual view of a protein interaction network using super-graph rep-
resentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5-3 WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1,
g2, g3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5-4 An example of using super-graph representation for scientific publications. . 99
5-5 Super-graph and comparison graph representations. . . . . . . . . . . . . . 100
5-6 Classification accuracy on DBLP and Beer Review datasets w.r.t. different
classification methods (NB, DT, SVM, and NN). . . . . . . . . . . . . . . 102
5-7 Classification accuracy on Beer Review dataset w.r.t. different datasets and
classification methods (NB, DT, SVM, and NN). . . . . . . . . . . . . . . 103
5-8 The performance w.r.t. different edge-cutting thresholds on DBLP and Beer
Review datasets by using WRWK method. . . . . . . . . . . . . . . . . . 104
6-1 An example of streaming networks, where each color bar denotes a feature. 106
6-2 An example of using feature selection to capture changes in a streaming
network (keywords inside each node denote node content). . . . . . . . . . 108
6-3 The framework of the proposed streaming network node classification (SNOC)
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6-4 An example of using feature selection to capture structure similarity. . . . . 115
6-5 The accuracies on three real-world static networks w.r.t. different numbers
of selected features (from 50 to 300). . . . . . . . . . . . . . . . . . . . . . 125
6-6 The accuracy on three networks w.r.t. (a) different maximal lengths of path
l (from 1 to 5), (b) different values of weight parameter ξ (from 0 to 1), and
(c) different percentages of labeled nodes. . . . . . . . . . . . . . . . . . . 126
6-7 The accuracy on streaming networks: (a) accuracy on DBLP citation net-
work from 1991 to 2010, (b) accuracy on PubMed Diabetes network for
15 time points, and (c) accuracy on extended DBLP citation network from
1991 to 2010. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6-8 The cumulative runtime on DBLP and PubMed Diabetes networks corre-
sponding to Fig. 6-5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6-9 Case study on DBLP citation network. . . . . . . . . . . . . . . . . . . . . 131
List of Tables
3.1 Comparison of the advantages and disadvantages of vector representation
vs. graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 NCI datasets used in experiments . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 DBLP dataset used in experiments . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Number of sub-graphs with respect to different sizes (i.e. number of edges) 49
4.1 DBLP dataset used in experiments. . . . . . . . . . . . . . . . . . . . . . . 76
6.1 Accuracy Results on Static Network. . . . . . . . . . . . . . . . . . . . . . 126
Abstract
The advancement of data acquisition and analysis technology has resulted in many real-
world data being dynamic and containing rich content and structured information. More
specifically, with the fast development of information technology, many real-world data
are featured with dynamic changes, such as new instances, new nodes and edges, and
modifications to the node content. Different from traditional data, which are represented
as feature vectors, data with complex relationships are often represented as graphs to denote
the content of the data entries and their structural relationships, where instances (nodes)
are not only characterized by their content but are also subject to dependency relationships.
In addition, real-time availability is one of the outstanding features of today's data. Real-
time analytics is dynamic analysis and reporting based on data entered into a system before
the actual time of use; it emphasizes deriving immediate knowledge from dynamic data
sources, such as data streams, so knowledge discovery and pattern mining now face complex,
dynamic data sources. However, how to combine structure information and node content
information for accurate and real-time data mining is still a big challenge. Accordingly,
this thesis focuses on real-time analytics for complex structure data. We explore instance
correlation in complex structure data and utilize it to make mining tasks more accurate
and applicable. To be specific, our objective is to combine node correlation with node
content and utilize them for three different tasks, including (1) graph stream classification,
(2) super-graph classification and clustering, and (3) streaming network node classification.
Understanding the roles of structured patterns for graph classification: the thesis first
reviews existing works on data mining from a complex-structure perspective. Then we
propose a graph factorization-based fine-grained representation model, whose main objec-
tive is to use linear combinations of a set of discriminative cliques to represent graphs
for learning. The optimization-oriented factorization approach ensures minimum informa-
tion loss for graph representation, and also avoids the expensive sub-graph isomorphism
validation process. Based on this idea, we propose a novel framework for fast graph stream
classification.
A new structure data classification algorithm: the second method introduces a new
super-graph classification and clustering problem. Due to the inherent complexity of the
structure representation, existing graph classification methods cannot be applied to super-
graph classification. In this thesis, we propose a weighted random walk kernel which cal-
culates the similarity between two super-graphs by assessing (a) the similarity between
super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key
contributions are: (1) a new super-node and super-graph structure to enrich existing graph
representations for real-world applications; (2) a weighted random walk kernel considering
node and structure similarities between graphs; (3) a mixed similarity considering struc-
tured content inside super-nodes and structural dependency between super-nodes; and (4)
an effective kernel-based super-graph classification method with a sound theoretical basis.
Empirical studies show that the proposed methods significantly outperform the state-of-
the-art methods.
Real-time analytics framework for dynamic complex structure data: for streaming net-
works, the essential challenge is to properly capture the dynamic evolution of the node
content and node interactions in order to support node classification. While streaming net-
works are dynamically evolving, for a short temporal period, a subset of salient features are
essentially tied to the network content and structures, and therefore can be used to charac-
terize the network for classification. To achieve this goal, we propose to carry out streaming
network feature selection (SNF) on the network, and use the selected features as a gauge to
classify unlabeled nodes. A Laplacian-based quality criterion is proposed to guide the node
classification, where the Laplacian matrix is generated from node labels and network
topology structures. Node classification is achieved by finding the class label that results in
the minimal gauging value with respect to the selected features. By frequently updating the
features selected from the network, node classification can quickly adapt to changes in
the network for maximal performance gain. Experiments and comparisons on real-world
networks demonstrate that SNOC is able to capture dynamics in the network structures and
node content, and outperforms baseline approaches with significant performance gain.
Chapter 1
Introduction
1.1 MOTIVATION
The advancement of data acquisition and analysis technology has resulted in many applica-
tions involving complex structure data; examples include Cheminformatics [65], Bioinfor-
matics [53], and Social Network Analysis (e.g. DBLP) [3]. Different from traditional data,
which are represented as feature vectors, data with structure relationships are often repre-
sented as graphs to preserve the content of the data entries and their relationships, where
instances (nodes) are not only characterized by the content but are also subject to depen-
dency relationships. For example, each node in a social network can denote one person and
links between nodes can represent their social interactions.
In reality, changes are essential components in real-world structure data, mainly be-
cause user participation, interactions, and responses to external factors continuously intro-
duce new nodes and edges to the data. In addition, a user may add/delete/modify online
posts, which naturally result in changes in the node content. As a result, data are inherently
dynamic. These dynamic changes may significantly influence the mining results. For ex-
ample, changes in users’ interest fields may result in the changes of their social groups or
friend circles. So real-time analytics on structure data can help us understand and capture
such changes and therefore improve mining performance.
Graph classification concerns the learning of a discriminative classifier, from training
data containing structure information, to classify previously unseen graph samples into spe-
cific categories, where the main challenge is to explore structure information in the training
data to build classifiers. One of the most common graph classification approaches is to
use sub-graph features to convert graphs into instance-feature representations, so that generic
learning algorithms can be applied for classification. Finding good sub-graph features is
therefore an important task for this type of learning approach. Common heuristics are to
find high-frequency sub-graphs or to apply feature selection measures, such as information
gain or the Gini index, to refine frequent sub-graphs. While all these methods have been
shown to be effective in the literature, due to the inherent complexity of graph data, the
process of finding sub-graphs is a non-trivial task and is always the most computationally
expensive procedure in the graph classification process. In addition, the genuine discrim-
inative power of sub-graph features has never been comprehensively studied. These obser-
vations raise the following concerns:
• The mining of sub-graphs involves graph matching operations, which are computa-
tionally demanding. Indeed, just testing whether a graph is a sub-graph of another
graph is an NP-complete problem [74]. It is very time-consuming to match patterns
in graph datasets, especially for sub-graphs with a large number of edges.
• We are interested in the discriminative power of sub-graphs. In other words, we
want to know how much the classification accuracy is affected by different properties
of the selected sub-graphs, such as the number of nodes, the number of edges, and
the size of the selected sub-graph set. To achieve this goal, a typical objective func-
tion, information gain (IG) [36], is used to test the discriminative power of attributes
(sub-graphs); a minimal sketch of this criterion is given after this list. Experiments
show that the objective function is neither monotonic nor anti-monotonic with respect
to the size of the sub-graphs [81]. In addition, the internal structural correlation
between sub-graphs can also impact the classification result. For instance, sub-graphs
with similar structures tend to have similar objective scores, whereas using highly
correlated sub-graphs for classification should be avoided, because it introduces
redundant features with high dependency, which may deteriorate the classification
accuracy. This fact can help to select better sub-graphs to represent graph datasets.
• Most existing graph classification methods are computationally expensive, so we
wonder whether there is an inexpensive way to achieve graph classification. More
specifically, if we use random sub-graphs to represent a graph dataset, how good will
the classification accuracy be compared to approaches that use sophisticated sub-
graph mining algorithms?
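To make the information gain criterion from the second concern concrete, here is a minimal
sketch (Python with hypothetical toy data, not code from the thesis) that computes the IG of
a single 0/1 sub-graph occurrence feature against binary class labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG of a binary sub-graph occurrence feature w.r.t. class labels.

    feature[i] = 1 if the candidate sub-graph occurs in graph i, else 0.
    """
    gain = entropy(labels)
    for value in (0, 1):
        mask = feature == value
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy data: occurrence of one sub-graph across six labeled graphs.
feature = np.array([1, 1, 1, 0, 0, 0])
labels = np.array([1, 1, 0, 0, 0, 1])
print(information_gain(feature, labels))  # higher = more discriminative
```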
Motivated by the above concerns, we carry out empirical studies on the roles of sub-
graph features for graph classification. We conduct experiments on four real-world graph
classification tasks, using three types of sub-graph features, including frequent sub-graphs,
frequent sub-graphs selected by information gain, and random sub-graphs, and using two
types of learning algorithms, namely Support Vector Machines and Nearest Neighbour. Our
experiments show that (1) the discriminative power of sub-graphs varies by their sizes;
(2) random sub-graphs have reasonably good performance; (3) the number of sub-graphs
is important to ensure good performance; (4) increasing the number of sub-graphs reduces
the difference between classifiers built from different sub-graphs; and (5) performance with
respect to different classifiers shows a similar trend. Our studies provide practical guidance
for designing effective sub-graph based graph classification methods.
From the empirical studies, we observe that, unlike traditional instance-feature
representations, graphs do not have explicit features, so an essential step in building a graph
classification model is to explore graph features to represent graph data in an instance-
feature space for effective learning [57]. The majority of existing graph classification mod-
els employ an occurrence-based feature representation model, in which a set of sub-graph
features is selected to represent the graph data by using the occurrence of the sub-graph in
the graph (either 0/1 occurrence or actual number of occurrences) as the feature value. In
this thesis, we refer to such an occurrence based representation model as coarse-grained
representation model.
Existing sub-graph based graph classification methods (coarse-grained representation
models) have a number of disadvantages, especially for graph stream classification. Com-
putational burden for isomorphism validation: because sub-graphs are selected as features
to represent the graph, the expensive sub-graph mining and isomorphism validation process
cannot be avoided (graph isomorphism is proven to be NP-hard) [19]. After the sub-graph
features have been selected, the isomorphism validation process has to be applied again to
map the graphs into the vector space. Moreover, because general learning methods prefer
low feature dimensions, a coarse-grained feature representation model has to limit the
number of sub-graphs used to represent the graph data. Severe and unbounded informa-
tion loss: because the coarse-grained representation does not characterize the degree of
closeness between a sub-graph pattern and the graph, it suffers information loss in repre-
senting the graph. Furthermore, for a graph stream with continuously growing volumes (i.e.
an increasing number of graphs) and changing structures (e.g. new nodes may appear in
incoming graphs), the disadvantage of the existing coarse-grained graph feature model is
even greater. This is because (1) stream volumes continuously increase, making it compu-
tationally expensive to explore sub-graph features, and (2) changes in the graph stream
(such as new structures) make the selected sub-graph features incapable of representing
new graphs. So a fine-grained graph feature model is needed to improve the classification
performance.
To build a novel fine-grained graph feature model, our main idea is to bypass the
expensive sub-graph mining process and use linear combinations of graph cliques to rep-
resent graphs. The linear combination ensures that the information loss in transforming
graphs into vectors is minimal. Compared to the traditional coarse-grained representation
model, a clear advantage of our method is that the fine-grained graph representation enjoys
a rigorous theoretical foundation that ensures minimal information loss for graph represen-
tation, and also avoids the expensive sub-graph isomorphism validation process, which is
proven to be NP-hard. To achieve this goal, our algorithm relies on two important steps to
extract the fine-grained graph representation: (1) finding a set of frequent graph cliques as
the base; and (2) using graph factorization to calculate a linear combination of graph cliques
that best represents a graph. After applying our fine-grained feature mapping model to
represent graphs, we can use any learning algorithm, such as Nearest Neighbour or Support
Vector Machines, for graph stream classification. Our method offers a number of advan-
tages, including a fast feature mining process, more precise graph representation, and better
performance for graph stream classification. Experiments on two real-world network graph
datasets demonstrate that our method outperforms state-of-the-art approaches in both clas-
sification accuracy and runtime efficiency.
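As an illustration of this idea, the following sketch assumes graphs and cliques are encoded
as binary edge-indicator vectors over a shared edge vocabulary, and uses non-negative least
squares in place of the full factorization model developed in Chapter 4 (all numbers are
hypothetical):

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical edge vocabulary of size 6; each column of C is the
# edge-indicator vector of one frequent clique (the "base").
C = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)

# Edge-indicator vector of the incoming graph to be represented.
g = np.array([1, 1, 1, 0, 1, 1], dtype=float)

# Fine-grained representation: coefficients w minimizing ||C w - g||_2,
# i.e. the clique combination that stays closest to the original graph.
w, residual = nnls(C, g)
print(w)         # the fine-grained feature vector used for learning
print(residual)  # the information loss of the representation
```

The residual makes the information loss of the mapping explicit, which is exactly the
quantity the coarse-grained 0/1 occurrence model leaves unbounded.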
Even though graph representation indeed preserves structure information to improve
classification accuracy, all existing frameworks that use graphs to represent objects rely
on one of two approaches to describe node content: (1) node as a single attribute: each
node has only one attribute (a single-attribute node). A clear drawback of this represen-
tation is that a single attribute cannot precisely describe the node content [40]. This repre-
sentation is commonly referred to as a single-attribute graph (Fig. 5-1 (A)). (2) node as a
set of attributes: a set of independent attributes describes the node content (Fig. 5-1
(B)). This representation is commonly referred to as an attributed graph [8, 14, 80]. How-
ever, with the development of information technology, data are becoming more and
more complex. Indeed, in many applications, the attributes/properties used to describe the
node content may themselves be subject to dependency structures. For example, in a citation
network each node represents one paper and edges denote citation relationships. It is insuf-
ficient to use one or multiple independent attributes to describe the detailed information of
a paper. Instead, we can represent the content of each paper as a graph, with nodes denoting
keywords and edges representing contextual correlations between keywords (e.g. co-occur-
rence of keywords in different sentences or paragraphs). As a result, each paper (a super-
node) and all references cited in the paper can form a super-graph, with each edge between
papers denoting their citation relationship.
To build learning models for super-graphs, the main challenge is to properly calculate
the distance between two super-graphs:
• Similarity between two super-nodes: because each super-node is a graph, the over-
lapped/intersected graph structure between two super-nodes reveals the similarity
between the two super-nodes, as well as the relationship between the two super-graphs.
The traditional hard-node-matching mechanism is unsuitable for super-graphs, which
require soft node matching.
• Similarity between two super-graphs: the complex structure of super-graphs re-
quires that the similarity measure consider not only the structure similarity, but also
the super-node similarity between two super-graphs. This cannot be achieved with-
out combining node matching and graph matching as a whole to assess the similarity
between super-graphs.
The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK)
for super-graphs. In this thesis, we generate a new product graph from two super-graphs
and then use weighted random walks on the product graph to calculate the similarity between
the super-graphs. A weighted random walk denotes a walk starting from a random weighted
node and following succeeding weighted nodes and edges in a random manner. The weight
of a node in the product graph denotes the similarity of the two super-nodes it joins. Given
a set of labeled super-graphs, we can use the weighted product graph to establish walk-based
relationships between two super-graphs and calculate their similarities. After that, we can
obtain the kernel matrix for super-graph classification.
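The sketch below illustrates the underlying walk-counting machinery under simplifying
assumptions: W is the adjacency matrix of the product graph, each product-graph node
carries a weight encoding the similarity of the two super-nodes it joins, and all walks are
summed with a geometric decay factor. It shows the generic weighted random-walk kernel
computation, not the exact WRWK formulation of Chapter 5:

```python
import numpy as np

def weighted_walk_kernel(W, node_weights, lam=0.1):
    """Geometric random-walk kernel on a (product) graph.

    W            : (n, n) adjacency matrix of the product graph
    node_weights : (n,) weights of product-graph nodes; for super-graphs,
                   the similarity of the two super-nodes forming each node
    lam          : decay factor; lam * spectral_radius(W) < 1 is required
                   so that the walk series sum_k lam^k W^k converges
    """
    n = W.shape[0]
    p = node_weights / node_weights.sum()      # weighted start distribution
    # Closed form of (sum_{k>=0} lam^k W^k) @ p via a linear solve.
    x = np.linalg.solve(np.eye(n) - lam * W, p)
    return float(node_weights @ x)             # weighted stop distribution

# Toy product graph with three nodes (hypothetical numbers).
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
weights = np.array([0.9, 0.5, 0.7])            # super-node similarities
print(weighted_walk_kernel(W, weights))
```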
Recent years have also witnessed an increasing number of applications involving net-
worked data, where instances are not only characterized by the content but are also subject
to dependency relationships. The mixed node content and structure information raise many
unique data mining tasks, such as network node classification [3]. In reality, changes are
essential components in real-world networks, mainly because user participation, interac-
tions, and responses to external factors continuously introduce new nodes and edges to the
network. In addition, a user may add/delete/modify online posts, which naturally result in
changes in the node content. As a result, the networks are inherently dynamic. Accurate
node classification in a streaming network setting is therefore much more challenging than
in static networks. In summary, node classification in streaming networks has at least three
major challenges:
• Streaming network structures: Network structures encode rich information about
node interactions inside the network, which should be considered for node classifi-
cation. In streaming networks, structures are constantly changing, so node classi-
fication needs to rapidly capture and adapt to such changes for maximal accuracy
gain.
• Streaming node features: for each node in a streaming network, the content may
constantly evolve (e.g. user posts or profile updates). As a result, the feature space
used to denote the node content is dynamically changing, resulting in streaming fea-
tures [78] with an infinite feature space. To capture changes, a feature selection method
should promptly select the most effective features to ensure that node classification can
quickly adapt to the new network.
• Unlimited network node space: the node volume of a streaming network is dynam-
ically increasing, resulting in an unlimited network node space and nodes that have
never appeared in the network before. Node classification needs to scale to the dy-
namically increasing node volume and incrementally update models discovered from
historical data to accurately classify new nodes.
To achieve high node classification accuracy, a fundamental issue is to properly char-
acterize such changes. One possible solution is to use features to capture changes in
streaming networks for node classification. For streaming networks, changes are intro-
duced through two major channels: (1) node content; and (2) topology structures. Because,
in a networked world, nodes close to each other in the network structure space tend to
share common content information [26], we can use the selected features to design a “similar-
ity gauging” procedure that assesses the consistency of the network node content and structures
to determine the labels of unlabeled nodes. A smaller gauging value indicates that the node
content and structures have a better alignment with the node label. The gauging-based
classification is thus carried out such that, for an unlabeled node, its label is the class which
results in the minimal gauging value with respect to the identified features. By updating
the selected features, the node classification can automatically adapt to changes in the
streaming network for maximal accuracy gain. Accordingly, our research proposes a
node classification method for networked data, and then a streaming feature selection
approach for dynamically drifting networks.
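A minimal sketch of the gauging idea follows, assuming a one-vs-rest label indicator and the
unnormalized graph Laplacian (the actual criterion also incorporates the selected features;
the network and numbers here are hypothetical):

```python
import numpy as np

def classify_node(A, labels, node, classes):
    """Pick the label for `node` that minimizes the Laplacian gauge x^T L x.

    A      : (n, n) symmetric adjacency matrix of the network
    labels : length-n array of class labels; -1 marks unlabeled nodes
    """
    L = np.diag(A.sum(axis=1)) - A       # unnormalized graph Laplacian
    best, best_score = None, np.inf
    for c in classes:
        x = (labels == c).astype(float)  # one-vs-rest indicator vector
        x[node] = 1.0                    # hypothesize label c for the node
        score = x @ L @ x                # sum of (x_i - x_j)^2 over edges:
        if score < best_score:           #   small = consistent with topology
            best, best_score = c, score
    return best

# Toy network: node 4 is unlabeled and links to nodes 2 and 3 (class 1).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1, -1])
print(classify_node(A, labels, node=4, classes=[0, 1]))  # -> 1
```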
In summary, taking complex structure information into account helps improve min-
ing performance on structure data. Moreover, exploiting changes in a dynamic setting is
a promising way to further improve mining accuracy. In this thesis, we investigate
different types of complex structure data and propose several methods to solve the above
challenges.
1.2 CONTRIBUTIONS
This thesis focuses on exploring and utilizing structure information to solve some important
problems in data mining. We list our contributions to each of them below:
• Exploring the roles of sub-graph features for graph classification: we carry out
empirical studies on four real-world graph classification tasks, using three types
of sub-graph features, including frequent sub-graphs, frequent sub-graphs selected by
information gain, and random sub-graphs, and using two types of learning algorithms,
namely Support Vector Machines and Nearest Neighbour. Our experiments show that
(1) the discriminative power of sub-graphs varies by their sizes; (2) random sub-graphs
have reasonably good performance; (3) the number of sub-graphs is important to ensure
good performance; (4) increasing the number of sub-graphs reduces the difference
between classifiers built from different sub-graphs; and (5) performance with respect
to different classifiers shows a similar trend. Our studies provide practical guidance
for designing effective sub-graph based graph classification methods.
• Exploring clique information and using a factorization method for fast graph
stream classification: we propose a fine-grained graph factorization framework for
efficient graph stream classification. Being fine-grained, our mapping framework
relies on a set of discriminative frequent cliques, instead of important sub-graph
patterns, to represent graphs. Such a fine-grained representation ensures that the
final instance-feature representation is sufficiently close to the original graph, with
a theoretical guarantee. Our algorithm relies on two important steps to extract the
fine-grained graph representation: (1) finding a set of frequent graph cliques as the
base; and (2) using graph factorization to calculate a linear combination of the graph
cliques that best represents a graph. Compared to the traditional coarse-grained rep-
resentation model, a clear advantage of our method is that the fine-grained graph
representation enjoys a rigorous theoretical foundation that ensures minimal infor-
mation loss for graph representation, and also avoids the expensive sub-graph iso-
morphism validation process.
• Exploring inner-structure information for super-graph classification: in this the-
sis, we introduce a special type of graph, where the content of a node can itself be
represented as a graph, as a “super-graph”. Likewise, we refer to a node whose
content is represented as a graph as a “super-node”. To build learning models for
super-graphs, the main challenge is to properly calculate the distance between two
super-graphs.
(1) Similarity between two super-nodes: because each super-node is a graph, the
overlapped/intersected graph structure between two super-nodes reveals the similar-
ity between the two super-nodes, as well as the relationship between the two super-
graphs. The traditional hard-node-matching mechanism is unsuitable for super-graphs,
which require soft node matching.
(2) Similarity between two super-graphs: the complex structure of super-graphs
requires that the similarity measure consider not only the structure similarity, but
also the node similarity between two super-graphs. This cannot be achieved with-
out combining node matching and graph matching as a whole to assess the similarity
between super-graphs.
These challenges motivate the proposed Weighted Random Walk Kernel for super-
graphs. We propose a weighted random walk kernel which calculates the similarity
between two super-graphs by assessing (a) the similarity between super-nodes of the
super-graphs, and (b) the common walks of the super-graphs. Our key contribution
is twofold: (1) a weighted random walk kernel considering node and structure simi-
larities between graphs; and (2) an effective kernel-based super-graph classification
method with a sound theoretical basis.
• Exploring manifold information for streaming network node classification: we
propose a novel node classification method for streaming networks. Our method
takes network structure and node labels into consideration to find an optimal subset
of features to represent the network. Based on the selected features, a streaming
network node classification method, SNOC, is proposed to classify unlabeled nodes
by minimizing the similarity distance in the network and the feature-based distance
between nodes. The main contribution, compared to existing works, is twofold:
(1) Streaming network node classification: we propose a new streaming network
node classification (SNOC) method that takes node content and structure similarity
into consideration to find important features to model changes in the network for
node classification. This method is not only more accurate than existing node clas-
sification approaches, but also effective in capturing changes in networks for node
classification.
(2) Streaming network feature selection: we introduce a novel streaming net-
work feature selection framework, SNF, for streaming networks. To ensure feature
evaluation can promptly adapt to changes in the network, SNF incrementally updates
the evaluation score of an existing feature by accumulating changes in the network.
This allows our method to effectively handle streaming networks with changing fea-
ture spaces and feature distributions for better runtime and performance gains.
1.3 PUBLICATIONS
• Ting Guo, Zhanshan Li, and Xingquan Zhu. Large Scale Diagnosis Using Associ-
ations between System Outputs and Components. Proceedings of the Twenty-Fifth
AAAI Conference on Artificial Intelligence, 2011, pp.1786-1787.
• Ting Guo and Xingquan Zhu. Understanding the Roles of Sub-graph Features for
Graph Classification: An Empirical Study Perspective. Proceedings of the 22nd ACM
International Conference on Information & Knowledge Management, 2013, pp. 817-
822.
• Ting Guo, Lianhua Chi, and Xingquan Zhu. Graph Hashing and Factorization for Fast
Graph Stream Classification. Proceedings of the 22nd ACM International Conference
on Information & Knowledge Management, 2013, pp. 1607-1612.
• Ting Guo and Xingquan Zhu. Super-Graph Classification. Proceedings of Advances
in Knowledge Discovery and Data Mining, 2014, pp. 323-336.
• Ting Guo, Xingquan Zhu, Jian Pei and Chengqi Zhang. SNOC: Streaming Network
Node Classification. Proceedings of the 14th IEEE International Conference on Data
Mining, 2014.
1.4 THESIS STRUCTURE
The rest of this thesis is organized as follows:
Chapter 2: This chapter surveys existing works on mining structure data. It summarizes
major approaches in the field, along with their technical strengths/weaknesses.
Chapter 3: In this chapter, we carry out empirical studies on four real-world graph clas-
sification tasks, by using three types of sub-graph features, including frequent sub-graphs,
frequent sub-graph selected by using information gain, and random sub-graphs. The two
types of learning algorithms include Support Vector Machines and Nearest Neighbour. Our
studies provide a practical guidance for designing effective sub-graph based graph classifi-
cation methods.
Chapter 4: We propose a fine-grained graph factorization approach for Fast Graph
Stream Classification (FGSC). A graph clique mining procedure is given, and the factor-
ization algorithm is built on the clique mining result. Experiments demonstrate that the
proposed method outperforms state-of-the-art approaches in both classification accuracy
and efficiency.
Chapter 5: In this chapter, we formulate a new super-graph classification task where
each node of the super-graph may contain a graph. To support super-graph classification,
we propose a Weighted Random Walk Kernel (WRWK) with sound theoretical properties,
including bounded similarity. Experiments confirm that our method significantly outper-
forms baseline approaches.
Chapter 6: This chapter introduces a new classification method for streaming networks,
namely streaming network node classification (SNOC). It provides algorithm details, theo-
retical proofs, time complexity analysis, and experiments.
Chapter 7: This chapter concludes this thesis and outlines directions for future work.
Chapter 2
Literature Review
This literature review provides an in-depth study of existing data mining methods
for complex structure data. Our main objective is to (1) summarize and categorize graph
classification methods, graph clustering methods, and their extensions to streaming structure
data; and (2) compare and analyze the strengths and deficiencies of existing approaches.
First, we introduce some basic definitions.
2.1 PRELIMINARY
DEFINITION 1 Graph: A graph G is a set of vertices (nodes) v connected by edges (links)
e; thus G = (v, e).
DEFINITION 2 Vertex (Node): A node v is a terminal point or an intersection point of a
graph. It is the abstraction of a location such as a city, an administrative division, a road
intersection or a transport terminal (stations, terminuses, harbors and airports).
DEFINITION 3 Edge (Link): An edge e is a link between two nodes. The link (i, j) is
of initial extremity i and of terminal extremity j. A link is the abstraction of a transport
infrastructure supporting movements between nodes. It has a direction that is commonly
represented as an arrow. When an arrow is not used, it is assumed the link is bi-directional.
DEFINITION 4 Sub-Graph: A sub-graph G′ = (v′, e′) of a graph G = (v, e) is a graph
whose vertices and edges are subsets of those of G (v′ ⊆ v, e′ ⊆ e). Unless the global
transport system is considered as a whole, every transport network is in theory a sub-graph
of another. For instance, the road transportation network of a city is a sub-graph of a
regional transportation network, which is itself a sub-graph of a national transportation
network.
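As a concrete anchor for these definitions, the following sketch (hypothetical Python, not
code from the thesis) stores a labeled graph as a node-label map plus an undirected edge
set, and extracts the sub-graph induced by a vertex subset:

```python
class Graph:
    """G = (v, e): labeled nodes and undirected edges."""

    def __init__(self):
        self.nodes = {}      # node id -> label
        self.edges = set()   # frozenset({i, j}) for an undirected edge

    def add_node(self, i, label):
        self.nodes[i] = label

    def add_edge(self, i, j):
        self.edges.add(frozenset((i, j)))

    def induced_subgraph(self, vertex_subset):
        """Sub-graph G' = (v', e') induced by a vertex subset v' of v."""
        sub = Graph()
        for i in vertex_subset:
            sub.add_node(i, self.nodes[i])
        for edge in self.edges:
            if edge <= set(vertex_subset):
                i, j = tuple(edge)
                sub.add_edge(i, j)
        return sub

g = Graph()
for i, lab in enumerate("CCOH"):
    g.add_node(i, lab)           # e.g. atoms of a chemical compound
for i, j in [(0, 1), (1, 2), (1, 3)]:
    g.add_edge(i, j)             # e.g. chemical bonds
print(g.induced_subgraph([0, 1, 2]).edges)
```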
2.2 GRAPH CLASSIFICATION
Given a collection of training samples, each of which is suitably labeled with a class label,
classification task is to build a learning model to automatically assign a previously unseen
example into a specific category (or class). For data with dependency structure, such as
the ones shown in Figure 2-1¹, building a learning model to classify them into respective
categories has proven to be a challenging task. This is mainly attributed to the fact that graph
data normally do not have features immediately available to support learning, whereas
nearly all existing supervised learning methods require training data to be represented
in a tabular instance-feature format.
To support graph classification, one of the fundamental challenges is to charac-
terize graphs in tabular feature formats, so that generic supervised learning algorithms
can be used to derive learning models from graphs. The three most popular approaches
are (1) sub-graph feature based methods; (2) global structure based methods, such as graph
edit distance and graph embedding; and (3) graph kernel methods.
Figure 2-1: Graph examples collected from different domains.¹

¹In Fig. 2-1, the upper ones are the original representations and the bottom ones are the
corresponding transferred graphs: (a) protein data (each node denotes a local amino acid
region with special secondary structure and each edge denotes the nearest neighbour in
space); (b) a graph collected from chemical compound data (the 3D structure can be trans-
ferred into a graph by using molecules connected with chemical bonds); (c) a graph collected
from the DBLP citation dataset, where each node denotes a paper ID or a keyword and each
edge denotes the citation relationship between papers or keywords appearing in the paper's
title. The label of the graph (y) can be used for training a learning model for graph clas-
sification.

Sub-graph feature based methods use substructure patterns (sub-graphs) as features to
represent each graph as a feature vector, so that a training graph set can be converted into a
generic training instance set for learning. Sub-graph based approaches are useful in many
graph-related tasks, including discriminating different groups of graphs, classifying and
clustering graphs, and building graph indices in vector spaces. An inherent advantage of
embedding graphs into vector space is that it makes existing algorithmic tools developed
for feature based object representations available for graph structure data. For sub-graph
feature based methods, one key challenge is how to select discriminative sub-graphs for
mapping graphs into the vector space, which can help improve the classification accuracy.
To achieve this goal, several objective functions are used to test the discriminative power of
attributes (sub-graphs), such as frequency, Information Gain (IG) [36], Fisher Score [23],
and Laplacian Score [42]. Because sub-graph features can help convert each single graph
into a vector representation, graph classification can be achieved by using popular learning
algorithms such as Decision Trees [36] and Support Vector Machines (SVM) [16].
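To illustrate this pipeline end to end, the following sketch assumes the sub-graph occurrence
matrix has already been computed (toy hypothetical numbers); once graphs are embedded in
vector space, any generic learner applies:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical occurrence matrix: X[i, k] = 1 iff sub-graph feature k
# occurs in training graph i (the output of the mapping step above).
X = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 0]])
y = np.array([1, 1, 0, 0])   # graph class labels

# Train a generic classifier on the vectorized graphs.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[1, 0, 1, 0]]))   # classify an unseen graph's vector
```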
Global structure based methods apply inexact graph matching, in which error cor-
rection is made part of the matching process. Central to this approach is measuring the
similarity of pairwise graphs, which can be done in many ways. One measure, which
has garnered particular interest because it is tolerant to noise and distortion, is the graph
edit distance (GED), defined as the cost of the least expensive sequence of edit operations
needed to transform one graph into another [24]. GED algorithms are influenced
considerably by the cost functions associated with edit operations: the GED between pair-
wise graphs changes as the cost functions change, and its validity depends on the
rationality of the cost function definition [6, 17].
Graph kernel methods use kernels to calculate the similarity between graphs. Various
graph kernels, such as subtree kernels [63], shortest-path kernels [5], joint ker-
nels [30], and cyclic pattern kernels [31], have been proposed. All these methods have been
shown to be effective, but they incur high computational overhead. Geng Li et al. propose an
alternative approach based on feature vectors constructed from different global topologi-
cal attributes [47]. A dissimilarity space embedding graph kernel based on graph edit
distance has also been proposed [38], where the edit distance can be calculated by the
Lipschitz embedding method [39]. Experiments show that classification accuracy can be
statistically significantly enhanced by using graph edit distance with prototype reduction
and dimensionality reduction [7]. Vogelstein et al. develop a set of signal-subgraph esti-
mators for graph classification [76]; these estimators can be considered local sparse and
low-rank matrix decompositions.
2.3 FREQUENT SUB-GRAPH MINING (FSM)
Sub-graph patterns have been proved a good representation for closing the gap between
structural and statistical pattern classification. Selecting features, in the form of sub-graph
patterns, from graph data is a well established area in graph data mining, where early
methods often use Apriori-based Graph Mining (AGM) to identify frequent induced sub-
graphs [33]. Evaluation of AGM on chemical carcinogenesis data demonstrated that it is
more efficient than an inductive logic programming based approach combined with a level-
wise search. Meanwhile, Kuramochi and Karypis develop an FSG [45] method which uses
Breath First Search (BFS) strategy to grow candidates whereby pairs of identified frequent
k sub-graphs are joined to generate (k + 1) sub-graphs. FSG uses a canonical labeling
method for graph comparison and calculates the support of the patterns using a vertical
transaction list data representation. Experiments show that FSG is inefficient when graphs
contain many vertexes and edges that have identical labels because the join operation used
by FSG allows multiple automorphism of single or multiple cores. In summary, AGM and
FSG are all Apriori-based frequent sub-graph mining algorithms.
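The join-and-prune control flow shared by AGM and FSG can be sketched as follows; as a
simplification, graphs are treated as sets of labeled edges and candidates as edge subsets,
which sidesteps the connectivity and isomorphism checks that real FSM algorithms must
perform:

```python
from itertools import combinations

def frequent_edge_sets(graphs, min_support, max_size=3):
    """Level-wise (Apriori-style) mining of frequent edge sets."""
    def support(pattern):
        return sum(pattern <= g for g in graphs)

    # Level 1: frequent single edges.
    edges = {e for g in graphs for e in g}
    level = [frozenset([e]) for e in edges
             if support(frozenset([e])) >= min_support]
    frequent = list(level)
    for size in range(2, max_size + 1):
        # Join step: merge patterns whose union has exactly `size` edges.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        # Prune step: keep only candidates meeting the support threshold.
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)
        if not level:
            break
    return frequent

g1 = {("A", "B"), ("B", "C"), ("C", "D")}
g2 = {("A", "B"), ("B", "C")}
g3 = {("A", "B"), ("C", "D")}
print(frequent_edge_sets([g1, g2, g3], min_support=2))
```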
In Apriori-based frequent sub-graph mining algorithms, joining two size-k frequent
sub-graphs to generate size-(k + 1) graph candidates is time-consuming and computation-
ally ineffective [28]. To improve efficiency, several pattern-growth-based graph mining
approaches have been proposed. FFSM [32] extends graphs from a single sub-graph
directly: for each discovered graph G, FFSM recursively adds new edges until all frequent
super-graphs of G are discovered, and the recursion stops when no more frequent graphs
can be generated. Because a pattern-growth algorithm extends a frequent graph pattern
by adding a new edge with every possible label, there is a potential risk that the same
graph pattern may be generated more than once [28]. The gSpan algorithm [82] solves
this problem by using a right-most extension technique. It uses Depth-First Search (DFS)
lexicographic ordering to construct a tree-like lattice over all possible graph patterns. The
search tree is traversed in a DFS manner, and all graph patterns with non-minimal DFS
codes are pruned so that redundant candidate generation is avoided. gSpan is arguably
the most frequently cited FSM algorithm.
2.4 SUB-GRAPH FEATURE SELECTION
To find discriminative sub-graph features for graph classification, common methods
first discover a complete set of sub-graph patterns by using a minimum support threshold
or other parameters such as a sub-graph size limit [13], and then use a feature selection
criterion (such as Information Gain [12, 36]) to select a small set of discriminative sub-
graphs as features. Such a two-step sub-graph feature selection approach is considered
ineffective, and several methods have been proposed to directly generate a compact set of
discriminative sub-graphs during the frequent sub-graph mining process. Kudo et al. [43, 59]
propose a boosting based method, where a boosting algorithm repeatedly constructs multi-
ple weak classifiers on weighted training graphs and each weak classifier is, in fact, a single
sub-graph. Yan et al. propose to mine significant sub-graphs by using biased search, which
exploits the correlation between structural similarity and significance similarity [81]. Ranu
and Singh propose a scalable method, GraphSig, to mine significant sub-graphs based on
a feature vector representation of graphs [56]. GraphSig uses domain knowledge to select
a meaningful feature set, and prior probabilities of features are used to evaluate the signifi-
cance of sub-graphs in the feature space. This strategy can use existing frequent sub-graph
mining techniques to mine significant patterns in a scalable manner. Saigo et al. propose
an iterative sub-graph mining method based on partial least squares regression, named gPLS
[58]. This method is efficient because the weight vector is updated with elementary matrix
calculations. To support real-time graph queries, some researchers index a small set of
sub-graphs to enable scalable querying of graph databases [57, 67, 83, 84]. Sun et al. intro-
duce a graph search method for sub-graph queries based on sub-graph frequencies. In
addition, evolutionary computation has also been introduced into discriminative sub-graph
mining (GAIA) [35].
The above sub-graph feature selection methods require the graph data to be fully la-
beled, whereas labeling graph data may incur expensive costs. An alternative solution is
to combine both labeled and unlabeled graphs for sub-graph feature selection. Kong
et al. integrate active learning theory into feature and sample selection for graph classifi-
cation [41]. They maximize the dependency between sub-graph patterns and graph labels
by using an active learning framework and search for the optimal graph to query for a label.
In addition, a feature evaluation criterion, gSemi, has been derived to estimate the significance
of sub-graphs based on both labeled and unlabeled graphs [42], combining semi-supervised
feature selection and sub-graph feature mining. For large graphs, a technique called
D-walks (discriminative random walks) has been developed for semi-supervised classifica-
tion [9]; the class of an unlabeled node can be predicted by maximizing its betweenness
score with respect to labeled data. For applications without negative graph samples, PU
learning for graph classification has been developed, focusing on selecting useful sub-graph
features based on positive and unlabeled graph data only [87].
2.5 DATA STREAM MINING
Our problem is also related to data stream mining. An initial study of data stream mining
was proposed in [21], in which incremental learning was used to tackle the increasing
volume of the data stream. Ensemble learning-based methods [69, 77] have also been used
to address concept drift and data distribution changes in data streams. All these frameworks
only work for generic data with instance-feature representations, and no features are
immediately available for graph streams for learning and classification.
Several studies have investigated the graph stream classification problem. To the best
of our knowledge, only a few works [1, 15, 46] are directly related to our problem. These
methods apply hashing techniques to sketch the graph stream to save computational cost
and control the size of the subgraph-pattern set. In [1], Aggarwal proposes a 2-D random
edge hashing scheme to construct an “in-memory” summary for sequentially presented
graphs and uses a simple heuristic to select a set of most discriminative frequent patterns
for building a rule-based classifier. Although this method has demonstrated promising performance on graph stream classification, it has two inherent limitations: (1) the selected sub-graphs may contain disconnected edges, which have less discriminative capability than connected sub-graph patterns because they lack structural meaning; and (2) a frequent pattern mining process must be performed on a summary table comprising massive transactions, resulting in high computational cost. In [15], a clique hashing strategy is used to improve computational efficiency for graph stream tasks, but the authors still use the traditional coarse-grained graph representation, which causes severe information loss in the mapping process and further decreases classification accuracy. In [46], the authors propose a hash kernel to project arbitrary graphs onto a compatible feature space for similarity computation, but this technique can only be applied to node-attributed graphs.
2.6 REAL-TIME ANALYSIS
As mentioned above, real-time analysis aims to capture the changes in each time period and use them to improve mining performance. Zhu and Shasha have proposed techniques, based on the discrete Fourier transform, to compute statistical measures over time-series data streams [90]. Their system, called StatStream, is able to compute approximate, error-bounded correlations and inner products over an arbitrarily chosen sliding window. Lin et al. have proposed the use of a symbolic representation of time-series data streams [48]. This representation allows dimensionality/numerosity reduction, and they have demonstrated its applicability to clustering, classification, indexing, and anomaly detection. Chen et al. have proposed the application of so-called regression cubes to data streams [11]. Owing to the success of OLAP technology in applications over static stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that can answer aggregate queries over the incoming streams. This research has been extended for adoption in an ongoing project, Mining Alarming Incidents in Data Streams (MAIDS). Himberg et al. have presented and analyzed randomized variations of segmenting time-series data streams generated onboard mobile phone sensors [29]. One application of clustering time series discussed in that work is changing the user interface of the mobile phone screen according to the user context. This study shows that Global Iterative Replacement provides an approximately optimal solution with high runtime efficiency.
2.7 ROADMAP
The overall roadmap of this thesis is given in Fig. 2-2.
Figure 2-2: The overall roadmap of this thesis.
Chapter 3
Understanding the Roles of Sub-graph
Features for Graph Classification: An
Empirical Study Perspective
3.1 INTRODUCTION
The advancement of data acquisition and analysis technology has resulted in many appli-
cations involving complex structure data. Examples include Cheminformatics, Bioinfor-
matics, and Social Network Analysis. Different from traditional data that are represented
as feature vectors, the new data are often represented as graphs to denote the content of
the data entries and their structural relationships. This has resulted in an increasing interest
in graph classification, which tries to learn a classification model from a number of labeled graphs to separate previously unseen graphs into different categories. This problem can, in general, be divided into two large branches: (1) classification of nodes (or edges) in a single large graph, such as a social network; and (2) classification of a set of small graphs, as in drug activity prediction. This experimental study focuses on the latter, i.e., the classification of small graphs.
Although graph representation has some inherent advantages, for example, that the number of nodes and edges can vary to best capture the complex relationships between objects, the disadvantage is also obvious: a lack of learning algorithms able to support graph-structured data.

Figure 3-1: An example of sub-graph pattern representation. The left panel shows two graphs, G1 and G2, and the right panel gives the two indicator vectors showing whether a sub-graph exists in the graphs.

Finding effective representations for graph data is therefore a major challenge for graph classification. One possible solution is to define some graph kernels [37]
that use graph structures, such as topology, paths, and edge labels of the graphs, to calculate
distance between a pair of graphs, and then use the distance to formulate a generic learning
algorithm. The second major solution is using substructure patterns (i.e. sub-graphs) as
features to represent each graph as a feature vector. So a training graph set can be con-
verted into a generic training instance set for learning. Fig. 3-1 shows a sub-graph pattern
representation example. More specifically, sub-graphs g1, g2, and g3 are used to convert graphs G1 and G2 into feature vectors. Depending on whether g1, g2, and g3 appear in each respective graph, say G1, a binary feature vector X1 = [1, 1, 0, ...]T is created to represent G1. Sub-graph-based approaches are useful in many graph-related tasks, including discriminating different groups of graphs, classifying and clustering graphs, and building graph indices in vector spaces. An inherent advantage of embedding graphs into a vector space is that it makes existing algorithmic tools developed for feature-based object representations available for graph-structured data.
In this chapter, we mainly focus on sub-graph pattern mining based graph classifica-
tion. Frequent Sub-graph Mining (FSM) has emerged as a key sub-graph mining problem,
mainly because sub-graph patterns are meaningful tokens to effectively map graphs into
vector space for clustering or classification. Given a graph dataset D = {G0, G1, ..., Gn}, let Supg denote the number of graphs in D that include g as a sub-graph. The objective of FSM is to find, from D, all frequent sub-graphs whose number of occurrences is at least a specified threshold minsup (i.e., Supg ≥ minsup). Existing
FSM methods can be roughly divided into two categories: (i) Apriori-based approaches,
and (ii) pattern growth-based approaches. The Apriori-based approaches carry out pattern
mining in a generate-and-test manner by using a Breadth First Search (BFS) strategy to
explore the sub-graph lattice of the given database. Three well-established Apriori-based FSM algorithms are AGM, FSG, and DPMine [33, 22, 45, 75]. Pattern growth-based
approaches adopt a Depth First Search (DFS) strategy where, for each discovered sub-graph
g, the sub-graph is extended recursively until all frequent super-graphs of g are discovered.
FSM algorithms that adopt a DFS strategy tend to need less memory because they traverse
the lattice of all possible frequent sub-graphs in a DFS manner. Two well-known example
algorithms are gSpan and FFSM [32, 82].
Given a set of frequent sub-graphs g1, g2, ..., gd, a graph Gx can then be represented as
a feature vector X = [x1, x2, ..., xd]T , where xi = 1 if gi ⊆ Gx; otherwise, xi = 0 [81].
Existing research has demonstrated that such vectorization helps to build efficient indices
to support fast graph search [83]. In addition, this representation can also help to preserve
some basic structural information for graph data analysis.
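For illustration, the following is a minimal Python sketch of this vectorization (not from the original text); is_subgraph is a hypothetical helper standing in for a sub-graph isomorphism test, such as the one sketched after Definition 9.

import numpy as np

# Map a graph G to a 0/1 vector over a list of sub-graph features.
# `is_subgraph(g, G)` is a hypothetical containment test supplied by the caller.
def to_feature_vector(G, subgraph_features, is_subgraph):
    return np.array([1 if is_subgraph(g, G) else 0 for g in subgraph_features])

# e.g. X1 = to_feature_vector(G1, [g1, g2, g3], is_subgraph)  ->  [1, 1, 0, ...]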
Several reasons motivate the proposed empirical studies on frequent sub-graphs for
graph classification. Firstly, the mining of the sub-graphs involves graph matching op-
erations that are computationally demanding. Indeed, just testing whether a graph is a
sub-graph of another graph is an NP-complete problem [74]. In Fig. 3-2, we report the
sub-graph mining runtime with respect to an increasing number of edges of the sub-graphs
for a given graph set. The results show that it is very time-consuming to match patterns in
graph datasets, especially for sub-graphs with a large number of edges. Secondly, we are interested in the discriminative power of sub-graphs. In other words, we want to know how much the classification accuracy is affected by different properties of the selected sub-graphs, such as the number of nodes, the number of edges, and the size of the selected sub-graph set.

Figure 3-2: The runtime of frequent sub-graph pattern mining with respect to the increasing number of edges of sub-graphs.
To achieve this goal, a typical objective function, information gain (IG) [36], is used to test the discriminative power of attributes (sub-graphs). Experiments show that the objective function is neither monotonic nor anti-monotonic with respect to the size of the sub-graphs [81]. In addition, the internal structural correlation between sub-graphs can also impact the classification result. For instance, sub-graphs with similar structures tend to have similar objective scores; using highly correlated sub-graphs for classification should nevertheless be avoided, because it introduces redundant features with high dependency that may deteriorate the classification accuracy. This fact can help to select better sub-graphs to represent graph datasets. Thirdly, because most existing graph classification methods are computationally expensive, we wonder whether there is an inexpensive way to achieve graph classification. More specifically, if we use some random sub-graphs to represent a graph dataset, how good will the classification accuracy be, compared to the approaches that use sophisticated sub-graph mining algorithms?
In this chapter, we present an empirical comparison of graph classification based on
different sets of sub-graph features, including frequent sub-graphs, random sub-graphs,
and sub-graphs selected using the information gain measure. The comparisons are carried out by using different numbers of sub-graphs, and sub-graphs with different numbers of edges. Experiments
are performed on seven real-world graph datasets from three domains, by using two popular
learning algorithms including support vector machines and nearest neighbour. The studies
show that using random sub-graph features has a reasonably good performance compared
to its expensive peers which use frequent sub-graphs. Experiments also show that sub-
graphs with many edges are not necessarily good candidates for graph classification.
The remainder of the chapter is organized as follows. Section 3.2 formulates the prob-
lem and defines some important notations. Experimental studies are reported in Section 3.3.
3.2 PROBLEM FORMULATION
In this section, we introduce several basic concepts related to graph representation, frequent
sub-graph mining, and graph classification.
3.2.1 Graph and Sub-graph
DEFINITION 5 Graph: Let LV and LE be finite label sets for nodes and edges, respectively. A graph is a tuple g = (V, E, μ, ν), where:
• V denotes a finite set of nodes,
• E ⊆ V × V is the set of edges,
• μ : V → LV is the node labeling function, and
• ν : E → LE is the edge labeling function.
The number of nodes of a graph g is denoted by |g|, while D represents the set of all
graphs over the label alphabets LV and LE .
DEFINITION 6 Sub-graph: Let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. Graph g1 is a sub-graph of g2, denoted by g1 ⊆ g2, if
• V1 ⊆ V2,
• E1 ⊆ E2,
• μ1(v) = μ2(v) for all v ∈ V1, and
• ν1(e) = ν2(e) for all e ∈ E1.
An example is shown in Fig. 3-3, where the label of each node is color coded (the same
color means the same label). Graph (b) is a sub-graph of (a).
Figure 3-3: A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a).
3.2.2 Frequent Sub-graph Mining
DEFINITION 7 Support: Given a graph database D, the support of a sub-graph g, de-
noted by Supg, is the fraction of the graphs in D of which g is a sub-graph, formally:
Supg = |{G′ ∈ D | g ⊆ G′}| / |D|.

Given a user-specified minimum support threshold minsup and a graph database D, a frequent sub-graph is a sub-graph whose support is at least minsup (i.e. Supg ≥ minsup), and the frequent sub-graph mining problem is to find all frequent sub-graphs in D.
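The following minimal Python sketch restates Definition 7 and the FSM objective; is_subgraph is again a hypothetical sub-graph isomorphism test, and the sketch deliberately omits candidate enumeration (AGM, gSpan, etc.).

# Fraction of graphs in database D that contain g as a sub-graph (Definition 7).
def support(g, D, is_subgraph):
    return sum(1 for G in D if is_subgraph(g, G)) / len(D)

# The FSM objective: keep every candidate whose support reaches minsup.
def frequent_subgraphs(candidates, D, minsup, is_subgraph):
    return [g for g in candidates if support(g, D, is_subgraph) >= minsup]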
DEFINITION 8 Graph Isomorphism: Given two graphs g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2), a graph isomorphism is a bijective function f : V1 → V2 satisfying
• μ1(v) = μ2(f(v)) for all nodes v ∈ V1,
• for each edge e1 = (v1, v2) ∈ E1, there exists an edge e2 = (f(v1), f(v2)) ∈ E2 such that ν1(e1) = ν2(e2), and
• for each edge e2 = (v1, v2) ∈ E2, there exists an edge e1 = (f−1(v1), f−1(v2)) ∈ E1 such that ν1(e1) = ν2(e2).
Two graphs are called isomorphic if there exists a graph isomorphism between them.
Figure 3-4: An example of graph isomorphism.
DEFINITION 9 Sub-graph Isomorphism: Given two graphs g1 = (V1, E1, μ1, ν1) and
g2 = (V2, E2, μ2, ν2), an injective function f : V1 → V2 from g1 to g2 is a sub-graph
isomorphism if there exists a sub-graph g ⊆ g2 such that f is a graph isomorphism between
g1 and g.
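As a hedged illustration, a sub-graph containment test of this kind can be written with networkx, assuming nodes and edges carry a 'label' attribute; since Definition 6 uses the non-induced notion (E1 ⊆ E2), the matching call is subgraph_is_monomorphic (available in newer networkx releases), whereas subgraph_is_isomorphic would require an induced copy of g1 inside g2.

import networkx as nx
from networkx.algorithms import isomorphism as iso

# Test whether g1 occurs inside G2 with matching node and edge labels.
def contains_subgraph(G2, g1):
    gm = iso.GraphMatcher(
        G2, g1,
        node_match=iso.categorical_node_match("label", None),
        edge_match=iso.categorical_edge_match("label", None),
    )
    return gm.subgraph_is_monomorphic()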
3.2.3 Graph Classification
DEFINITION 10 Graph Embedding: Let D be a set of graphs. A graph embedding is a
function ϕ : D → Rn mapping graphs to n-dimensional vectors, i.e.,
ϕ(g) = (x1, x2, ..., xn)′.
In Table 3.1, we summarize the advantages and disadvantages of graph representation compared with vector representation. Embedding graphs into vector spaces makes existing
algorithmic tools developed for feature based object representations immediately available
for graph classification.
After mapping graph data into vector space, graph classification is almost the same as
learning and classifying generic instances represented in the feature space.
3.3 EXPERIMENTAL STUDY
In this section we first discuss the benchmark data and the experimental settings, and then
report detailed experimental results and analysis.
Table 3.1: Comparison of the advantages and disadvantages of vector representation vs. graph representation
3.3.1 Benchmark Data
We carry out our experimental studies on seven real-world graph datasets, which are col-
lected from three different domains: Chemical compound, Protein structures, and Citation
network.
Chemical Compound: The activity of chemical compound molecules can be predicted by
their special 3-dimensional structures. In our experiments, we use a series of binary-label
graph datasets from the PubChem website1. This website provides real-world datasets on
the biological activities of small molecules, containing the bioassay records for anti-cancer
screen tests with different cancer cell lines, collected from the National Cancer Institute (NCI).
Each dataset belongs to a certain type of cancer screen with active or inactive response (i.e.
class labels) [81]. Because each NCI bioassay dataset contains very few active graphs, we
use under-sampling to down-sample inactive graphs to form a relatively balanced dataset
for performance evaluation. The number of vertices in most of those compounds ranges
from 10 to 200. We use five such graph datasets in the experiments; their details are reported in Table 3.2.
Protein Structure: Proteins are organic compounds made of amino acids sequences joined
together by peptide bonds. A huge amount of proteins have been sequenced over years, and
the structures of thousands of proteins have been resolved so far. The well known role of
proteins in the cell is as enzymes which catalyze chemical reactions [38]. The D&D dataset
we used contains 1178 protein structures that can be divided into two classes: 691 enzymes
1 http://pubchem.ncbi.nlm.nih.gov
Table 3.2: NCI datasets used in experiments
Datasets     Data Size   Avg. # Vertices   Avg. # Edges   # Node Labels   # Edge Labels
NCI33        2934        30.2              32.5           39              3
NCI410       2881        21.5              30.1           42              3
NCI2302      13764       23.9              27.7           38              3
NCI489028    11092       33.5              30.7           42              3
NCI485346    18426       29.5              31.4           44              3
and 487 non-enzymes [20]. Each protein is represented by a graph, in which the nodes
are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms
apart. These proteins, with an average size of 285 vertices and 716 edges, are larger and more strongly connected than the molecules from the NCI screening.
Table 3.3: DBLP dataset used in experiments
Classes   Descriptions                                          # Papers   # Graphs
DBDM      SIGMOD, VLDB, ICDE, EDBT, PODS, DASFAA, SSDBM,        20601      9530
          CIKM, DEXA, KDD, ICDM, SDM, PKDD, PAKDD
CVPR      ICCV, CVPR, ECCV, ICPR, ICIP, ACM Multimedia, ICME    18366      9926
DBLP Citation Network: The DBLP dataset is composed of bibliography data in the field
of computer science2. Each record in DBLP is associated with a number of attributes such
as paper ID, authors, years, title, abstract and reference ID [71]. We build a binary-label
graph dataset by using papers published in a list of conferences (as shown in Table 3.3).
The classification task is to predict whether a paper belongs to the field of DBDM (database
and data mining) or CVPR (computer vision and pattern recognition), by using references
and title of each paper. In our experiments, each paper in DBLP is represented as a graph,
where each node denotes a paper ID or a keyword, and each edge denotes either a citation relationship between papers or the co-occurrence of keywords in a paper's title. More specifically,
2 http://arnetminer.org/citation
we denote that (1) each paper ID is a node; (2) if a paper A cites another paper B, there is an
edge between A and B; (3) each keyword in the title is also a node; (4) each paper ID node
is connected to the keyword nodes of the paper; and (5) for each paper, its keyword nodes
are fully connected with each other. An example of DBLP graph data is shown in Figure
3-5, where the rectangles are paper ID nodes and diamonds are keyword nodes. The paper
ID17890 cites (connects) paper ID17883 and ID18068, and ID17890 has keywords Patch,
Motion, and Invariance in its title. Paper ID18068 has keywords Edge and Detection, and
paper ID17883’s title includes keywords Vision and Edge. For each paper, the keywords in
the title are linked with each other.
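For concreteness, the five construction rules can be sketched as follows (assuming networkx); paper is a hypothetical record with fields id, references, and keywords, and is not part of the DBLP schema described above.

import itertools
import networkx as nx

def build_dblp_graph(paper):
    G = nx.Graph()
    G.add_node(paper.id, kind="paper")        # rule (1): paper ID node
    for ref in paper.references:              # rule (2): citation edges
        G.add_node(ref, kind="paper")
        G.add_edge(paper.id, ref)
    for kw in paper.keywords:                 # rules (3)-(4): keyword nodes
        G.add_node(kw, kind="keyword")
        G.add_edge(paper.id, kw)
    for u, v in itertools.combinations(paper.keywords, 2):
        G.add_edge(u, v)                      # rule (5): keywords fully connected
    return G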
Figure 3-5: Graph representation for a paper (ID17890) in DBLP. The node in red is the main paper, nodes in black ellipses are citations, and nodes in black boxes are keywords.
3.3.2 Sub-graph Features
To understand the relationship between sub-graph features and the classification accuracy,
we carry out sub-graph feature selection by using different sizes of sub-graphs (with respect to the number of edges), ranging from one to nine (sub-graphs with more than nine edges
are much less frequent). In Table 3.4, we report the number of sub-graph features with
respect to different sizes (i.e. number of edges) in the benchmark datasets. Meanwhile, we
also vary the number of selected sub-graph features (i.e. the size of the “selected feature set”)
to include 50, 1000, and 2000 sub-graphs respectively, in all experiments.
To select sub-graph features, we use frequency-based and Information Gain (IG) based feature selection criteria, respectively. For the frequency-based criterion, we simply select the sub-graphs
Table 3.4: Number of sub-graphs with respect to different sizes (i.e. number of edges)
             #Edges
Datasets     1      2      3      4      5      6      7      8      9
NCI33        134    437    1361   >2000  >2000  >2000  >2000  >2000  >2000
NCI410       53     174    1512   >2000  >2000  >2000  >2000  >2000  >2000
NCI2302      140    449    1706   >2000  >2000  >2000  >2000  >2000  >2000
NCI489028    48     417    1298   >2000  >2000  >2000  >2000  >2000  >2000
NCI485346    39     140    562    1172   >2000  >2000  >2000  >2000  >2000
D&D          190    1356   >2000  >2000  >2000  >2000  >2000  >2000  >2000
DBLP         >2000  >2000  >2000  >2000  >2000  >2000  >2000  >2000  >2000
with the highest frequency. For the IG-based approach, we calculate the Information Gain (IG) between each sub-graph and the class label, and select the ones with the highest IG scores as sub-graph features. It is worth noting that IG is commonly used for sub-graph feature selection, and the most significant patterns likely fall into the high quantile of frequency [81].
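A minimal sketch of the IG criterion, assuming a binary feature matrix X and class labels y encoded as non-negative integers, is given below; it is an illustrative restatement of IG = H(Y) − H(Y|X), not the exact implementation used in the experiments.

import numpy as np

def entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Information gain between a 0/1 feature column x and integer labels y.
def information_gain(x, y):
    ig = entropy(np.bincount(y))                # H(Y)
    for v in (0, 1):
        mask = x == v
        if mask.any():
            ig -= mask.mean() * entropy(np.bincount(y[mask]))  # - p(x=v) H(Y|x=v)
    return ig

# Indices of the k sub-graph features with the highest IG scores.
def top_k_by_ig(X, y, k):
    scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]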
In addition to frequency and Information Gain based sub-graph features, we also em-
ploy a random feature selection approach. In our experiments, we first collect all one-edge
sub-graphs from the training set. After that, multiple-edge sub-graphs are generated by
randomly combining one-edge sub-graphs. If a multiple-edge sub-graph appears more than once in the training set, this sub-graph will be considered as a possible random sub-graph feature. This process is equivalent to collecting random sub-graphs from the whole structure space. Because our experiments focus on sub-graph features with different numbers of edges, we generate a sub-graph with X edges by using X one-edge sub-graphs selected in a random manner (X = 1, 2, · · · , 9). This process is far more efficient than generating all possible features from the training set, which is computationally infeasible.
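The generator can be sketched as follows; combine and count_occurrences are hypothetical helpers (the former joins X one-edge sub-graphs into a candidate, the latter counts how often the candidate appears in the training set), and are not part of the original text.

import random

def random_subgraph_features(one_edge_subgraphs, train_set, X, n_features,
                             combine, count_occurrences):
    features = []
    while len(features) < n_features:
        candidate = combine(random.sample(one_edge_subgraphs, X))
        # Keep a candidate only if it occurs more than once in the training set.
        if count_occurrences(candidate, train_set) > 1:
            features.append(candidate)
    return features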
Among all the benchmark datasets, DBLP is a special case. The node space in DBLP is very sparse, because there are thousands of nodes with different labels (as shown in Table 3.3); the nodes in DBLP represent paper IDs and paper keywords. As a result, the most discriminative features in DBLP are one-node sub-graphs. So in our experiments, we also include sub-graphs containing only one node (0-edge features denote one-node sub-graphs).
3.3.3 Experimental Settings
In our experiments, we use classification accuracy to measure the performance of the algorithms. Suppose the true label set of the testing data is Lt, and the label set returned by our algorithm is Lr; the accuracy is defined as |Lt ∩ Lr|/|Lr|. We use 10-fold cross-validation to evaluate and compare algorithm performance. Each graph dataset is evenly partitioned into 10 parts: one part is used as the test set, and the other nine parts are used for sub-graph mining, sub-graph feature selection, and classifier generation. To reduce variability, the reported results are averaged over 10 runs of cross-validation. To train classifiers from graph data, we use Support Vector Machines (the Lib-SVM based MATLAB package3) and the Nearest Neighbor algorithm (NN) [27]. All experiments are conducted on machines with 4GB RAM and Intel Core™ i5 CPUs of 3.10 GHz.
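For reference, the evaluation protocol can be sketched in Python with scikit-learn; the experiments themselves used a Lib-SVM based MATLAB package, so this is only an illustrative equivalent over a precomputed feature matrix X with labels y, with sub-graph mining and feature selection assumed to have been performed on the training folds.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def cross_validate(X, y, clf, n_splits=10):
    accs = []
    for tr, te in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        model = clf.fit(X[tr], y[tr])
        accs.append((model.predict(X[te]) == y[te]).mean())
    return float(np.mean(accs))

# acc_svm = cross_validate(X, y, SVC())
# acc_nn  = cross_validate(X, y, KNeighborsClassifier(n_neighbors=1))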
3.3.4 Results and Analysis
In this section we report the graph classification accuracy with respect to four major fac-
tors: (1) the sub-graph feature sizes; (2) the size of the sub-graph feature set; (3) different
learning algorithms; and (4) different benchmark datasets. Figures 3-6, 3-7, and 3-8 report the algorithm performance with respect to different sizes of sub-graph features, using Support Vector Machines (Figures 3-6 and 3-7) and Nearest Neighbour (Figure 3-8) as the learning algorithms. In each figure, each row corresponds to one benchmark dataset, and the left, middle, and right panels correspond to the results of using 50, 1000, and 2000 sub-graph features, respectively. The x-axis of each panel denotes the size (i.e. the number of edges) of the selected sub-graph features, and the y-axis denotes the classification accuracy. Each curve corresponds to the classification accuracies using sub-graph features selected by one of the three approaches: random sub-graph features, frequency-based sub-graph features, and Information Gain based sub-graph features.
3 http://www.csie.ntu.edu.tw/~cjlin/libsvm
The results in Figures 3-6, 3-7, and 3-8 suggest the following major findings.
The discriminative power of sub-graphs varies with their size: The results show that the size of sub-graph features has a significant impact on their discriminative power. For the NCI data, features with 4 to 7 edges yield good classification performance. For the D&D protein data, features with 1 to 3 edges are better than others. For the DBLP data, one-node features are the most discriminative. The variation in the discriminative power of sub-graphs is mainly attributed to the domain characteristics of the graph data:
• NCI graphs have very similar structures to one another; benzene rings, for example, can be observed in many graphs. Each NCI dataset focuses directly on a special type of compound (or structure), so most NCI graphs are structurally very similar.
• Although D&D graphs have fewer than 44 node labels, these graphs have very complex structures, and the distribution of the data is more scattered than the NCI data, because there are up to thousands of enzymes in the biosphere. The large structural differences result in fewer graphs sharing the same sub-graphs of relatively large size. As a result, the most discriminative features of D&D graphs have 1 to 3 edges on average.
Figure 3-6: Classification accuracy on five NCI chemical compound datasets with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
Figure 3-7: Classification accuracy on the D&D protein dataset and DBLP citation dataset with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
• DBLP is the sparsest graph dataset. There are thousands of unique paper IDs and keywords in DBLP (each paper ID and keyword represents one node). Because there is a very large number of node labels, it is hard to find shared sub-graphs with more than 4 edges between any two DBLP graphs. The most frequent and most discriminative features we found are, in fact, one-node features, and sub-graph features with more than 3 edges appear only very few times. As a result, when using features with 3 or more edges to build classifiers, the classification accuracy is about 50%, which is equivalent to random prediction (as shown in Figure 3-7 (d), (e), and (f)).
Overall, the experiments suggest that the actual discriminative power of sub-graphs varies, depending on their size and the domain of the graph data. Sub-graphs with many edges are not necessarily good candidates for graph classification.
Figure 3-8: Classification accuracy on one NCI chemical compound dataset, D&D protein
dataset, and DBLP citation dataset with respect to different sizes of sub-graph features
(using Nearest Neighbours: NN).
Random sub-graphs have a reasonably good performance: Compared to sub-graphs discovered by expensive sub-graph mining and selection criteria, such as frequency-based and information gain based sub-graphs, the classification accuracy of random sub-graphs is reasonably good. Our experiments show that none of the feature selection approaches significantly outperforms all other methods (including random sub-graphs) on all seven benchmark graph datasets. Although the accuracy of the information gain based method is often higher than the others, in most cases the difference in accuracy between random feature selection and frequency- or IG-based feature selection is less than 5%.
Random sub-graphs perform well on graph classification tasks because current graph classification and clustering methods are typically based on a binary sub-graph representation combined with traditional vector-based methods. As a result, this so-called coarse-grained representation model cannot capture the real structure of graphs well and causes information loss; different sub-graph selection methods can hardly compensate for the loss caused by the binary representation mechanism. We will give a novel fine-grained representation model to address this issue in the next chapter.
Overall, our experiments suggest that random sub-graphs are useful for graph classification. The classifiers built from random sub-graphs are not significantly inferior to their peers.
The number of sub-graphs is important to ensure good performance: When comparing the three panels in each row (the left, middle, and right panels represent results for 50, 1000, and 2000 sub-graph features, respectively), it is clear that increasing the number of sub-graph features often results in improved accuracy. For example, when increasing the number of sub-graphs from 50 to 1000 and to 2000, the average classification accuracy of the classifiers trained using the three types of sub-graph features (Random, Frequent, and IG) for 9-edge features on the NCI410 graph dataset increases from 70% to 73% and 75%. Similar trends can be observed on the other benchmark datasets. For each specific selected sub-graph feature subset (e.g. a feature set with 1000 sub-graphs), the accuracies vary significantly depending on the size of the sub-graphs in the set. A larger improvement can be observed for sub-graph feature subsets containing more features. For example, when using 2000 sub-graphs, the difference between the maximum and minimum classification accuracies is significantly larger than when using 50 sub-graphs (as shown in Figure 3-6 (d) vs. (f)).
Overall, our experiments suggest that the number of sub-graphs is important for achieving good classification accuracy. High accuracy can be expected if a good number of sub-graphs is combined with good sub-graph sizes.
Increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs: The results on all benchmark datasets show that as the number of sub-graphs increases, the overall prediction accuracy also increases (for example, Figure 3-6 (a), (b), and (c)). For a small number of sub-graphs, e.g. a feature set of size 50, the accuracies of the classifiers trained from different sub-graphs differ substantially, and random sub-graphs are inferior to the other approaches (as shown in Figure 3-6 (a), (d), (g), (j), and (m), where the differences in accuracy among the three methods range from 0% to 17.5%). When the size of the feature set increases to 2000, the accuracies of all three classifiers are very close to each other; in some cases, the three curves are exactly the same. This suggests that the performance of random sub-graphs steadily improves as an increasing number of sub-graphs is used to train the classifier. Indeed, when a large number of sub-graphs is used, the difference between feature sets is reduced. In an extreme case, if all sub-graphs are used to train classifiers, there is no difference between feature sets, which results in the same classifier performance.
Overall, our experiments suggest that the difference between classifiers trained using different sub-graphs is most noticeable when the number of sub-graphs is small. The difference shrinks as the number of sub-graphs increases, and may vanish when a relatively large number of sub-graphs, say more than 1000, is used. A direct conclusion from this observation is that, when using more than 1000 sub-graph features to train classifiers, there is not much difference between using random sub-graphs and using frequent sub-graphs mined and selected by other expensive methods.
Performance with respect to different classifiers shows a similar trend: In Figure 3-8, we report the results of using the Nearest Neighbour (NN) classifier on the benchmark data. Due to space limitations, we only select one NCI graph dataset. Compared with Support Vector Machines, the overall accuracies of NN are slightly lower than those of SVM. However, the overall trends, such as the increase or decrease of classification accuracy with respect to the size (or number) of sub-graphs, are very similar to the results from SVM. The observations we made for SVM are mostly valid for NN as well.
Overall, our experiments suggest that there is only a minor difference between classifiers trained by different learning methods, and the relationship between the classifier and the sub-graphs is largely the same across learning algorithms.
In this chapter, we found that current graph classification and clustering methods are typically based on a binary sub-graph representation combined with traditional vector-based methods. As a result, the number of sub-graphs and the sub-graph selection method have a large influence on subsequent graph classification or clustering performance. What is more, this so-called coarse-grained representation model cannot capture the real structure of graphs well and causes information loss. In the next chapter, we will therefore propose a new fine-grained representation model that enjoys a rigorous theoretical foundation ensuring minimal information loss for graph representation, and that also avoids the expensive sub-graph isomorphism validation process, which is proven to be NP-hard.
Chapter 4
Graph Hashing and Factorization for
Fast Graph Stream Classification
4.1 INTRODUCTION
As discussed in the last chapter, the number of sub-graphs and the sub-graph selection method have a large influence on subsequent graph classification or clustering performance. What is more, unlike traditional instance-feature representations, graph-structured data do not have explicit features to represent the graph, so an essential step in building a graph classification model
is to explore graph features to represent graph data in an instance-feature space for effective
learning [57]. The majority of existing graph classification models employ an occurrence-
based feature representation model, in which a set of sub-graph features is selected to
represent the graph data by using the occurrence of the sub-graph in the graph (either 0/1
occurrence or actual number of occurrences) as the feature value. In this chapter, we refer
to such an occurrence based representation model as coarse-grained representation model.
An example is shown in Fig. 4-1, where the substructures g1, g2, and g3 are used to represent
Graph Gi. While the coarse-grained feature representation model has been popularly used
for graph classification, it has a number of disadvantages, including:
Figure 4-1: Coarse-grained vs. fine-grained representation.

• Computation burden for isomorphism validation: Because sub-graphs are selected as features to represent the graph, the expensive sub-graph mining and isomorphism validation process cannot be avoided (sub-graph isomorphism is proven to be NP-hard) [19]. After the sub-graph features have been selected, the isomorphism validation process has to be applied again to map the graphs into the vector space.
• Severe and unbounded information loss: Because the coarse-grained representation model does not characterize the degree of closeness between a sub-graph pattern and the graph, it suffers severe information loss in representing the graph. More importantly, the information loss incurred in the representation is unbounded, so it is hard to characterize how closely the sub-graph features represent the underlying graph and what degree of information loss is incurred in the representation.
For a graph stream with continuously growing volumes (i.e. number of graphs) and
changing structures (e.g. new nodes may appear in the coming graphs), the disadvantage of
the existing coarse-grained graph feature model is even greater. This is because (1) stream
volumes continuously increase, which makes it computationally expensive to explore sub-
graph features, and (2) the changes in the graph stream (such as new structures) make
the sub-graph features incapable of representing graphs. For example, if new nodes or
structures appear in the graph stream, the previously discovered sub-graph patterns will not
be able to represent the graphs, because sub-graph patterns do not contain the information
of new nodes and new structures. Although it is always possible to rediscover new patterns
from the most recent data, the mining procedures are normally time-consuming, and any
features discovered may soon be outdated and incapable of representing future graphs.
Motivated by the above observations, we propose a fine-grained graph factorization
framework for efficient graph stream classification in this chapter. Being fine-grained, our
mapping framework relies on a set of discriminative frequent cliques, instead of important
sub-graph patterns, to represent the graph data. In addition, such a fine-grained representa-
tion ensures that the final instance-feature representation is sufficiently close to the original
graph, with theoretical guarantee.
To solve the problem, our main idea is to bypass the expensive sub-graph mining pro-
cess and use linear combinations of graph cliques to represent graphs. This is similar to the
Fast Fourier Transform (FFT) [55] in which a set of base functions (i.e. cliques) is used to
form input signals (i.e. graphs). To achieve this goal, our algorithm relies on two important
steps to extract fine-grained graph representation: (1) finding a set of frequent graph cliques
as the base; and (2) using graph factorization to calculate a linear combination of the graph
cliques to best represent a graph.
Compared to the traditional coarse-grained representation model, the advantage of fine-
grained representation is clear. As shown in Fig.4-1, the original graphs are represented in
more detail by the use of fine-grained feature mapping: Given sub-graph features g1, g2,
and g3 and graphs G1 and G2, coarse-grained representation represents Gi as instance Xi
with 0/1 feature values, whereas fine-grained representation represents Gi as an instance X′i with fractional values indicating the closeness of each sub-graph gj to the graph Gi. Even
though g3 does not appear in G1, there is still a feature value (0.09). This is because our
framework no longer relies on the occurrences of the sub-graphs but on the linear combina-
tions of a number of base patterns to represent graphs. A clear advantage of our method is
that fine-grained graph representation enjoys a rigorous theoretical foundation that ensures
minimal information loss for graph representation, and also avoids the expensive sub-graph
isomorphism validation process, which is proven to be NP-hard. After applying our fine-
grained feature mapping model to represent graphs, we can use any learning algorithm,
such as Nearest Neighbor or Support Vector Machines, for graph stream classification. Our
method offers a number of advantages including a fast feature mining process, more precise
graph representation, and better performance for graph stream classification. Experiments on two real-world network graph datasets demonstrate that our method outperforms state-of-the-art approaches in both classification accuracy and runtime efficiency.
The remainder of this chapter is organized as follows. Section 4.2 introduces the problem definition. We describe the proposed graph factorization method in Section 4.3. Section 4.4
describes the system framework FGSC for graph stream classification, with experimental
studies reported in Section 4.5.
4.2 PROBLEM DEFINITION
In this chapter, we propose a fast graph stream classification method using graph factor-
ization to address the aforementioned problems and achieve better performance in both
classification accuracy and runtime efficiency.
DEFINITION 11 (Connected Graph): A graph is represented as G = (V , E ,L), where
V = {v1, v2, · · · , vnv} is the set of vertices, E ⊆ V × V denotes a set of edges, and
L = {l1, l2, · · · , lnl} is the set of symbol labels for vertices and edges. A connected graph
is a graph such that there is a path between any pair of vertices.
DEFINITION 12 (Adjacency Matrix): An adjacency matrix of G (which contains Q
unique nodes) is denoted by A(G) ∈ RQ×Q, where each entry aij denotes the weight of
the edge from vertex vi to vertex vj; the diagonal entry aii denotes the label of node vi.
We use the adjacency matrix as the graph representation and assume for simplicity that each edge has a default weight of 1. A graph Gi is a labeled graph if a class label yi ∈ Y is assigned to Gi. For the binary classification problem, we have yi ∈ Y = {−1,+1}. A graph Gi is either labeled (denoted by G^L_i) or unlabeled (denoted by G^U_i).
DEFINITION 13 (Clique): A clique c = (V ′, E ′,L′) in a graph G = (V , E ,L) is a sub-
graph of G, where V ′ ⊆ V , E ′ ⊆ E and L′ ⊆ L such that every two vertices in c are
connected by an edge.
DEFINITION 14 (Graph Stream): A graph stream S contains an increasing number of graphs arriving in a streaming fashion, S = {G1, G2, · · · , Gi, · · · }. At any particular time point, we can collect a batch of graphs (B) for analysis. Formally, S = Σ_{j=1}^{∞} Bj, where Bj = {G_{j1}, G_{j2}, · · · , G_{jk_j}}.
The aim of graph stream classification is to learn a classification model from S, at any
particular time point, by scanning the graph stream only once, and predicting the class
labels of future arrival graphs in the graph stream with maximal accuracy.
4.3 GRAPH FACTORIZATION
In this section, we present the model used to describe the relationship between cliques
and graphs, and propose a graph factorization algorithm to generate fine-grained graph
representation.
4.3.1 Factorization Model
Given a graph encoded by a binary symmetric adjacency matrix, our goal is to use dis-
criminative feature-patterns to represent the graph precisely. In this chapter, we propose to
use cliques as feature-patterns to represent graphs. There are several advantages of using
cliques as feature-patterns: (1) Finding cliques does not require a complicated sub-graph
mining process (which is computationally expensive). Although finding maximal cliques is
an NP-complete problem (similar to the sub-graph mining problem), the special structure
allows us to develop a fast algorithm to find cliques (details are addressed in Section 4.4). (2) Because there is an edge between any two nodes vi and vj in a clique, there is no
need to describe the edges in the clique, so we can simply use all the nodes appearing in a
clique to form a vector to represent the clique.
We assume that the size of the feature set (which is a clique set), denoted by N , and the
size of graph node space, denoted by Q, are given. Our model is parameterized by wz and
mjz, where wz denotes the expected weight of clique cz, and mjz represents the occurrence
of vertex j in clique cz. The expected weight of the edge that lies between vertices i and
j in clique cz can then be re-written by mizwzmjz. If we sum up the cliques in a given
clique-pattern set C, where |C| = N , the expected weight of the edge that lies between
Figure 4-2: An example of graph factorization.

vertices i and j can be represented as

aij = Σ_{cz ∈ C} miz wz mjz        (4.1)
From Eq. (4.1), we have an expected adjacency matrix Â as follows, where M is a Q×N binary matrix and W is an N×N diagonal matrix:

Â(G) = MWM^T        (4.2)
In Eq. (4.2), each row of M corresponds to a node index, matching the node index in A(G). Each column of M records the occurrences of the nodes in a selected clique-pattern c ∈ C (Fig. 4-2 shows an example of M and A(G) for G1 in Fig. 4-1): Given graph G1 and three cliques c1, c2, and c3, the clique set matrix M records whether a node (row) appears in each of the three cliques (columns). Graph factorization uses M as a base to find a diagonal matrix W1 which best approximates the adjacency matrix with A(G1) ≈ Â(G1) = MW1M^T. The diagonal values in matrix W1 are used as feature
values to represent graph G1. More explicitly,

⎛ a11 · · · a1Q ⎞       ⎛ w1 · · · 0  ⎞
⎜  ⋮   ⋱   ⋮  ⎟  =  M ⎜  ⋮   ⋱   ⋮ ⎟ M^T        (4.3)
⎝ aQ1 · · · aQQ ⎠       ⎝ 0  · · · wN ⎠
where aij is an element of Â(G).
The decomposition of the adjacency matrix A(G) can be interpreted in terms of overlapping weight theory. More specifically, denote the ith column of M by M·i and let Θi = M·i M·i^T. Eq. (4.3) can then be rewritten as:

Â(G) = Σ_{i=1}^{N} wi Θi        (4.4)
We refer to Θi as a basic element. Because the ith column of M encodes a clique, Θi is a binary matrix in which each diagonal element indicates whether a vertex of the graph appears in this clique, and the off-diagonal entries represent the edge information of the clique. We use W ∈ R^{N×N} to denote the clique interaction matrix, which can also be considered the clique degree matrix. As a result, Eq. (4.4) means that a graph can be represented as a set of cliques with different weight values wi. For a given M, we can decompose a graph into a diagonal matrix W with diag(W) = [w1, w2, · · · , wN]^T, which can be used to represent the graph. This is equivalent to mapping graphs into a vector space. Unfortunately, in reality, M is a non-square matrix, and a Wk satisfying A(Gk) = MWkM^T exactly may not exist for a given graph Gk.
Thus, we want to find an optimal Wk that satisfies

A(Gk) ≈ MWkM^T        (4.5)
By using the squared loss to measure the relaxation error, we have

RE(A(Gk), W, M) = ||MWM^T − A(Gk)||² = ||Â(Gk) − A(Gk)||²        (4.6)

where RE(·) computes the relaxation error. The optimization problem of Eq. (4.5) can then be formulated as

Wk = argmin_W RE(A(Gk), W, M)        (4.7)
DEFINITION 15 (Fine-grained graph representation): Given a set of cliques c1, c2, . . . , cn and a set of graphs G1, G2, . . . , Gm, for each graph Gk, if there is a Wk satisfying Wk = argmin_W RE(A(Gk), W, M), we call Wk the fine-grained representation of Gk under c1, c2, . . . , cn.
4.3.2 Learning Algorithm
To solve Eq. (4.7), by combining Eqs. (4.2), (4.3), and (4.4), we have

Wk = argmin_W ||Σ_{i=1}^{N} wi · M·i M·i^T − A(Gk)||²        (4.8)

Wk = argmin_W ||Σ_{i=1}^{N} wi · Θi − A(Gk)||²        (4.9)

We define vec(·) as the matrix vectorization operator. Then

Wk = argmin_W ||Σ_{i=1}^{N} wi · vec(Θi) − vec(A(Gk))||²        (4.10)

Let P = [vec(Θ1), vec(Θ2), · · · , vec(ΘN)]. Then

Wk = argmin_W ||P · diag(W) − vec(A(Gk))||²        (4.11)

where diag(·) denotes the main diagonal of a matrix.
To solve Eq.(4.11), we employ the General Linear System Theorem proposed in [62].
THEOREM 1 (General Linear System Theorem): Let there exist a matrix B such that
By is a minimum 2-norm least-squares solution of a linear system Ax = y. Then it is
necessary and sufficient that B = A†, the Moore-Penrose generalized inverse of matrix A.
Remark. According to Theorem 1, we have the following properties for our proposed
graph mapping algorithm:
• The special solution x0 = A†y is one of the least-squares solutions of a general linear system: ||Ax0 − y||² = ||AA†y − y||² = min_x ||Ax − y||².
• The special solution x0 = A†y has the smallest 2-norm among all least-squares solutions of Ax = y: ||x0|| = ||A†y|| ≤ ||x||, ∀x ∈ {x : ||Ax − y||² ≤ ||Az − y||², ∀z ∈ R^n}.
• The minimum 2-norm least-squares solution of Ax = y is unique: x = A†y.
Based on Theorem 1, the smallest 2-norm least-squares solution of the above linear system (Eq. (4.11)) is

diag(Wk) = P† · vec(A(Gk))        (4.12)
where P† is the Moore-Penrose generalized inverse of matrix P.
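For illustration, Eq. (4.12) can be computed directly with NumPy; the following is a minimal sketch (not the thesis implementation) that builds P by vectorizing the basic elements Θi = M·i M·i^T and solves for diag(Wk) via the pseudo-inverse.

import numpy as np

def fine_grained_representation(A, M):
    # A: Q x Q adjacency matrix of one graph; M: Q x N clique set matrix.
    Q, N = M.shape
    # Column i of P is vec(Theta_i), the vectorized outer product of M's column i.
    P = np.column_stack([np.outer(M[:, i], M[:, i]).ravel() for i in range(N)])
    # Minimum 2-norm least-squares solution of P * diag(W) = vec(A(Gk)).
    return np.linalg.pinv(P) @ A.ravel()

In practice, P† can be computed once per batch and reused for every graph mapped against the same clique base M.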
By using the above factorization results, we can use diag(Wk) to map each graph Gk into a vector space; this is called the fine-grained representation of graph Gk. Because our factorization process ensures that the product of the Clique Set Matrix M and the diagonal matrix Wk best approximates the adjacency matrix of graph Gk, with A(Gk) ≈ Â(Gk) = MWkM^T, our method has the following three key advantages:
• Error Bound for Graph Representation: The difference between the original and
the approximated adjacency matrices, A(Gk) vs. Â(Gk), quantitatively measures
the information loss of graph representation. We can further adjust the Clique Set
Matrix M to ensure an information loss upper bound. This provides a clear theoret-
ical base for graph representation, which the existing coarse-grained representation
cannot achieve.
• Avoid Sub-graph Pattern Mining: The inputs of the graph factorization process
include adjacency matrix A(Gk) and the Clique Set Matrix M. In the next section
we will show that finding the clique set matrix is much more efficient than finding
frequent sub-graph patterns. In addition, our method does not need sub-graph match-
ing for new graphs to generate vectors as traditional methods do. Because the Clique
Set Matrix M has already been generated from the training data, we only need to
use matrix operations to map the test graphs into a vector space, which is more effi-
cient than sub-graph matching. As a result, our approach avoids expensive sub-graph
pattern mining for efficient graph stream classification.
• Better Representation for Graph Streams: In graph stream scenarios, the graph
data and structures may change continuously. As a result, it is very difficult to find
representative substructures to represent graph streams (even if the computational
cost is not an issue). The fine-grained representation uses linear combinations of
base cliques to approximate each graph. Because cliques are basic graph units, which
remain relatively stable in graph streams, our approach provides a better presentation
framework for continuously changing graph streams.
4.4 FAST GRAPH STREAM CLASSIFICATION
In this section, we first introduce the overall framework for graph stream classification and
then address details of the method in respective subsections.
4.4.1 Overall Framework
The proposed graph stream classification framework (FGSC), is illustrated in Fig. 4-3,
which contains three key modules:
Figure 4-3: The framework of FGSC for graph stream classification.
• Clique Mining: To tackle continuously increasing graph stream volumes and ex-
panding graph nodes and structures, we collect graph data into batches with each
batch containing a number of graphs. For each single graph Gk in a batch, we use a
node hashing strategy to map the unlimited node space into a fixed-size node hashing
set, so our method can effectively handle graph streams with expanding new nodes
and structures. We then decompose the compressed graph into a number of cliques
by using the maximal clique representation method.
• Clique Set Matrix and Graph Factorization: The above clique mining process
may find an increasing number of clique-patterns, whereas we can only select a fixed
number of cliques as a base for factorization. Therefore, we propose a clique hashing
strategy to find discriminative frequent clique-patterns to form the Clique Set Matrix
(M). It is worth noting that the matrix M, in the graph stream scenarios, is discov-
ered from current and previous batches so we can find a good set of base cliques for
graph factorization, as shown in the dotted box in Fig. 4-3. As a result, all graphs in
the current batch Bi can be cast into a vector space through matrix factorization.
• Graph Stream Classification: After mapping the graphs into the vector space, a
classifier is trained from the current batch Bi. The graphs in the next chunk Bi+1
are collected as a test set (which is denoted by Gtest in Fig. 4-3). To ensure that the
vector space-based classifier properly classifies the test graphs, each graph in Bi+1 is
processed using the factorization module and converted into the vector space by using the generated W for classification.
4.4.2 Graph Clique Mining
As shown in Fig. 4-3, instead of relying on expensive frequent sub-graph mining to capture
graph features, we propose to use clique-based graph factorization to represent the graph
data. Our first step is to select a set of cliques from each graph in the stream. In stream sce-
narios, the node set of the graph stream can be extremely large and unbounded. In addition, new nodes and structures may emerge in the graphs, so it is necessary to compress each incoming graph into a fixed node space. We use a random hash function to map the original unbounded node set onto a significantly compressed node set Ω of size Q. In other words, nodes in the compressed graph of Gk will be re-indexed by {1, · · · , Q}.
Since multiple nodes in Gk may be hashed onto the same index, the weight of one edge
in the compressed graph is set to the number of edges between the original nodes that are
hashed onto the compressed nodes. After hashing all nodes into a fixed-size space, each
graph Gk ∈ S is transferred into a compressed graph G∗k for clique mining. We define
this node hashing strategy using G∗k := NodeHash(Gk, Q). An example of graph clique
mining is illustrated in Fig. 4-4, where step (B) shows the compressed graph.
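To make the node hashing step concrete, the following Python sketch compresses a graph into the fixed node space Ω of size Q. It is a minimal illustration under stated assumptions, not the thesis implementation: the concrete hash function (an MD5-based one here), the networkx data structures, and the choice to drop self-loops caused by collisions are all assumptions.

```python
import hashlib
import networkx as nx

def node_hash(g: nx.Graph, Q: int) -> nx.Graph:
    """Map an unbounded node set onto {0, ..., Q-1}; the weight of a
    compressed edge counts the original edges collapsed onto it."""
    def h(node):
        # a stable random hash; the thesis does not fix a concrete function
        return int(hashlib.md5(str(node).encode()).hexdigest(), 16) % Q
    compressed = nx.Graph()
    compressed.add_nodes_from(range(Q))
    for u, v in g.edges():
        hu, hv = h(u), h(v)
        if hu == hv:
            continue  # assumption: self-loops created by collisions are dropped
        w = compressed.get_edge_data(hu, hv, default={"weight": 0})["weight"]
        compressed.add_edge(hu, hv, weight=w + 1)
    return compressed
```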
To discover cliques from compressed graphs, we propose the following fast approach.
According to the graph compression results, the weight of one edge in the compressed
graph denotes the number of edges that exist in the original graph. Our clique mining
process can take this information into consideration. More specifically, our maximal clique
set mining process includes:
• Finding Maximal Cliques: We first discover the maximal cliques CG∗k from the compressed graph G∗k (where no maximal cliques in CG∗k share any overlapping edge). Because all nodes in compressed graph G∗k are unique, finding maximal
[Figure: panels (A)-(E) show a graph G, its node-hashed compressed graph with edge weights, and the successive extraction of cliques c1-c4.]
Figure 4-4: An example of clique mining in a compressed graph.
cliques from G∗k is very efficient (we use the Bron-Kerbosch maximal clique find-
ing algorithm in our experiments). An example of the maximal clique c1 is shown in
Fig. 4-4 (C) upper panel;
• Clique Extraction: We set the weight of each edge in a discovered maximal clique c ∈ CG∗k equal to the smallest weight value of the edges in c. After that, we decrease the weights of the edges in the compressed graph G∗k by the weights of the corresponding edges in CG∗k and generate a new weighted graph G′k. An example of the updated graph G′k, after extracting clique c1, is shown in Fig. 4-4 (C) lower panel;
• Loop: If the weight of any edge in the updated graph G′k is 0, the edge has effectively been removed and no longer needs to be considered in the subsequent clique mining process. Looping between the above two steps discovers all cliques, until all edge weights in G′k are equal to 0. An example is shown in Fig. 4-4 (E) lower panel.
It is worth noting that because all nodes in the compressed graph G∗k are unique, graph G∗k can be completely restored (i.e. rebuilt by using all discovered cliques). This ensures that there is no information loss during graph decomposition. Algorithm 1 describes
the detailed process of clique mining, where MaximalCliques(G′k) finds the maximal
cliques from G′k. MinimalWeight(c) sets input clique c’s edge weight to the smallest
weight value of all edges in c. UpgradeWeights(G′k) updates the weight values of graph
G′k.
Algorithm 1 Clique Mining
Input: graph Gk
1: G∗k ← NodeHash(Gk, Q);
2: G′k ← G∗k;
3: Ckout ← φ;
4: while any edge in G′k has weight unequal to 0 do
5:   CG′k ← MaximalCliques(G′k);
6:   for all c ∈ CG′k do
7:     c ← MinimalWeight(c);
8:   end for
9:   Ckout ← Ckout ∪ CG′k;
10:  G′k ← UpgradeWeights(G′k);
11: end while
Output: Ckout;
An example of the graph clique mining is illustrated in Fig. 4-4, where nodes of the
same color are compressed into one node, with the weight values in the compressed graph
being changed accordingly. After the clique mining process, graph G is decomposed into
a clique set Cout = {c1, c2, c3, c4} for graph factorization.
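The clique mining loop of Algorithm 1 can be sketched as follows. This is a simplified variant under stated assumptions: it uses networkx's find_cliques (a Bron-Kerbosch implementation, matching the algorithm named in the experiments) and extracts one maximal clique per pass before re-enumerating, rather than processing a whole edge-disjoint clique set per iteration.

```python
import networkx as nx

def clique_mining(gk: nx.Graph):
    """Decompose a compressed weighted graph into weighted cliques; returns
    a list of (sorted node tuple, weight) pairs."""
    g, cliques_out = gk.copy(), []
    while g.number_of_edges() > 0:
        for clique in nx.find_cliques(g):        # Bron-Kerbosch enumeration
            if len(clique) < 2:
                continue                          # skip isolated nodes
            edges = [(u, v) for i, u in enumerate(clique) for v in clique[i + 1:]]
            w_min = min(g[u][v]["weight"] for u, v in edges)
            cliques_out.append((tuple(sorted(clique)), w_min))
            for u, v in edges:                    # subtract the clique's weight
                g[u][v]["weight"] -= w_min
                if g[u][v]["weight"] == 0:        # fully consumed edges vanish
                    g.remove_edge(u, v)
            break                                 # re-enumerate on the residual graph
    return cliques_out
```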
4.4.3 Clique Set Matrix and Graph Factorization
In this subsection, we introduce the detailed process for finding a set of discriminative
frequent cliques to form the Clique Set Matrix (M), and then use graph factorization for
feature mapping.
Discriminative Frequent Cliques
To find a good set of cliques to form the base M for graph factorization, we use correlations between each clique and the graph labels (i.e. similar to Information Gain [44] and the G-test score [66]) to find cliques that are highly correlated to the classification task. For fast discriminative clique finding, we use an “in-memory” Clique-class table Γ with (Q + |Y| + 2) columns to count the frequency of each clique with respect to each class label.
An example of an “in-memory” Clique-class table is shown in Fig. 4-5. Q is the size of the compressed node set Ω and |Y| is the number of graph class labels (the table in Fig. 4-5 contains four nodes and two classes, i.e. Q = 4 and |Y| = 2). A clique c and its statistical information are recorded in one row, where the first Q columns represent the occurrence of the nodes in c (in Fig. 4-5, clique c1 contains nodes 1, 3, and 4). The frequencies of the clique with respect to each class (Yi) are recorded in the next |Y| columns; in Fig. 4-5, c1 appears 85 times in one class and 20 times in the other. The next column records the output of the random hash function H, which helps to speed up the update of the statistical information of each clique-pattern; in Fig. 4-5, the 7th column records the hashing output of c1, which is 1531. The last column records the discrimination-test score (objective score) of each clique. For example, ck is a new clique whose hashing result is 1531; we only compare ck with c1 and c3, find that it is equal to c1, and then update the statistical information of c1.
For each graph Gk in the stream, we first use Algorithm 1 to collect its clique set Ckout .
After that, for each clique in Ckout , say ck,j , we then apply a random hash function h(·) to
the string of ordered edges in ck,j to generate an index Hk,j ∈ {1, 2, · · · , τ}, where τ is a
control parameter. Then we check the second-last column of Γ to validate whether there
is a value equal to Hk,j . In other words, we only compare the nodes of ck,j with the nodes
on Γpo, where Hk,j = Γpo, o = Q + |Y| + 1. If they are the same clique, we update the
information by adding 1 to the corresponding class Label column to which Gk belongs.
If ck,j does not appear in the current Γ, we add a new row at the end of Γ to record this
clique. The clique hashing strategy used here is to accelerate clique matching. Instead of
comparing all cliques, we only compare cliques with the same hashing results.
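A minimal sketch of the clique hashing and table update just described, assuming a Python dict keyed by the hash index stands in for the second-last column of Γ; the helper names and row layout are hypothetical.

```python
import hashlib

def clique_hash(clique, tau):
    """Hash the string of ordered edges of a clique into {0, ..., tau-1}."""
    nodes = sorted(clique)
    edges = ";".join(f"{u}-{v}" for i, u in enumerate(nodes) for v in nodes[i + 1:])
    return int(hashlib.md5(edges.encode()).hexdigest(), 16) % tau

def update_table(table, clique, label, tau):
    """table: dict mapping hash index -> rows; each row keeps the clique's
    node set and per-class frequency counts (the Clique-class table Gamma)."""
    h = clique_hash(clique, tau)
    key = frozenset(clique)
    for row in table.setdefault(h, []):
        if row["nodes"] == key:                 # only same-hash rows are compared
            row["counts"][label] = row["counts"].get(label, 0) + 1
            return
    table[h].append({"nodes": key, "counts": {label: 1}})
```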
Given the “in-memory” Clique-class table Γ, we can find discriminative frequent clique-
patterns by using two parameters: frequency threshold α and clique feature number m. We
sort all cliques whose frequencies are equal to or greater than α by using their objective
scores (e.g. IG) in descending order (the objective scores are listed in the last column of Γ).
We then use the top m cliques with the highest scores to generate Clique Set Matrix M.
Figure 4-5: An example of “in-memory” Clique-class table Γ.
Algorithm 2 Finding Discriminative Frequent Cliques
Input: Training graph set G, frequency threshold α and feature number m
1: for all Gk ∈ G do
2:   Ckout ← CliqueMining(Gk);
3:   Γ ← Γ ⊎ Ckout;
4: end for
5: Γ ← Select(Γ, α);
6: Sort(Γ);
7: M ← Top(Γ, m);
Output: M;
Algorithm 2 shows the procedure of discriminative frequent clique-pattern finding, in which CliqueMining(·) is Algorithm 1 and the operator “⊎” updates Γ by comparing each newly mined clique with the existing rows of Γ. Select(Γ, α) is a function to find cliques whose frequencies are equal to or greater than α. Sort(Γ) sorts cliques in descending order according to their objective scores, and Top(Γ, m) returns the top m rows of Γ.
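The objective score in the last column of Γ can be instantiated in several ways; the text mentions Information Gain [44] and the G-test score [66]. The sketch below computes a plain binary information gain from the per-class counts kept in Γ; the binary-class restriction and function names are illustrative assumptions. Cliques with frequency at least α would then be sorted by this score and the top m retained to form M.

```python
from math import log2

def entropy(p, n):
    """Binary entropy of a (positive, negative) count pair."""
    t = p + n
    if t == 0 or p == 0 or n == 0:
        return 0.0
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def information_gain(c_pos, c_neg, total_pos, total_neg):
    """IG of splitting the training graphs on a clique's presence,
    given the clique's per-class counts and the overall class counts."""
    total = total_pos + total_neg
    present = c_pos + c_neg
    absent_pos, absent_neg = total_pos - c_pos, total_neg - c_neg
    remainder = (present / total) * entropy(c_pos, c_neg) \
        + ((total - present) / total) * entropy(absent_pos, absent_neg)
    return entropy(total_pos, total_neg) - remainder
```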
Feature Mapping
Once the Clique Set Matrix M has been generated, we can convert each compressed graph G∗k into the same vector space by using the proposed factorization method. The factorization generates a diagonal matrix Wk under the objective
$$W_k = \arg\min_{W} \|M W M^{\top} - A(G^{*}_k)\|^2,$$
and diag(Wk) can then be used as the fine-grained representation of G∗k.
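Because M diag(w) M^T is linear in the entries of w, the factorization objective reduces to a linear least-squares problem. The sketch below solves it directly, assuming M stacks the m selected cliques as 0/1 node-indicator columns over the Q hashed nodes and that no constraints are imposed on w; the actual optimizer used for the thesis experiments may differ.

```python
import numpy as np

def feature_mapping(M: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Solve w = argmin_w ||M diag(w) M^T - A||_F^2 by least squares.
    M: Q x m clique indicator matrix; A: Q x Q compressed adjacency matrix."""
    Q, m = M.shape
    # each basis matrix m_j m_j^T contributes linearly in w_j
    B = np.stack([np.outer(M[:, j], M[:, j]).ravel() for j in range(m)], axis=1)
    w, *_ = np.linalg.lstsq(B, A.ravel(), rcond=None)
    return w
```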
4.4.4 Graph Stream Classification
Given the vector representation of each graph in a batch of graphs (Bi), we can use any
generic learning algorithm, such as Nearest Neighbors (NN), Naive Bayes or support vec-
tor machines (SVM), to train a classifier. The classifier is then used to predict class labels
for new graphs that have yet to arrive. As soon as the new graphs arrive and a new graph
batch is collected and labeled, we can easily update the classification model by training
a new classifier from the new batch. This procedure is detailed in Algorithm 3, where FeatureMapping(·) is used to obtain the fine-grained representation by using graph factor-
ization.
Algorithm 3 Classification
Input: Classifier ζ, GUtest and M
1: GU∗test ← NodeHash(GUtest, Q);
2: Vtest ← FeatureMapping(M, GU∗test);
3: ytest ← ζ(Vtest);
Output: ytest;
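The batch-wise train/test protocol can be sketched as below, assuming each chunk has already been mapped to vectors by FeatureMapping; the 1-NN learner is one of the classifiers used in the experiments, but the exact training code and data layout are assumptions of this sketch.

```python
from sklearn.neighbors import KNeighborsClassifier

def stream_classify(vector_batches):
    """vector_batches: iterable of (X, y) chunks of fine-grained vectors."""
    clf, accuracies = None, []
    for X, y in vector_batches:
        if clf is not None:
            accuracies.append(clf.score(X, y))   # B_{i+1} serves as the test set
        clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # retrain on B_i
    return accuracies
```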
4.5 EXPERIMENTS
In this section we report the performance of the proposed factorization-based fast graph
stream classification method (FGSC) on real-world graph data. Our experiments mainly
focus on the study of efficiency and effectiveness.
4.5.1 Benchmark Data
We validate the performance of the algorithm on the following two real-world graph streams.
• DBLP Stream 1: DBLP is a computer science bibliography. Each record in DBLP
denotes one publication, and the record is associated with a number of attributes
1http://arnetminer.org/citation
such as the paper ID, authors, year, title, abstract, and reference ID [72]. In our ex-
periments, we build a graph stream with binary classification tasks by using papers
published in a list of selected conferences (as shown in Table 4.1). The classifica-
tion task is to predict whether a paper belongs to the field of DBDM (database and
data mining) or CVPR (computer vision and pattern recognition), by using structural
information, such as the title and references, of each paper. In our experiments, we
represent each paper in DBLP as a graph, with each node denoting a paper ID or
a keyword and each edge representing the citation relationship between papers or
keywords appearing in the paper’s title. We make the following designations: (1)
each paper ID is a node; (2) if a paper A cites another paper B, there is an edge
between A and B (we use undirected edge); (3) each keyword in the title is also a
node; (4) each paper ID node is connected to the keyword nodes of the paper; and
(5) the keyword nodes of each paper are fully connected with one another. An exam-
ple of DBLP graph data is shown in Fig.4-6: The rectangles are paper ID nodes and
diamonds are keyword nodes. The paper ID17890 cites (connects) paper ID17883
and ID18068, and ID17890 has the keywords Patch, Motion, and Invariance in its
title. Paper ID18068 has the keywords Edge and Detection, and paper ID17883’s title
includes the keywords Vision and Active. The title keywords in each paper are linked
with one another. In the experiments, all papers are arranged in chronological order
to form a graph stream. It is worth noting that the paper title keyword space is very
large (more than one thousand), and new title keywords may appear at any time, so
the node space is constantly changing and expanding, which represents the inherent
challenge of the graph stream.
Table 4.1: DBLP dataset used in experiments.
Figure 4-6: Graph representation for a paper (ID17890) in DBLP.
• IBM Sensor Stream 2: This stream contains information about local traffic on a
sensor network which issues a set of intrusion attack types. Each graph consti-
tutes a local pattern of traffic in the sensor network where nodes denote the IP-
addresses and edges correspond to local traffic patterns (local traffic flows between
IP-addresses) [1]. The selected data contains a stream of intrusion graphs from June
1, 2007 to June 3, 2007. Each graph is associated with a particular intrusion type and
there are over 300 different local patterns in the dataset. We only choose 20 typical
intrusion types (10 types of “BWEST” pattern and 10 types of “SNMP” pattern) that
are considered as class labels in our experiments. Our goal is to classify a traffic
flow pattern as one of the 20 intrusion types. More details about this data can be
found in [1]. Compared to the DBLP stream, each graph in the IBM stream has far
fewer nodes and the node labels have fewer overlaps between different graphs, which
makes the problem particularly difficult for accurate classification. We intentionally
choose a graph structure that is very different from the DBLP stream, therefore the
results on IBM and DBLP streams can help to evaluate the algorithm’s performance
on different types of graphs.
2http://www.charuaggarwal.net/sens1/gstream.txt
4.5.2 Experimental Settings
Baseline Methods: To evaluate the efficiency and effectiveness of our graph stream classi-
fication framework, we compare the proposed FGSC (FG+Stream) with two other meth-
ods.
• Coarse-grained representation-based method (CG+Stream): Compared with our fine-grained graph representation, this method uses a traditional coarse-grained feature representation model, which uses the occurrence of discriminative frequent cliques to represent graph data (the selected cliques and the classification methods are the same as in our proposed method, except that CG+Stream uses 0/1 occurrence as the feature value). This method is similar to the clique hashing-based graph stream classification method in [15].
• 2-D edge hash-compressed stream classifier (EH+Stream): This method employs
a 2-D random edge hashing scheme to construct an “in-memory” summary for the
sequentially presented graphs [1]. The first random-hash scheme is used to reduce
the size of the edge set. The second min-hash scheme is used to dynamically update a
number of hash-codes, which summarize the frequent patterns of co-occurring
edges observed so far in the graph stream. A simple heuristic is used to select a set of
most discriminative frequent patterns to build a rule-based classifier for classification.
In the experiments, we use 10-fold cross-validation to evaluate the performance of the
graph stream classification. For fair comparison, all three methods (FG+Stream, CG+Stream,
and EH+Stream) use different chunk sizes and feature numbers to predict the class labels
of graphs in the stream. For EH+Stream, the Nearest Neighbor (NN) classifier is used as the classifier in the original method, so we cannot apply other learning algorithms to this method.
For the other two methods (FG+Stream and CG+Stream), we use three different classifiers,
including Nearest Neighbor (NN), Support Vector Machines (SMO), and Naive Bayes, to
form an ensemble and predict the class labels of graphs in a future chunk. The default
parameter settings are as follows: Batch size |D| = {600, 800, 1000} (for DBLP) and
{300, 400, 500} (for IBM); feature size |m| = {62, 142, 307} with frequency threshold α = {2%, 1%, 0.5%} (for DBLP), and |m| = {43, 75, 148} with α = {0.3%, 0.1%, 0.06%} (for IBM), respectively. All experimental results are collected from a Linux cluster computing node with an Intel(R) Xeon(R) @3.33GHz CPU and 4GB fixed memory size.
4.5.3 Experimental Results
In this section, we report the algorithm performance with respect to the classification accu-
racy and classification efficiency of the graph stream.
Graph Stream Classification Accuracy
Results on different chunk sizes |D|: In Figs. 4-7 and 4-10, we report the algorithm
performance by using different numbers of graphs in each chunk |D| (varying from 1000,
800, to 600 for the DBLP stream and from 500, 400, to 300 for the IBM stream).
Overall, the results in Fig. 4-7 show that the accuracies of FG+Stream on DBLP stream
are better than that of CG+Stream and EH+Stream. The EH+Stream method has the worst
performance, and its accuracy is slightly over 50%, which is nearly equivalent to random
predictions. The accuracy of FG+Stream, on all three experiments, is consistently better
than that of CG+Stream and EH+Stream, even though the accuracy fluctuates to a large
extent on some chunks (such as Batch 14 in Fig. 4-7 (b)). Recall that the only difference
between FG+Stream and CG+Stream is that the former uses fine-grained feature representa-
tions whereas the latter uses 0/1 occurrence (both methods have the same set of discrimina-
tive cliques). As a result, the better performance of FG+Stream, compared to CG+Stream,
can be attributed to the fact that fine-grained representation provides more accurate in-
formation to describe each single graph and ensures minimal information loss for graph
representation. As we have explained in Section 4 (Graph Factorization), FG+Stream uses
a linear combination of cliques to represent each graph, so even if some features do not
appear in a specific graph, they may still have a value to indicate each clique’s correlation
to the graph. By contrast, CG+Stream can only assign zero value to indicate that the fea-
ture does not appear in the graph. As a result, the feature information given by fine-grained
representation ensures good accuracy for graph classification.
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-7: Accuracy w.r.t different chunk sizes on DBLP Stream. The number of features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800; (c) 600.
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-8: Accuracy w.r.t different number of features on DBLP Stream with each chunk containing 1000 graphs. The number of features selected in each chunk is: (a) 307; (b) 142; (c) 62.
[Figure: three line plots of Accuracy (%) vs. Batch ID; panel (a) compares FG+Stream, CG+Stream, and EH+Stream, while panels (b) and (c) compare FG+Stream and CG+Stream.]
Figure 4-9: Accuracy w.r.t different classification methods on DBLP Stream with each chunk containing 1000 graphs, and the number of features in each chunk is 142. The classification methods selected here are: (a) NN; (b) SMO; (c) NaiveBayes.
The results in Fig. 4-10 show that, for the IBM stream, the accuracies of FG+Stream
and CG+Stream are almost identical across the whole stream. This is mainly because each graph in the IBM stream is composed of only a few nodes (most graphs contain fewer than four nodes). As a result, the features mined from the training set contain only a few
nodes (e.g. one or two nodes) and the overlap between any two features (i.e. the same nodes
and edges shared by two features) is less frequent. The feature weight values calculated
by our graph factorization method thus degrade to 0/1 occurrence of the features. As a
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-10: Accuracy w.r.t different chunk sizes on IBM Stream. The number of features in each chunk is 75. The batch sizes vary from (a) 500; (b) 400; to (c) 300.
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-11: Accuracy w.r.t different number of features on IBM Stream with each chunk containing 400 graphs. The number of features selected in each chunk is: (a) 148; (b) 75; (c) 43.
[Figure: three line plots of Accuracy (%) vs. Batch ID; panel (a) compares FG+Stream, CG+Stream, and EH+Stream, while panels (b) and (c) compare FG+Stream and CG+Stream.]
Figure 4-12: Accuracy w.r.t different classification methods on IBM Stream with each chunk containing 400 graphs, and the number of features in each chunk is 75. The classification methods selected include: (a) NN; (b) SMO; (c) NaiveBayes.
result, the vector representation obtained by using FG+Stream is almost the same as that of
CG+Stream, which results in the same classification accuracy. The accuracy of EH+Stream
is significantly worse than the other two methods. As shown in Fig. 4-10 (a), no graph is
correctly classified for chunks 14-15, 17-19, 21, and 25-27. This demonstrates that for
graphs with very few nodes and edges, using edge hashing, as EH+Stream does, cannot
obtain good classification accuracy.
Results on different numbers of features |m|: In Figs. 4-8 and 4-11, we report the algo-
rithm’s performance with respect to the different number of features in each chunk.
As expected, FG+Stream has the best performance of the three algorithms for the DBLP
stream. In addition, the results in Fig. 4-8 show that increasing the number of features in the
DBLP stream actually increases the accuracy of the CG+Stream method, while the accu-
racy of the FG+Stream method remains relatively stable. This is because CG+Stream directly uses the selected clique features to represent a graph, whereas FG+Stream uses combinations of
clique features for representation. As a result, FG+Stream has less dependence on the
number of features selected for graph representation.
Interestingly, even though the accuracy of all three methods fluctuates to a large extent across the whole graph stream, the results in Fig. 4-11 show that the accuracy of each method is almost unaffected by changes in the feature size. This may suggest that for the IBM
stream, a small number of features (such as special IP-addresses or IP-address connections)
may already have enough discriminative power for classification. Increasing the number of
features may only result in redundant features in the learning process, which is not only
time-consuming for learning but also not effective in improving classification accuracy.
Results on different classifiers: For the EH+Stream method, the authors use a nearest
neighbor classifier that scans the training data and re-organizes edges into graphs, and then
finds the closest graph for the test instance [1]. Essentially, this is a Nearest Neighbor classifier, so we cannot apply other classifiers to EH+Stream to validate its performance. We validate the performance of FG+Stream and CG+Stream by using three ma-
jor classifiers, including Nearest Neighbor (NN), Support Vector Machines (SVM), and
Naive Bayes. The results in Fig. 4-9 show that, for DBLP data, Naive Bayes has the worst
performance compared to NN and SVM, and when using Naive Bayes, the accuracies of
FG+Stream and CG+Stream fluctuate significantly across the whole stream. Meanwhile,
NN and SVM have comparable performance in most cases. For the IBM stream, the clas-
sification accuracies of all three classifiers are very close to one another.
[Figure: two line plots of accumulated runtime (10^5 ms) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-13: System accumulated runtime by using the NN classifier, where |D| = 1000, |m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively. (a) Results on DBLP stream; (b) Results on IBM stream.
Graph Stream Classification Efficiency
In Fig. 4-13, we report the system runtime performance (efficiency) for graph stream pro-
cessing. The results show that FG+Stream always requires less time than CG+Stream. This
is mainly because FG+Stream avoids the expensive sub-graph isomorphism validation pro-
cess for vector generation, which is very time-consuming. EH+Stream performs better
than the other two methods on the IBM stream and has a very fast runtime for the first few
chunks. This is because EH+Stream always keeps a feature table of fixed size. The features
in the table will be replaced once a better feature is detected. The size of graphs in the IBM stream is very small (only several nodes) compared to the DBLP stream, so stable features are found quickly, which leads to less feature evaluation and replacement during updating.
Overall, the results show that FG+Stream scales linearly with the number of chunks. The
runtime of FG+Stream is acceptable compared to its high classification accuracy. As a
result, FG+Stream is a practical solution for handling real-world high speed graph streams.
Chapter 5
Super-graph based Classification
5.1 INTRODUCTION
Many applications, such as social networks and citation networks, commonly use graph
structure to represent data entries (i.e. nodes) and their structural relationships (i.e. edges).
When using graphs to represent objects, all existing frameworks rely on two approaches to describe node content: (1) node as a single attribute: each node has only one attribute
(single-attribute node). A clear drawback of this representation is that a single attribute
cannot precisely describe the node content [40]. This representation is commonly referred
to as a single-attribute graph (Fig.5-1 (A)). (2) node as a set of attributes: use a set of
independent attributes to describe the node content (Fig.5-1 (B)). This representation is
commonly referred to as an attributed graph [8, 14, 80].
Both single-attribute and attributed graphs emphasize the dependency structures between objects (i.e. nodes) but inherently overlook the internal structures inside each node, where the attributes/properties used to describe the node content may be subject to dependency structures as well. For example, in a ci-
tation network each node represents one paper and edges denote citation relationships. It
is insufficient to use one or multiple independent attributes to describe detailed informa-
tion of a paper. Instead, we can represent the content of each paper as a graph with nodes
denoting keywords and edges representing contextual correlations between keywords (e.g.
Figure 5-1: (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph.
co-occurrence of keywords in different sentences or paragraphs). As a result, each paper
and all the references it cites can form a super-graph, with each edge between papers denoting a citation relationship. Similarly, in a protein interaction network each node
represents one protein and edges denote interaction relationships. A protein may undergo
reversible structural changes in performing its biological function and these changes are
likely not shown in a sequence representation [85]. So using the sequence alone is insufficient to describe a protein. As shown in Fig. 5-2, each protein is represented as a super-node by using its molecular structure, and proteins form a super-graph with edges between proteins denoting their interaction relationships. Two interacting proteins may share functional or structural similarity. In this chapter, we refer to this type of graph, where the content of the
node can be represented as a graph, as a “super-graph”. Likewise, we refer to the node
whose content is represented as a graph, as a “super-node”.
Given a set of super-graphs, the classification problem is to identify which category a super-graph belongs to, based on the known information, such as labeling information, super-node structure, and node content. To build learning models for super-graph based classification, the main challenge is to properly calculate the distance between two super-graphs:
• Similarity between two super-nodes: Because each super-node is a graph, the over-
lapped/intersected graph structure between two super-nodes reveals the similarity
between two super-nodes, as well as the relationship between two super-graphs. The traditional hard node-matching mechanism is unsuitable for super-graphs, which require soft node-matching.
Figure 5-2: A conceptual view of a protein interaction network using super-graph repre-
sentation.
• Similarity between two super-graphs: The complex structure of super-graphs re-
quires that the similarity measure considers not only the structure similarity, but also
the super-node similarity between two super-graphs. This cannot be achieved with-
out combining node matching and graph matching as a whole to assess similarity
between super-graphs.
The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK)
for super-graphs. In this chapter, we generate a new product graph from two super-graphs
and then use weighted random walks on the product graph to calculate similarity between
super-graphs. A weighted random walk denotes a walk starting from a random weighted
node and following succeeding weighted nodes and edges in a random manner. The weight
of the node in the product graph denotes the similarity of two super-nodes. Given a set
of labeled super-graphs, we can use a weighted product graph to establish walk-based
relationship between two super-graphs and calculate their similarities. After that, we can
obtain the kernel matrix for super-graph classification. Experiments on two real-world
super-graph classification tasks demonstrate that the proposed method significantly outper-
forms baseline approaches.
The remainder of the chapter is structured as follows. Section 5.2 formulates the prob-
lem and defines important notations. The weighted random walk kernel is proposed in
Section 5.4. The super-graph classification algorithm is given in Section 5.5, followed by
theoretical analysis and theorems in Section 5.6. Experiments are reported in Section 5.7.
5.2 PROBLEM DEFINITION
DEFINITION 16 (Single-attribute Graph) A single-attribute graph is represented as g = (V, E, Att, f), where V = {v1, v2, · · · , vn} is a finite set of vertices, E ⊆ V × V denotes a finite set of edges, and f : V → Att is an injective function from the vertex set V to the attribute set Att = {a1, a2, · · · , am}. An attribute ai is a single symbol, e.g. a keyword, denoting node content. $\overrightarrow{Att}(g) = [a_1(g), \cdots, a_m(g)]$ denotes the attribute vector, where $a_i(g) = \sum_{j=1}^{n} h(f(u_j), a_i)$, with h(f(uj), ai) = 1 if f(uj) = ai, and h(f(uj), ai) = 0 otherwise.
DEFINITION 17 (Super-graph and Super-node) A super-graph is represented as G = (V, E, G, F), where V = {V1, V2, · · · , VN} is a finite set of graph-structured nodes, E ⊆ V × V denotes a finite set of edges, and F : V → G is an injective function from V to G, where G = {g1, g2, · · · , gM} is the set of single-attribute graphs. A node in the super-graph, which is represented by a single-attribute graph, is called a Super-node.
Formally, a single-attribute graph g = (V, E, Att, f) can be uniquely described by its attribute and adjacency matrices. The attribute matrix ϕ is defined by ϕri = 1 ⇔ ai ∈ f(vr), otherwise ϕri = 0. The adjacency matrix θ is defined by θij = 1 ⇔ (vi, vj) ∈ E, otherwise θij = 0. Similarly, the adjacency matrix Θ of a super-graph G = (V, E, G, F) is defined by Θij = 1 ⇔ (Vi, Vj) ∈ E, otherwise Θij = 0. However, because of the complicated structure of a super-graph, we cannot use a single matrix to describe its super-node information. To simplify later calculations, a super attribute matrix Φ ∈ R^{N×S} is defined for super-graph G, where N = |V| and S = |Att_{g1} ∪ Att_{g2} ∪ · · · ∪ Att_{gM}|, with Φri = 1 ⇔ ai ∈ Att_{gr}, otherwise Φri = 0.
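For illustration, the attribute matrix ϕ (and, analogously, the super attribute matrix Φ built from the attribute sets Att_{gr}) can be constructed as below; the helper name and data layout are assumptions of this sketch.

```python
import numpy as np

def attribute_matrix(node_attrs, vocab):
    """Build phi with phi[r, i] = 1 iff attribute vocab[i] labels node r;
    node_attrs: list of attribute sets, one per node."""
    index = {a: i for i, a in enumerate(vocab)}
    phi = np.zeros((len(node_attrs), len(vocab)))
    for r, attrs in enumerate(node_attrs):
        for a in attrs:
            phi[r, index[a]] = 1.0
    return phi

# e.g. attribute_matrix([{"patch"}, {"motion", "patch"}], ["patch", "motion"])
```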
A super-graph Gi is a labeled graph if a class label L(Gi) ∈ Y = {y1, y2, · · · , yx} is
assigned to Gi. A super-graph Gi is either labeled ($G^L_i$) or unlabeled ($G^U_i$).
DEFINITION 18 (Super-graph Classification) Given a set of labeled super-graphs $D^L = \{G^L_1, G^L_2, \cdots\}$, the goal of super-graph classification is to learn a discriminative model from $D^L$ to predict previously unseen super-graphs $D^U = \{G^U_1, G^U_2, \cdots\}$ with maximum accuracy.
5.3 OVERALL FRAMEWORK
In Fig. 5-3, we list the major steps of our super-graph based classification framework:
WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1, g2, g3), where
g1⊗2 and g1⊗3 are the single-attribute product graphs for (g1, g2) and (g1, g3), respectively,
and G⊗ is the super product graph for (G,G′). (θ1, ϕ1) and (Θ,Φ) are the adjacency and
attribute matrices for g1 and G, respectively. w1⊗2 and w are the weight vectors for g1⊗2 and
G⊗, respectively. Each element of w is equal to the WRWK of two super-nodes (as dotted
arrows show). A random walk on single-attribute product graph, say g1⊗3, is equivalent
to performing simultaneous random walks on graphs g1 and g3, respectively. Assume that
v1(a1) means node v1 with attribute a1 in g1. The walk v1(a1) → v3(a5) → v4(a1) in g1 can
match the walk v1′′(a1) → v3′′(a5) → v2′′(a1) in g3. The corresponding walk on the single-
attribute product graph g1⊗3 is: v11′′(a1) → v33′′(a5) → v42′′(a1). k(g1, g2) and k(g1, g3)
are the kernels. k(g1, g2) < k(g1, g3) means g3 is more similar to g1 than g2. Likewise,
K(G,G′) is the WRWK of G and G′. The essential challenge is twofold: (1) Super-node
similarity measurement; and (2) Super-graph similarity measurement.
5.4 WEIGHTED RANDOM WALK KERNEL
Defining a kernel on a space X paves the way for using kernel methods [60] for classifica-
tion, regression, and clustering. To define a kernel function for super-graphs, we employ a
random walk kernel principle [25].
Our Weighted Random Walk Kernel (WRWK) is based on a simple idea: Given a pair
of super-graphs (G1 and G2), we can use them to build a product graph, where each node
of the product graph contains attributes shared by super-nodes in G1 and G2. For each node
Figure 5-3: WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1, g2,g3).
in the product graph, we can assign a weight value (which is based on the similarity of the
two super-nodes generating the current node). Because a weighted product node reflects content that appears in both G1 and G2, we can perform random walks
through these nodes to measure the similarity by counting the number of matching walks
(the walks through nodes containing intersected attribute sets) and combining weights of
the nodes. The larger the number, the more similar the two super-graphs are. Adding weight values to the random walk is meaningful: it provides a way to take the graph matching of super-nodes into consideration when calculating the similarity between super-graphs. We
consider the similarity between any two super-nodes as the weight value instead of just us-
ing 1/0 hard-matching. The node similarities between different super-nodes represent the
relationship between two graphs in a more precise way, which, in turn, helps improve the
classification accuracy. Accordingly, we divide our kernel design into two parts based on
the similarity of super-nodes and super-graphs.
Graph kernels provide a natural tool to answer the above questions. Among others, one of the most powerful graph kernels is based on random walks, and it has been successfully applied to many real-world applications. There are several advantages to using a random walk kernel. Firstly, the matching paths between two graphs consider the node similarities, as only paths with the same sequence of attributed nodes can be matched. Secondly, the random walk kernel also considers the structure similarities, as it traverses all matching paths between two graphs with no tottering. These paths contain all structure information, and previous work has shown that they achieve better accuracy on classification and clustering benchmarks.
5.4.1 Kernel on Single-attribute Graphs
We first introduce the weighted random walk kernel on single-attribute graphs.
DEFINITION 19 (Single-attribute Product Graph) Given two single-attribute graphs
g1 = (V1, E1, Att1, f1) and g2 = (V2, E2, Att2, f2), their single-attribute product graph
is denoted by g1⊗2 = (V ∗, E∗, Att∗, f ∗) (g⊗ for short), where
• V ∗ = {v|v =< v1, v2 >, v1 ∈ V1, v2 ∈ V2};
• E∗ = {e | e = (u′, v′), u′ ∈ V∗, v′ ∈ V∗, f∗(u′) ≠ φ, f∗(v′) ≠ φ, u′ = <u1, u2>, v′ = <v1, v2>, (u1, v1) ∈ E1, (u2, v2) ∈ E2};
• Att∗ = Att1 ∪ Att2;
• f ∗ = {f ∗(v)|f ∗(v) = f(v1) ∩ f(v2), v =< v1, v2 >, v1 ∈ V1, v2 ∈ V2}.
In other words, g⊗ is a single-attribute graph where a vertex v is the intersection be-
tween a pair of nodes in g1 and g2. There is an edge between a pair of vertices in g⊗, if and
only if an edge exists in corresponding vertices in g1 and g2, respectively. An example is
shown in Fig. 5-3 (g1⊗2). In the following, we show that an inherent property of the product graph is that performing a weighted random walk on the product graph is equivalent to performing simultaneous random walks on g1 and g2. The single-attribute product graph thus provides an effective way to count matching walks between two graphs, combining the weight values on nodes, without expensive graph matching.
To generate g⊗’s adjacency matrix θ⊗ from g1 and g2 by using matrix operations, we
define the Attributed Product as follows:
DEFINITION 20 (Attributed Product) Given matrices $B \in \mathbb{R}^{n\times n}$, $C \in \mathbb{R}^{m\times m}$ and $H \in \mathbb{R}^{n'\times m'}$, the attributed product $B \odot C \in \mathbb{R}^{nm\times nm}$ and the column-stacking operator $vec(H) \in \mathbb{R}^{n'm'}$ are defined as
$$B \odot C = [vec(B_{*1}C_{1*})\ vec(B_{*1}C_{2*})\ \cdots\ vec(B_{*n}C_{m*})],$$
$$vec(H) = [H_{*1}^{\top}\ H_{*2}^{\top}\ \cdots\ H_{*m'}^{\top}]^{\top},$$
where $H_{*i}$ and $H_{j*}$ denote the $i$th column and $j$th row of $H$, respectively.
Based on Def. 20, the adjacency matrix θ⊗ of the single-attribute product graph g⊗ can be directly derived from g1(θ1, ϕ1) and g2(θ2, ϕ2) as follows:
$$\theta_{\otimes} = (\theta_1 \odot \theta_2) \odot vec(\phi_1\phi_2^{\top}) \qquad (5.1)$$
where
$$B \odot \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} B_{11} \wedge x_1 & \cdots & B_{1n} \wedge x_1 \\ \vdots & \ddots & \vdots \\ B_{n1} \wedge x_n & \cdots & B_{nn} \wedge x_n \end{bmatrix},$$
and “∧” is a conjunction operation (a ∧ b = 1 iff a = 1 and b = 1).
As a result of the above process, we can use matrix operations to count random walks. More specifically, for the adjacency matrix θx of a graph gx, each element $[\theta_x^z]_{ij}$ of its z-th power provides the number of walks of length z from vi to vj in gx.
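For example, for a triangle graph the third power of the adjacency matrix counts the length-3 walks:

```python
import numpy as np

theta = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])               # adjacency matrix of a triangle
walks = np.linalg.matrix_power(theta, 3)   # walks[i, j] = number of length-3
                                           # walks from v_i to v_j: 2 on the
print(walks)                               # diagonal, 3 off the diagonal
```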
According to Eq. (5.1), performing a random walk on the single-attribute product graph g⊗ is equivalent to performing simultaneous random walks on the graphs g1 and g2. After g⊗ is generated, the Attributed Random Walk Kernel (ARWK), which computes the similarity between g1 and g2, can be defined with a sequence of weights δ = δ0, δ1, · · · (δi ∈ R and δi ≥ 0 for all i ∈ N):
$$K(g_1, g_2) = \rho \sum_{i,j=1}^{n_1 n_2}\left[\sum_{z=1}^{\infty} \delta_z\, \theta_{\otimes}^{z}\right]_{ij} \qquad (5.2)$$
where n1 and n2 are the node sizes of g1 and g2, respectively, and ρ is a control parameter used to make the kernel function converge. As a result, kernel values are upper bounded (the proof is given later).
To compute the ARWK for single-attribute graphs, as defined in Eq. (5.2), a diagonalization decomposition method [25] can be used. Because θ⊗ is a symmetric matrix, the diagonalization decomposition of θ⊗ exists: $\theta_{\otimes} = T H T^{-1}$, where the columns of T are its eigenvectors and H is a diagonal matrix of the corresponding eigenvalues. The kernel defined in Eq. (5.2) can then be rewritten as:
$$K(g_1, g_2) = \rho \sum_{i,j=1}^{n_1 n_2}\left[\sum_{z=1}^{\infty} \delta_z\, (T H^z T^{-1})\right]_{ij} = \rho \sum_{i,j=1}^{n_1 n_2}\left[T \left(\sum_{z=1}^{\infty} \delta_z H^z\right) T^{-1}\right]_{ij} \qquad (5.3)$$
By setting δz = λ^z/z! in Eq. (5.3) and using $e^x = \sum_{z=0}^{\infty} x^z/z!$, we have
$$K(g_1, g_2) = \rho \sum_{i,j=1}^{n_1 n_2}\left[T (e^{\lambda H} - I)\, T^{-1}\right]_{ij} \qquad (5.4)$$
where I is an identity matrix of the same size as θ⊗. The diagonalization decomposition can greatly expedite the ARWK kernel computation.
Here we set
$$\rho = \frac{1}{c\,(e^{\lambda(c-1)} - 1)}, \quad \text{where } c = \sum_{i}^{n_1}\sum_{j}^{n_2} [\phi_1\phi_2^{\top}]_{ij} \qquad (5.5)$$
Then we have the following theorem.
THEOREM 2 Given any two single-attribute graphs g1 and g2, the attributed random
walk kernel of these two graphs is bounded by 0 ≤ K(g1, g2) ≤ 1.
The proposed ARWK is different from the traditional random walk kernel, which simply
calculates the total number of shared random walks between two graphs, but the number of
shared walks can increase as the number of nodes/edges increase. As a result, the similarity
based on traditional random walk kernel is not bounded. In comparison, ARWK first finds
all shared nodes between two graphs (where a shared node means a node with the same
attribute in both graphs), and then calculates the ratio between the number of random walks
among shared nodes of two graphs and the number of random walks on the complete graph
formed by using all shared nodes (the number of random walks on the complete graph
is given by ρ^{-1} in Eq. (5.5), as proved in Theorem 4). We call this ratio the
Structural Integrity Ratio, which is proved to be bounded. This not only provides a bounded
measure for similarity assessment, but also provides an effective way to combine attribute
and structure information to assess similarity between two graphs under the shared node
information.
We also refer to the proposed ARWK as the Weighted Random Walk Kernel (WRWK) for single-attribute graphs. An example of the WRWK kernel is shown in Fig. 5-3.
5.4.2 Kernel on Super-Graphs
The WRWK of single-attribute graphs helps calculate the similarity between two graphs,
so it can be used to calculate the similarity between two super-nodes. Given a pair of
super-graphs G1 and G2, assume we can generate a new product graph G⊗ whose nodes
are generated by super-nodes in G1 and G2, and weight value of each node is the similarity
between the super-nodes which generate this node, then the same process shown in Sec-
tion 5.4.1 can be used to calculate the WRWK for super-graphs G1 and G2 to denote their
similarity.
DEFINITION 21 (Super Product Graph) Given two super-graphs G1 = (V1, E1, G1, F1) and G2 = (V2, E2, G2, F2), their Super Product Graph is denoted by G1⊗2 = (V∗, E∗, G∗, F∗) (G⊗ for short), where
• V∗ = {V | V = <V1, V2>, V1 ∈ V1, V2 ∈ V2};
• E∗ = {e | e = (V, V′), V ∈ V∗, V′ ∈ V∗, F∗(V) ≠ φ, F∗(V′) ≠ φ, V = <V1, V2>, V′ = <V′1, V′2>, (V1, V′1) ∈ E1, (V2, V′2) ∈ E2};
• G∗ = G1 ∪ G2;
• F∗ = {F∗(V) | F∗(V) = F(V1) ∩ F(V2), V = <V1, V2>, V1 ∈ V1, V2 ∈ V2}.
An example of a super product graph is shown in Fig. 5-3 (G⊗). Similar to Eq. (5.1), the adjacency matrix Θ⊗ of the super product graph G⊗ can be directly derived from G1(Θ1, Φ1) and G2(Θ2, Φ2) as follows:
$$\Theta_{\otimes} = (\Theta_1 \odot \Theta_2) \odot vec(\Phi_1\Phi_2^{\top}) \qquad (5.6)$$
Because we use the similarity between two super-nodes as the weight value of the node in the super product graph, the kernel value may increase infinitely with the increasing size of the super-graph. So for the super-graph kernel, we add a control variable η to limit the range of the super-graph kernel value. The Weighted Random Walk Kernel, which computes the similarity between G1 and G2, can then be defined with a sequence of weights σ = σ0, σ1, · · · (σi ∈ R and σi ≥ 0 for all i ∈ N):
$$K(G_1, G_2) = \eta \sum_{i,j=1}^{N_1 N_2}\left[\sum_{z=0}^{\infty} \sigma_z\, (w w^{\top} \odot \Theta_{\otimes})^{z}\right]_{ij} \qquad (5.7)$$
where N1 and N2 are the node sizes of G1 and G2, respectively, and
$$w = [k(g_1, g_1)\ k(g_1, g_2)\ \cdots\ k(g_{N_1}, g_{N_2})]^{\top}$$
Similar to the WRWK of single-attribute graphs in Eq. (5.4), the WRWK on super-graphs can be calculated by setting σz = γ^z/z!, which gives
$$K(G_1, G_2) = \eta \sum_{i,j=1}^{N_1 N_2}\left[\tilde{T}\, e^{\gamma \tilde{H}}\, \tilde{T}^{-1}\right]_{ij} \qquad (5.8)$$
where $w w^{\top} \odot \Theta_{\otimes} = \tilde{T}\tilde{H}\tilde{T}^{-1}$. To ensure kernel function convergence, we set
$$\eta = \frac{1}{N_1 N_2}\, e^{\frac{1 - N_1^2 N_2^2}{N_1 N_2}\,\gamma e^{2\lambda}} \qquad (5.9)$$
where γ is the parameter of the WRWK on super-graphs, and λ is the parameter of the WRWK on single-attribute graphs given in the previous subsection. The WRWK of super-graphs is also upper bounded, with the proof shown in Section 5.6.
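A companion sketch for the super-graph kernel of Eqs. (5.6)-(5.9), under the same Kronecker-product realization as the previous sketch. It takes the N1 × N2 matrix of pairwise super-node kernels (e.g. computed with arwk() above) as input, and uses the reconstruction of η in Eq. (5.9); both are assumptions of this sketch.

```python
import numpy as np

def wrwk_super(Theta1, Phi1, Theta2, Phi2, node_kernels, gamma=0.1, lam=0.1):
    """Theta*: super-graph adjacency matrices; Phi*: super attribute matrices;
    node_kernels: N1 x N2 matrix of k(g_i, g_j) over all super-node pairs."""
    N1, N2 = Theta1.shape[0], Theta2.shape[0]
    w = node_kernels.reshape(-1, order="F")          # column-stacked weights
    mask = (Phi1 @ Phi2.T > 0).reshape(-1, order="F").astype(float)
    Theta_prod = np.kron(Theta2, Theta1) * np.outer(mask, mask)
    A = np.outer(w, w) * Theta_prod                  # w w^T (.) Theta_prod
    vals, vecs = np.linalg.eigh(A)                   # A = T~ H~ T~^T
    eta = np.exp((1 - (N1 * N2) ** 2) / (N1 * N2) * gamma * np.exp(2 * lam)) \
        / (N1 * N2)                                  # Eq. (5.9) as reconstructed
    series = vecs @ np.diag(np.exp(gamma * vals)) @ vecs.T  # z starts at 0
    return eta * series.sum()
```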
5.5 SUPER-GRAPH CLASSIFICATION
The WRWK provides an effective way to measure the similarity between super-graphs.
Given a number of labeled super-graphs, we can use their pair-wise similarities to form a
kernel matrix. Then generic classifiers, such as Support Vector Machine (SVM), Decision
Tree (DT), Naive Bayes (NB) and Nearest Neighbour (NN), can be applied to the kernel
matrix for super-graph classification.
Algorithm 4 shows the framework of using WRWK to train a classifier.
Algorithm 4 WRWK Classifier Generation
Input: Labeled super-graph set DL, kernel parameters λ and γ
Output: Classifier ζ
Initialize: Kernel matrix M ← [ ]
1: for any two super-graphs Ga and Gb in DL do
2:   w⃗ ← [ ]
3:   for any gx in Ga and gy in Gb do
4:     q ← (x − 1)Nb + y  // q is the element index of w⃗
5:     if Attx ∩ Atty = φ then
6:       w⃗q ← 0
7:     else
8:       ρ ← 1/(nxny), w ← [1/(nxny), · · · , 1/(nxny)], θ⊗ ← (θx ⊙ θy) ⊙ vec(ϕxϕy⊤)
9:       w⃗q ← k(gx, gy)  // k(gx, gy) is calculated by Eq. (5.4)
10:    end if
11:  end for
12:  η ← (1/(NaNb)) e^(((1−Na²Nb²)/(NaNb)) γe^(2λ)), Θ⊗ ← (Θa ⊙ Θb) ⊙ vec(ΦaΦb⊤)
13:  Mab ← K(Ga, Gb)  // K(Ga, Gb) is calculated by Eq. (5.8)
14: end for
15: ζ ← TrainClassifier(C, M, L)  /* C is the learning algorithm; L is the class label vector of DL */
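In practice, the kernel matrix from Algorithm 4 can be fed to any kernel-based learner. A small usage sketch with scikit-learn's precomputed-kernel SVM (one of the classifiers listed above) follows; kernel_fn stands for any WRWK implementation and is a placeholder.

```python
import numpy as np
from sklearn.svm import SVC

def train_wrwk_svm(kernel_fn, supergraphs, labels):
    """Build the pairwise WRWK kernel matrix and train an SVM on it."""
    n = len(supergraphs)
    K = np.zeros((n, n))
    for a in range(n):
        for b in range(a, n):                        # the kernel is symmetric
            K[a, b] = K[b, a] = kernel_fn(supergraphs[a], supergraphs[b])
    clf = SVC(kernel="precomputed").fit(K, labels)   # SVM on the kernel matrix
    return clf, K
```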
5.6 THEORETICAL STUDY
In this section, we derive the theoretical basis to show the rationality of the WRWK.
THEOREM 3 The Weighted Random Walk Kernel function is positive definite.
Proof 1 As the random walk-based kernel is closed under products [25] and the WRWK
can be written as the limit of a polynomial series with positive coefficients (as Eqs. (5.4)
and (5.8)), the WRWK function is positive definite.
THEOREM 4 Given any two single-attribute graphs g1 and g2, the weighted random walk
kernel of these two graphs is bounded by 0 < k(g1, g2) < eλ, where λ is the parameter of
the weighted random walk kernel.
Proof 2 Because k(g1, g2) is a positive definite kernel by Theorem 3, we have k(g1, g2) > 0. We therefore only need to show that the upper bound of k(g1, g2) is e^λ.
Based on the definition of the WRWK, assume the node sizes of two single-attribute graphs g1 and g2 are n1 and n2, respectively. The number of random walks on g1 ⊗ g2 must be no greater than that on the complete graph gc with n1 × n2 nodes. We assume that gc = g′1 ⊗ g′2, where g′1 and g′2 have n1 and n2 nodes, respectively. So we have
k(g1, g2) < k(g′1, g′2). More specifically,
$$k(g_1', g_2') = \rho \sum_{i,j=1}^{n_1 n_2}\left[\sum_{z=0}^{\infty} \delta_z\, (w' w'^{\top} \odot \theta_{\otimes}')^{z}\right]_{ij} = \rho \sum_{z=0}^{\infty}\left\{\delta_z \sum_{i,j=1}^{n_1 n_2}\left[(w' w'^{\top} \odot \theta_{\otimes}')^{z}\right]_{ij}\right\},$$
because ρ = 1/(n1n2), w′ = [1/(n1n2), 1/(n1n2), · · · , 1/(n1n2)]^⊤, and gc is a complete graph, whose adjacency matrix has diagonal elements equal to 0 and all other elements equal to 1. So
$$\sum_{i,j=1}^{n_1 n_2}\left[(w' w'^{\top} \odot \theta_{\otimes}')^{z}\right]_{ij} = \left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z-1}\left(1 - \frac{1}{n_1 n_2}\right).$$
Then
$$k(g_1', g_2') = \frac{1}{n_1 n_2}\sum_{z=0}^{\infty}\left[\delta_z \left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z-1}\left(1 - \frac{1}{n_1 n_2}\right)\right] = \frac{1}{n_1 n_2}\left(1 - \frac{1}{n_1 n_2}\right)\sum_{z=0}^{\infty}\left[\delta_z \left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z-1}\right].$$
Because δz = λ^z/z!, we have
$$k(g_1', g_2') = \frac{1}{n_1 n_2}\left(1 - \frac{1}{n_1 n_2}\right)\left(\frac{n_1^2 n_2^2}{n_1 n_2 - 1}\right)\sum_{z=0}^{\infty}\left[\frac{\lambda^z}{z!}\left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z}\right].$$
Because $e^x = \sum_{z=0}^{\infty} x^z/z!$, we have
$$k(g_1, g_2) < k(g_1', g_2') = e^{\frac{\lambda(n_1 n_2 - 1)}{n_1^2 n_2^2}} < e^{\lambda}.$$
THEOREM 5 The weighted random walk kernel between super-graphs G1 and G2 is bounded by $0 < K(G_1, G_2) < e^{\gamma e^{2\lambda} - 2\lambda}$, where λ and γ are the weighted random walk kernel parameters for single-attribute graphs and super-graphs, respectively.
Theorem 5 can be derived in a way similar to Theorem 4.
5.7 EXPERIMENTS AND ANALYSIS
In this section we first discuss the benchmark data and the experimental settings, and then
report detailed experimental results and analysis.
5.7.1 Benchmark Data
We carry out experimental studies on real-world datasets from two different domains:
DBLP scientific publication dataset and Beer Review dataset.
• DBLP Dataset: DBLP dataset consists of bibliography data in computer science1.
Each record in DBLP is a scientific publication with a number of attributes such as
abstract, authors, year, venue, title, and references. To build super-graphs, we se-
lect papers published in Artificial Intelligence (AI: IJCAI, AAAI, NIPS, UAI, COLT,
ACL, KR, ICML, ECML and IJCNN) and Computer Vision (CV: ICCV, CVPR,
ECCV, ICPR, ICIP, ACM Multimedia and ICME) fields to form a classification task.
The goal is to predict which field (AI or CV) a paper (i.e. a super-graph) belongs
to by using the abstract of each paper (i.e. a super-node) and abstracts of references
(i.e. other super-nodes), as shown in Fig. 5-4, where the left figure shows that each
paper cites a number of references. For each paper (or each reference), its abstract
1http://arnetminer.org/citation/
Figure 5-4: An example of using super-graph representation for scientific publications.
can be converted into a graph. So each paper denotes a super-node, and the citation relationships between papers form a super-graph. The sub-figure of Fig. 5-4 to the right shows a graph representation of a paper abstract, with each node denoting a keyword: the weight values between nodes indicate correlations between keywords, and by using a proper threshold we can convert each abstract into an undirected unweighted graph. An
graph. An edge (undirected) between two super-nodes indicates a citation relation-
ship between two papers. For each paper, we use fuzzy cognitive map (E-FCM) [50]
to convert paper abstract into a graph (which represents relations between keywords
with weights over a threshold (i.e. edge-cutting threshold)). This graph representa-
tion has shown better performance than simple bag-of-words representation [4]. We
select 1000 papers (500 in each class), each of which contains 1 to 10 references, to
form 1000 super-graphs.
• Beer Review Dataset: The online beer review dataset consists of review data for
beers2. Each review in the dataset is associated with some attributes such as appear-
ance score, aroma score, palate score, and taste score (rating of the product varies
from 1 to 5), and detailed review texts. Our goal is to classify each beer to the right
style (Ale vs. Not Ale) by using customer reviews. The graph representation for re-
views is similar to the sub-graphs in DBLP dataset. Each review is represented as
a super-node. The edges between super-nodes are built using the following method: Be-
cause a review has four rating scores in appearance, aroma, palate, and taste, we use
2http://beeradvocate.com/
Figure 5-5: Super-graph and comparison graph representations.
these four scores as a feature vector for each review. If two reviews’ distance in the
feature space (Euclidean distance) is less than 2, an edge is used to link two reviews
(i.e. super-nodes). We choose 1000 beer products, half from Ale and the rest from Lager and Hybrid styles, to form 1000 super-graphs for classification.
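The edge-construction rule for the Beer Review super-graphs can be sketched as below, assuming the four rating scores of each review are stacked into a NumPy array; the function name is illustrative.

```python
import numpy as np

def review_edges(scores, thresh=2.0):
    """scores: (n, 4) array of appearance/aroma/palate/taste ratings.
    Links two reviews (super-nodes) whose Euclidean distance is below thresh."""
    n = len(scores)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(scores[i] - scores[j]) < thresh]
```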
5.7.2 Experimental Settings
Baseline Methods: Because no existing method can handle super-graph classification,
for comparison purposes, we use two approaches to generate traditional graph representa-
tions from each super-graph: (1) randomly selecting one attribute (i.e. one word) in each super-node (and ignoring all other words) to form a single-attribute graph, as shown in Fig. 5-5 (C). We repeat the random selection five times, each time generating a set of single-attribute graphs from the super-graphs. Each single-attribute graph set is used to train a classifier, and their majority-voted accuracy on the test super-graphs is reported in the experiments. (2) We use the attribute set of the single-attribute graph in each super-node as the multi-attributes of the super-node to generate an attributed graph (which is equivalent to removing all the edges inside each super-node, as shown in Fig. 5-5 (B)). For both graph representations, we use the traditional random walk kernel (RWK) to measure the similarity between any two graphs [25].
We use 10 times 10-fold cross-validation classification accuracy to measure and com-
pare the algorithm performance. To train classifiers from graph data, we use Naive Bayes
(NB), Decision Tree (DT), Support Vector Machines (SVM) and Nearest Neighbor algo-
rithm (NN). Majority examples are based on the kernel parameter settings: λ = 100 and
100
γ = 100.
All experiments are conducted on a machine with 4GB RAM and an Intel Core i5 3.10 GHz CPU.
5.7.3 Results and Analysis
Performance on standard benchmarks: In Fig. 5-6, we report the classification accu-
racy on the benchmark datasets. For DBLP dataset, the number of super-nodes in each
super-graph varies within 1-10 and the edge-cutting threshold for single-attribute graph is
0.001. For Beer Review dataset, the number of super-nodes in each super-graph varies
within 30-60 and the edge-cutting threshold is 0.001. WRWK uses the super-graph representation (Fig. 5-5 (A)); RWK1 and RWK2 use the traditional random walk kernel method with the attributed graph representation (Fig. 5-5 (B)) and the single-attribute graph representation (Fig. 5-5 (C)), respectively. The experimental results show that WRWK consistently outperforms the traditional RWK method, regardless of the type of graph representation used
by RWK. This is mainly because in traditional graph representations each node only has
one attribute or a set of independent attributes, whereas single-attribute nodes (or multiple
independent attributes) cannot precisely describe the node content. For super-graphs, the
graph associated to each super-node provides an effective way to describe the node content.
In WRWK method, we consider the similarity between any two super-nodes as the weight
value instead of just using 1/0 hard matching to represent whether there is an intersection
of attribute sets between two nodes. The soft matching node similarities between super-
nodes capture the relationship between two graphs in a more precise way. This, in turn,
helps improve the classification accuracy. Moreover, from Fig. 5-6 we can see that the accuracy increases with increasing information content in the nodes.
Performance under different super-graph structures: To demonstrate the performance
of our WRWK method on super-graphs with different characteristics, we construct super-
graphs on Beer Review dataset by using different super-node sizes and different structures
of single-attribute graph (the structure of the single-attribute node is controlled by the edge-
cutting threshold) as shown in Fig. 5-7 (a)-(d). Figures correspond to three types of super-
Figure 5-6: Classification accuracy on DBLP and Beer Review datasets w.r.t. different classification methods (NB, DT, SVM, and NN).
graph structure on Beer Review. Data 1: the number of super-nodes in each super-graph
is 2-30 and the edge-cutting threshold is 0.00001. Data 2: the number of super-nodes in
each super-graph is 30-60 and the edge-cutting threshold is 0.001. Data 3: the number of
super-nodes in each super-graph is > 60 and the threshold is 0.1. The result shows that our
WRWK method is stable on super-graphs with different structures.
Performance w.r.t. changes in super-nodes and walks: The proposed weighted random
walk kernel relies on the similarity between super-nodes and the common walks between
two super-graphs to calculate the graph similarity. This raises a concern on whether super-
node similarity or walk similarity (or both) plays a more important role in assessing the
super-graph similarity.
In order to resolve this concern, we design the following experiments. In the first set
of experiments (Fig. 5-8 (a) and (b)), we fix super-graph edges, and change edges inside
super-nodes, which will impact the super-node similarities. If this results in significant
changes, it means that super-node similarity plays a more important role. In the second
set of experiments (Fig. 5-8 (c)), we fix super-nodes but vary the edges in super-graphs
(by randomly removing edges), which will impact the common walks between super-
graphs. If this results in significant changes, it means that walk plays a more important role
than super-nodes.
Fig. 5-8 (a) and (b) report the algorithm performance with respect to the edge-cutting
Figure 5-7: Classification accuracy on Beer Review dataset w.r.t. different datasets and classification methods (NB, DT, SVM, and NN).
threshold. The accuracy decreases dramatically with the increase of the edge-cutting thresh-
old for both datasets. When the edge-cutting threshold is set to 0.0001 on DBLP and
0.00001 on Beer Review, the single-attribute graph in each super-node is almost a com-
plete graph and four methods achieve the highest classification accuracy. As the threshold
is set to 0.01 on DBLP and 0.1 on Beer Review, the single-attribute graph in each super-
node is very small and contains very few edges. As a result, the accuracies are just around
65%. This demonstrates that WRWK heavily relies on the structure information of each
super-node to assess the super-graph similarities.
Figure 5-8: The performance w.r.t. different edge-cutting thresholds on DBLP and Beer Review datasets by using the WRWK method.

Fig. 5-8 (c) reports the algorithm performance by fixing the super-nodes but gradually removing edges in the super-graphs of the Beer Review dataset (as in Fig. 5-6 (b)). In (a) and (b), the number of super-nodes in each super-graph varies from 0-10 on the DBLP dataset and 30-60 on the Beer Review dataset. The x-axis shows the value of the edge-cutting threshold, and the average node degree in the corresponding single-attribute graph is reported in parentheses. In
(c), the average number of edges cut from each super-graph varies from 0 to 100. Simi-
lar to the result in Fig. 5-8 (a) and (b), the classification accuracy decreases when edges
are continuously removed from the super-graph (even if the super-nodes are fixed). From Fig. 5-8, we find that the structure of the super-graph is as important as that of the super-nodes.
This is mainly attributed to the fact that our weighted random walk kernel relies on both
super-node similarity and walks in the super-graph to calculate graph similarities.
Chapter 6
Streaming Network Node Classification
6.1 INTRODUCTION
Recent years have witnessed an increasing number of applications involving networked
data, where instances are not only characterized by the content but are also subject to de-
pendency relationships. For example, each node in a social network can denote one person
and links between nodes can represent their social interactions. For each node in the net-
work, its content can be described using features, such as the bag-of-word representation of
the user’s posts. The mixed node content and structure information raise many unique data
mining tasks, such as network node classification [3] where the goal is to combine node
content and network topology structures to classify unlabeled nodes in the network with a
maximal accuracy. Applications of network node classification include social spammer de-
tection [70], automatic email categorization [52], inferring personality from social network
structures [68], and image classification using social networks [51].
When classifying nodes in networks, existing methods can be roughly categorized into
three groups: (1) combining content and structure features into new feature vector repre-
sentation, such as iterative collective classification [61], and link-based classification [49];
(2) using network paths, such as random walks [18], to determine node labels; and (3)
using content information to build additional structure nodes and generate a new topology
network for classification [2]. The theme of all these methods is to leverage node content
and topology structures to infer correct labels for unlabeled nodes.
Figure 6-1: An example of streaming networks, where each color bar denotes a feature.
Existing node classification methods are carried out in a static network setting, with-
out taking evolving network structures and node concepts into consideration. In reality,
changes are essential components in real-world networks, mainly because user participa-
tion, interactions, and responses to external factors continuously introduce new nodes and
edges to the network. In addition, a user may add/delete/modify online posts, which naturally results in changes to the node content. As a result, the networks are inherently dynamic. In this thesis, we refer to this type of network, where the network structures and node content are continuously changing, as Streaming Networks. An example of streaming networks
is shown in Fig. 6-1, where the topology structures and the node feature distributions are
constantly changing with time. Specifically, at time point t2, new nodes (e.g., 4 and 5) and
relevant edges join the network; At t3, Node 5 and the edge between Nodes 1 and 2 are
removed. Over the whole period, node content may continuously change (e.g., the content
change in Node 3). Accurate node classification in a streaming network setting is therefore
much more challenging than in static networks: existing node classification methods for static networks cannot capture the changing information that may alter classification results, while rerunning such methods at every time point is prohibitively time-consuming. In summary, node classification in streaming networks
has at least three major challenges:
• Streaming network structures: Network topology structures encode rich informa-
tion about node interactions inside the network, which should be considered for node
classification. In streaming networks, topology structures are constantly changing, so
node classification needs to rapidly capture and adapt to such changes for maximal
accuracy gain.
• Streaming node features: For each node in a streaming network, its content may constantly evolve (w.r.t. user posts or profile updates). As a result, the feature space used to denote the node content is dynamically changing, resulting in streaming features [78] with an infinite feature space. To capture changes, a feature selection method should select the most effective features in a timely manner to ensure that node classification can quickly adapt to the new network.
• Continuously Changing Network Volumes: Because the node volumes of streaming networks are continuously changing, node classification needs to scale to the dynamic growth of the network volume by utilizing models discovered from historical data to boost the learning on new data.
For streaming networks, changes are introduced through two major channels: (1) node
content; and (2) topology structures. To achieve a high node classification accuracy, a
fundamental issue is how to properly characterize such changes. In this thesis, we propose
to address this issue using a feature driven framework, which uses node content features
to model and capture network changes for classification. Fig. 6-2 demonstrates how node
features can be used to characterize changes in the network. More specifically, nodes and
edges with solid lines denote network observed at time point t, while dashed circles and
edges mean networked data arriving at t+1. Nodes and edges with curved lines are removed
at t + 1, and the underlined features (keywords) are also removed at t + 1. Nodes are
colored based on their class labels, and white nodes mean unlabeled nodes. At time point
t, the keywords "System" and "Network" are selected as features to represent the node content (assuming the number of features is limited to 2) and classify nodes into two classes (red vs. cyan). At t + 1, the network changes, incurring new structures and node content. By updating the selected features and using "Spectrum" to replace "System", the new feature space {"Network", "Spectrum"} can effectively classify unlabeled nodes into the right classes.
The above observations motivate the proposed research that uses features to capture
changes in streaming networks for node classification. When a network is experiencing
Figure 6-2: An example of using feature selection to capture changes in a streaming net-
work (keywords inside each node denote node content).
changes in the topology structures and node content, we can try to identify a set of important
features that can best reveal such changed network structures and node content. Because
in a networked world, nodes close to each other in the network topology structure space
tend to share common content information [26], we can use selected features to design a
“similarity gauging” procedure to assess the consistency of the network node content and
structures in order to determine the labels of unlabeled nodes. A smaller gauging value
indicates that the node content and structures have a better alignment with the node label. So
the gauging based classification is carried out such that for an unlabeled node, its label is
the class which results in the minimal gauging value with respect to the identified features.
By updating the selected features, the node classification can automatically adapt to the
changes in the streaming network for maximal accuracy gain.
The main contribution of the chapter, compared to existing works, is twofold:
• Streaming Network Node Classification: We propose a new streaming network
node classification (SNOC) method that takes node content and structure similarity
into consideration to find important features to model changes in the network for
node classification. This method is not only more accurate than existing node clas-
sification approaches, but is also effective to capture changes in networks for node
classification.
• Streaming Network Feature Selection: We introduce a novel streaming network
feature selection framework, SNF, for streaming networks. To ensure that feature evaluation can adapt to the changes in the network in a timely manner, SNF can incrementally update
the evaluation score of an existing feature by accumulating changes in the network.
This allows our method to effectively handle streaming networks with changing fea-
ture space and feature distributions for better runtime and performance gain.
The remainder of the chapter is structured as follows. The problem definition and the
overall framework are introduced in Sect. 6.2. Sect. 6.3 introduces the proposed node
classification method, followed by experiments in Sect. 6.4.
6.2 PROBLEM DEFINITION AND FRAMEWORK
A streaming network contains a dynamic number of nodes and edges, and the node content
may also change in terms of new features or new feature values. At a given time point $t$, the network nodes are denoted by $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n_t}$, where $x_i \in \mathbb{R}^{d_t}$ is the original feature vector of node $i$, and $y_i \in \mathcal{Y} = \{0, 1, 2, \ldots, c\}$ is the label of node $i$; $n_t$ and $d_t$ denote the number of nodes and the dimensionality of the node feature space at time point $t$, which may vary with time. Specifically, $y_i = 0$ means that node $i$ is unlabeled. $A \in \mathbb{R}^{n_t \times n_t}$ is the adjacency matrix of the networked data, where $A_{ij} = 1$ if there is an edge (link) between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. A path $P_{ij}$ between nodes $i$ and $j$ is a sequence of edges, starting at $i$ and ending at $j$; the length of a path is the number of edges on it. For each adjacency matrix $A$, the element $[A^k]_{ij}$ of the $k$-th power matrix denotes the number of length-$k$ paths from $i$ to $j$ in the network [25].
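This walk-counting property of adjacency-matrix powers is easy to verify numerically; the following minimal numpy sketch uses a hypothetical 4-node graph ($[A^k]_{ij}$ counts walks, which may revisit nodes).

import numpy as np

# Adjacency matrix of a small undirected 4-node network
# with edges 0-1, 0-2, 1-2, and 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

A2 = np.linalg.matrix_power(A, 2)   # [A^2]_ij: length-2 paths from i to j
A3 = np.linalg.matrix_power(A, 3)   # [A^3]_ij: length-3 paths from i to j
print(A2[0, 3], A3[0, 3])           # 1 1, e.g. the length-2 path 0-2-3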
To represent network node content, we use $\mathcal{F} = \{f^1, \ldots, f^r, \ldots, f^{d_t}\}$ to denote the node feature space at time point $t$, where the feature dimension $d_t$ changes dramatically with time $t$. We use $X = [x_1, x_2, \ldots, x_{n_t}] = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_{d_t}]^\top \in \mathbb{R}^{d_t \times n_t}$ to represent the data matrix, and $C \in \mathbb{R}^{n_t \times n_t}$ represents the label relationship matrix of the networked data, where $C_{ij} = 1$ means nodes $i$ and $j$ are in the same class, and $C_{ij} = 0$ otherwise. We use $f^r$ to denote a feature, and use the bold-faced $\mathbf{f}_r$ to represent the indicator vector of feature $f^r$, where $[\mathbf{f}_r]_j$ records the actual value of feature $f^r$ in node $j$. In a binary feature representation (such as the bag-of-word representation for text), we have $[\mathbf{f}_r]_j = 1$ if feature $f^r$ appears in node $j$, and $[\mathbf{f}_r]_j = 0$ otherwise. Obviously, $\mathbf{f}_r$ helps capture the distribution of feature $f^r$ in the network.
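As a toy illustration (the keywords are hypothetical and not from any dataset used in this thesis), the indicator vectors are simply the rows of the binary data matrix:

import numpy as np

# Toy node contents as keyword sets; vocab plays the role of F.
docs = [{"system", "network"}, {"network"}, {"spectrum", "network"}]
vocab = sorted({w for d in docs for w in d})

# Binary data matrix X (features x nodes): [X]_rj = 1 iff feature f^r
# appears in node j, so row r is the indicator vector f_r.
X = np.array([[1 if w in d else 0 for d in docs] for w in vocab])
print(vocab)
print(X[vocab.index("network")])    # distribution of "network" over nodes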
Streaming network node classification aims to classify unlabeled nodes in the network,
at any time point t, with maximal accuracy. To capture dynamic changes of the network,
we propose to use feature selection to discover, in a timely manner, a feature subset $\mathcal{S}$ of size $m$ from $\mathcal{F}$. When discovering the feature set $\mathcal{S}$, both the node content and the network structures are
combined to find the most informative features at each single time point t. As a result, the
node classification can adapt to the changes in the network to achieve maximal accuracy.
Fig. 6-3 shows the framework of our streaming network node classification method
(SNOC). More specifically, Panel A: at time point t, the network is denoted by nodes and
edges with solid lines. Dashed nodes and edges denote new nodes and edges arriving at
time point t + 1. Colour bars in nodes denote different features, and curved bars mean that a feature appeared at t but is removed at t + 1, such as the purple bar in Node 3. Nodes and edges with curved lines exist at t but are removed at t + 1, like Node 4. Panel B: at time point t, candidate features and selected features are identified based on the F-Score $q_t(f^r)$. At time point t + 1, streaming network feature selection (SNF) updates the scores of old features (the candidate features at t) and also calculates feature scores for new features. Panel C: at time point t + 1, SNOC uses the selected features as a gauge to test whether to classify the unlabeled Node 5 as positive (+) or negative (-); the label with the smallest gauging value is used to label Node 5. To capture changes in streaming networks, an incremental streaming network feature selection method, SNF, is proposed to discover, in a timely manner, a set of the most informative features in the network. To classify unlabeled nodes, SNOC takes both label similarity and structure similarity into consideration and uses a quality criterion to find the most suitable label for an unlabeled node.
6.3 THE PROPOSED METHOD
To classify unlabeled nodes in a streaming network, our theme is to let (1) nodes sharing
the same class and having a high structure similarity be close to each other, and (2) nodes
belonging to different classes and having a weak structure relationship be far away from
Figure 6-3: The framework of the proposed streaming network node classification (SNOC)
method.
each other. This is motivated by the commonly observed phenomenon [26] that nodes close
to each other in network topology structures tend to share common content information. For
example, friends in the same cohort group are more likely to share similar experiences or
interests, and a paper and its references often contain relevant research subjects/topics. Our
proposed theme is also consistent with the relational collective inference modeling [34] that
uses relationships between class labels and attributes of neighboring objects in the network
for classification.
Following the above theme, we can regard streaming network node classification as
an optimization problem, which tries to find the optimal assignment of class labels to the unlabeled node set $\mathcal{X}^u$, such that the assigned class labels $\mathcal{Y}^u \subseteq \mathcal{Y}$ make the whole network maximally comply with the proposed theme, as defined in Eq. (6.1).
\[
\mathcal{Y}^{u*} = \arg\min_{\mathcal{Y}^u \subseteq \mathcal{Y}} E(\mathcal{Y}^u) \qquad (6.1)
\]
where $\mathcal{Y}^u$ is an assignment of labels to the unlabeled nodes in the network, and $\mathcal{Y}^{u*}$ is the
Following the node classification objective function in Eq.(6.1), the key question is
how to properly define utility function E(·). Clearly, the node content provides valuable
information to determine the label of each node, so we should define E(·) based on the
node content Feature Space. In streaming networks, the feature space used to denote
the node content is continuously changing with new features or updated feature values.
Using all features to represent the network is clearly suboptimal. If a set of good features
can be found to capture changes in the network, the node classification in Eq.(6.1) will
automatically adapt to the changes in the network for maximal accuracy. So Eq.(6.1) is
re-written as
\[
\mathcal{Y}^{u*} = \arg\min_{\mathcal{Y}^u \subseteq \mathcal{Y}} E(\mathcal{Y}^u, \mathcal{S}) \qquad (6.2)
\]
where $\mathcal{S}$ is the selected feature set used to capture changes in a streaming network.
Because the utility function E(Yu,S) is constrained by the selected features S, finding
the optimal S becomes the next challenge. Obviously, a good S should properly capture
network node relationships in terms of the node content, node labels, and node topology
structures. That means the node relationships in Feature Space should follow our proposed
theme with the node similarity being impacted by (1) the label-based similarity in the Label
Space; and (2) the structure-based similarity in the Structure Space.
Accordingly, node classification in Eq. (6.2) can be divided into two major steps: (1) finding an optimal feature set $\mathcal{S}$; and (2) finding an optimal assignment of labels to unlabeled nodes such that the utility score $E(\mathcal{Y}^u, \mathcal{S})$ calculated based on the selected features $\mathcal{S}$ has the minimal value. Therefore, we derive an updated evaluation criterion $E(\mathcal{Y}^u, \mathcal{S})$ as follows:
\[
\begin{aligned}
E(\mathcal{Y}^u, \mathcal{S}) &= \frac{1}{2} \sum_{i \in \mathcal{X}^u} \sum_{j \in \mathcal{X}} h(i, j, y_i)\, (D_{\mathcal{S}} x_i - D_{\mathcal{S}} x_j)^2 \\
\text{s.t.}\;\; &\min\Big( \frac{1}{2} \sum_{i, j \in \mathcal{X}} h(i, j)\, (D_{\mathcal{S}} x_i - D_{\mathcal{S}} x_j)^2 \Big), \quad \mathcal{S} \subseteq \mathcal{F},\; |\mathcal{S}| = m
\end{aligned} \qquad (6.3)
\]
where $h(i, j)$ is the similarity between nodes $i$ and $j$ in the network structure space, which will be formally defined in Eq. (6.8), and $h(i, j, y_i)$ is the similarity between nodes $i$ and $j$ conditioned on setting the label of unlabeled node $i$ to $y_i$. In Eq. (6.3), $(D_{\mathcal{S}} x_i - D_{\mathcal{S}} x_j)^2$ measures the feature-based distance between nodes $i$ and $j$ w.r.t. the currently selected features $\mathcal{S}$. $D_{\mathcal{S}}$ is a diagonal matrix indicating the features (from $\mathcal{F}$) that are selected into the feature set $\mathcal{S}$, where
\[
[D_{\mathcal{S}}]_{ij} =
\begin{cases}
1, & \text{if } i = j \text{ and } f^i \in \mathcal{S}; \\
0, & \text{otherwise.}
\end{cases}
\]
In Eq.(6.3), we use network structure similarity h(i, j, yi) as the weight value of the
node feature distance (DSxi − DSxj)2. If nodes i and j have a high structure similarity,
their feature distance will have a large weight value and therefore plays a more important
role in the objective function. In an extreme case, if nodes i and j have a zero structure
similarity, their feature distance will not have any impact on the objective function at all. By
doing so, we can effectively combine node structure similarity and node content distance
to assess the consistency of the whole network.
For streaming networks, the selected feature set S should be dynamically updated to
capture changes in the network. By using dynamic feature set S to guide the node classi-
fication, Eq.(6.3) provides an efficient way to classify nodes in dynamic networks. This is
mainly because that any significant changes in the network will be captured by S , and by
using S as gauge for node classification, our method can automatically adapt to the changes
in the network structures and node content.
Solving the objective function in Eq. (6.3) requires optimizing both variables ($D_{\mathcal{S}}$ and $\mathcal{Y}^u$). To solve Eq. (6.3), we divide the process into two parts: (1) we propose a novel streaming network feature selection framework, SNF, which takes both network structures and node labels into consideration to find the optimal feature set $\mathcal{S}$; and (2) we propose a Laplacian-based quality criterion to grade an unlabeled node with respect to different labels by using $\mathcal{S}$ as the gauge. Finally, node classification is achieved by finding the best labels, which result in the minimal gauging values.
6.3.1 Streaming Network Feature Selection
Given a streaming network, the network observed at a single time point t can be consid-
ered as a static network. In this subsection, we first introduce feature selection on a static
network, and then extend to streaming networks.
Feature Selection on a Static Network
We first define feature selection as an optimization problem. Our target is to find an optimal
set of features, which can best represent the network node content and structures.
The network edges and node labels both play important, yet different, roles for node
classification. We assume that the optimal feature set should have the following properties:
• Label Similarity: a) labeled nodes in the same class should be close to each other,
and labeled nodes in different classes should be far away from each other; b) unla-
beled nodes should be separated from each other.
• Structure Similarity: The structure similarity between nodes i and j is closely tied
to the number of paths and the path length between them. The more the number of
length-l paths between i and j, the higher their structure similarity is. The shorter the
path length between two nodes, the higher their structure similarity is.
Note that Item b) in the first bullet incorporates the distributions of unlabeled nodes,
and tends to select features that can separate nodes far from each other. It is similar to
the assumption of Principal Component Analysis, which is expressed as the average
squared distance between unlabeled samples [88]. Item b) intends to disfavor features that
are too rare or too frequent in the data set, because unlabeled nodes cannot be separated
from each other using these features [42].
The above two properties can be formalized as follows:
(1) Minimizing the Label Similarity Objective Function:
\[
J_L(f^r) = \frac{1}{2} \sum_{C_{ij}=1} (V_r^\top x_i - V_r^\top x_j)^2 - \frac{1}{2c} \sum_{C_{ij}=0} (V_r^\top x_i - V_r^\top x_j)^2 \qquad (6.4)
\]
where $c$ is the total number of classes, and $V_r$ is an indicator vector showing that feature $f^r$ is selected, defined as $[V_r]_i = 1$ if $i = r$, and $[V_r]_i = 0$ otherwise.

(2) Minimizing the Structure Similarity Objective Function:
\[
J_S(f^r) = \frac{1}{2} \sum_{i,j=1}^{n_t} \Theta_{ij}\, (V_r^\top x_i - V_r^\top x_j)^2 \qquad (6.5)
\]
where $\Theta_{ij}$ in Eq. (6.5) is the $l$-maximal-length path weight parameter between nodes $i$ and $j$, defined as follows:
\[
\Theta = \sum_{i=1}^{l} \frac{1}{2^{i-1}} A^i \qquad (6.6)
\]
Figure 6-4: An example of using feature selection to capture structure similarity.
The number of paths between two nodes has proven to be a good indicator of node structure similarity. The shorter the path between two nodes, the closer the two nodes are in structure, so the weight in Eq. (6.6) decreases as the path length increases. An example is shown in Fig. 6-4. More specifically, the left panel shows the network in the original feature space and the right panel shows the network in the selected feature space (which contains m = 6 features). In the right panel, Node 1 shares more paths with Node 3 than with Node 7, and the paths between Node 1 and Node 3 are shorter than the ones between Node 1 and Node 7. So Nodes 1 and 3 are closer to each other than Nodes 1 and 7 from the structure similarity perspective. The structure similarity is tied to the representation of the nodes in the selected feature space. If two nodes share an edge, they will be close to each other in the selected feature space (e.g. Node 1 and Node 5 share one edge, so they have three common features in the selected feature space). As the path length between two nodes increases, the distance between the nodes in the feature space also increases (e.g. Node 1 and Node 6 have no edge, so they share no common feature in the selected feature space).
In Eq. (6.5), Θ is used as a penalty factor for two nodes that have high structure simi-
larity but are far away from each other in feature space. Intuitively, nodes close in topology
structure have a high probability of sharing similar node content [26]. So if any two nodes
i and j are close to each other in topology structure but have a large distance in the original
feature space, their $\Theta_{ij}$ value will increase the objective value and thus encourage the feature selection module to find similar features for $i$ and $j$. This provides a unique way to impose
network topology structures into the node feature selection process.
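A minimal numpy sketch of the structure weights in Eq. (6.6), assuming dense matrices (the function name is hypothetical):

import numpy as np

def structure_weights(A, l=3):
    # Theta = sum_{i=1..l} (1 / 2^{i-1}) A^i  (Eq. 6.6): path counts up
    # to length l, with longer paths contributing smaller weights.
    theta = np.zeros_like(A, dtype=float)
    power = np.eye(A.shape[0])
    for i in range(1, l + 1):
        power = power @ A                 # A^i counts length-i paths
        theta += power / 2.0 ** (i - 1)
    return theta

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(structure_weights(A, l=3))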
By combining the label similarity objective function in Eq. (6.4) and the structure similarity objective function in Eq. (6.5), we can form a combined evaluation criterion for each feature $f^r$ as follows:
\[
\mathcal{J}(f^r) = \xi \cdot J_L(f^r) + (1 - \xi) \cdot J_S(f^r) \qquad (6.7)
\]
where ξ (0 ≤ ξ ≤ 1) is the weight parameter used to balance the contributions of
network structures and node labels. The ξ value allows users to fine-tune structure and label
similarity in the feature selection for networks from different domains. In Section 6.4, we
will report the algorithm performance w.r.t. different ξ values on benchmark networks. An
example of using feature selection to capture structure similarity is also shown in Figure 6-
4.
By defining a weight matrix $W = [W_{ij}]_{n_t \times n_t}$ as
\[
W_{ij} = [\xi,\; \xi/c] \cdot [C_{ij},\; C_{ij} - 1]^\top + (1 - \xi) \cdot \Theta_{ij} \qquad (6.8)
\]
we can rewrite Eq. (6.7) as follows:
\[
\begin{aligned}
\mathcal{J}(f^r) &= \frac{1}{2} \sum_{i,j=1}^{n_t} (V_r^\top x_i - V_r^\top x_j)^2\, W_{ij} = \frac{1}{2} \sum_{i,j=1}^{n_t} ([\mathbf{f}_r]_i - [\mathbf{f}_r]_j)^2\, W_{ij} \\
&= \mathbf{f}_r^\top D\, \mathbf{f}_r - \mathbf{f}_r^\top W\, \mathbf{f}_r = \mathbf{f}_r^\top L\, \mathbf{f}_r
\end{aligned} \qquad (6.9)
\]
where $D$ is a diagonal matrix whose entries are the column sums of $W$, i.e., $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is a Laplacian matrix.

In Eq. (6.8), $W_{ij}$ is equal to the structure similarity matrix $h(i, j)$ in Eq. (6.3), so the constraint part in Eq. (6.3) is equivalent to minimizing $\sum_{f^r \in \mathcal{S}} \mathcal{J}(f^r)$.
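The weight construction in Eq. (6.8) and the identity in Eq. (6.9) can be checked numerically; below is a small sketch with toy matrices (all values illustrative):

import numpy as np

def combined_weights(C, theta, xi=0.7, c=2):
    # Eq. (6.8): W_ij = xi*C_ij + (xi/c)*(C_ij - 1) + (1 - xi)*Theta_ij,
    # i.e. +xi for same-class pairs, -xi/c for different-class pairs,
    # plus the structure term.
    return xi * C + (xi / c) * (C - 1) + (1 - xi) * theta

C = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
theta = np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 1.0], [0.5, 1.0, 0.0]])
W = combined_weights(C, theta)
L = np.diag(W.sum(axis=1)) - W              # Laplacian L = D - W
f = np.array([1.0, 1.0, 0.0])               # indicator vector of a feature

# f^T L f equals (1/2) * sum_ij (f_i - f_j)^2 W_ij  (Eq. 6.9).
lhs = f @ L @ f
rhs = 0.5 * sum((f[i] - f[j]) ** 2 * W[i, j]
                for i in range(3) for j in range(3))
print(lhs, rhs)   # the two values coincide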
As a result, the problem of feature selection in a static network is equivalent to finding a subset $\mathcal{S}$ containing $m$ features that satisfies:
\[
\min \sum_{f^r \in \mathcal{S}} \mathcal{J}(f^r), \quad \text{s.t. } \mathcal{S} \subseteq \mathcal{F},\; |\mathcal{S}| = m \qquad (6.10)
\]

DEFINITION 22 (F-Score) Let $X = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_{d_t}]^\top$ represent the networked data, and let $W$ be the matrix defined in Eq. (6.8). $L$ is the Laplacian matrix defined as $L = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. We define a quality criterion $q$, called F-Score, for a feature $f^r$ as
\[
q(f^r) = \mathbf{f}_r^\top L\, \mathbf{f}_r \qquad (6.11)
\]

The solution to Eq. (6.10) can be found by using the F-Score to assess features in the original feature space $\mathcal{F}$. Suppose the F-Scores of all features are sorted in ascending order, $q(f^1) \le q(f^2) \le \cdots \le q(f^{d_t})$; then the solution of finding the $m$ most informative features is
\[
\mathcal{S} = \{f^r \mid r \le m\} \qquad (6.12)
\]
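Putting Eqs. (6.10)-(6.12) together, static-network feature selection reduces to scoring every indicator vector against the Laplacian and keeping the $m$ smallest scores; a minimal sketch with hypothetical toy inputs:

import numpy as np

def select_features(X, W, m):
    # Score every feature with q(f^r) = f_r^T L f_r (Eq. 6.11) and
    # return the indices of the m smallest scores (Eq. 6.12).
    # X is the (d x n) data matrix whose rows are indicator vectors.
    L = np.diag(W.sum(axis=1)) - W
    scores = np.einsum('rn,nm,rm->r', X, L, X)   # all q(f^r) at once
    return np.argsort(scores)[:m], scores

X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
W = np.ones((4, 4)) - np.eye(4)               # toy weight matrix
selected, scores = select_features(X, W, m=2)
print(selected, scores)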
Feature Selection on Streaming Networks
When the network continuously evolves at different time points T = {t1, t2, . . . }, network
structure, including edges and nodes, and node features may change accordingly. So we
need to adjust the selected feature set S in order to characterize changed network. Com-
pletely rerunning the feature selection at each single time point from the scratch is time
consuming, especially for large size networks. In this section, we introduce an incremen-
tal feature selection method, which calculates the score of an old feature based on new
networked data and then combines it with the old feature scores to update the feature’s
final score. Such an incremental feature selection process enables our method to tackle the “Continuously Changing Network Volumes” challenge.
To incrementally update scores for old features, we separate networked data into two
parts: a) nodes and edges that already exist at time point t; and b) newly emerged (or disappeared) nodes and their relevant topology structures at t + 1. After that, we use Part
a) to get the changing parts of feature distributions in the old networks and use Part b) to
calculate local incremental scores and update the scores of existing features, respectively.
If the changed score of an old feature at t + 1 can be obtained by using Part a) and Part b)
efficiently, we can compute a feature score by combining its old score at t and the changed
score at t+ 1.
For ease of presentation, we define the following notations:

• A subscript $t$ or $t+1$ on a matrix (or a vector) denotes the time point $t$ or $t+1$ of the matrix (or vector).

• $\mathbf{f}_t^r$ and $\mathbf{f}_{t+1}^r$ denote the indicator vectors of feature $f^r$ in the network at time points $t$ and $t+1$, respectively, where $\mathbf{f}_t^r \in \mathbb{R}^{n_t \times 1}$ and $\mathbf{f}_{t+1}^r \in \mathbb{R}^{n_{t+1} \times 1}$. We further define $\mathbf{f}^{r\prime}_{t+1} \in \mathbb{R}^{n_t \times 1}$ as $[\mathbf{f}^{r\prime}_{t+1}]_i = [\mathbf{f}_{t+1}^r]_i$, where $1 \le i \le n_t$.

• $\Delta n$ denotes the number of newly arrived nodes (from time point $t$ to $t+1$).

• $W^o$ denotes the weight matrix, defined in Eq. (6.8), between the new nodes arriving at time point $t+1$ and the old nodes that already existed at time point $t$.

• $W^c$ denotes the changed weight matrix, from time point $t$ to $t+1$, between the old nodes that already existed at time point $t$.

• $W^{\Delta n}$ denotes the weight matrix between the new nodes that arrived at time point $t+1$.

So the weight matrix of the networked data at time point $t+1$ is $W_{t+1}$, and the updated part between $t$ and $t+1$ is $W^\Delta_{t+1}$, i.e.
\[
W_{t+1} =
\begin{bmatrix}
W_t + W^c & W^o \\
(W^o)^\top & W^{\Delta n}
\end{bmatrix}, \qquad
W^\Delta_{t+1} =
\begin{bmatrix}
W^c & W^o \\
(W^o)^\top & W^{\Delta n}
\end{bmatrix}.
\]
Then the F-Score of an old feature $f^r$ at time point $t+1$ is
\[
\begin{aligned}
q_{t+1}(f^r) &= \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} \big([\mathbf{f}_{t+1}^r]_i - [\mathbf{f}_{t+1}^r]_j\big)^2\, [W_{t+1}]_{ij} \\
&= \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} \big([\mathbf{f}_{t+1}^r]_i - [\mathbf{f}_{t+1}^r]_j\big)^2 \Bigg( \begin{bmatrix} W_t & 0 \\ 0 & 0 \end{bmatrix}_{ij} + [W^\Delta_{t+1}]_{ij} \Bigg) \\
&= (\mathbf{f}^{r\prime}_{t+1})^\top L_t\, \mathbf{f}^{r\prime}_{t+1} + \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} \big([\mathbf{f}_{t+1}^r]_i - [\mathbf{f}_{t+1}^r]_j\big)^2\, [W^\Delta_{t+1}]_{ij} \\
&= (\mathbf{f}_t^r)^\top L_t\, \mathbf{f}_t^r + (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r)^\top L_t\, (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r) + (\mathbf{f}_{t+1}^r)^\top L^\Delta_{t+1}\, \mathbf{f}_{t+1}^r
\end{aligned} \qquad (6.13)
\]
In Eq. (6.13), $q_{t+1}(f^r)$ contains three parts. The first term is $q_t(f^r)$, and the last two terms are the changed scores at $t+1$, which correspond to Part a) and Part b), respectively. Formally,
\[
q^\Delta_{t+1}(f^r) = (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r)^\top L_t\, (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r) + (\mathbf{f}_{t+1}^r)^\top L^\Delta_{t+1}\, \mathbf{f}_{t+1}^r \qquad (6.14)
\]
where $L^\Delta_{t+1}$ is the Laplacian matrix of $W^\Delta_{t+1}$. We calculate $W^\Delta_{t+1}$ by using the changed part of the network (including nodes and edges) as follows:
\[
W^\Delta_{t+1} = \xi\, W^{\Delta(L)}_{t+1} + (1 - \xi)\, W^{\Delta(S)}_{t+1} \qquad (6.15)
\]
$W^{\Delta(L)}_{t+1}$ and $W^{\Delta(S)}_{t+1}$ are used to calculate the changed parts of the label relationships and the structure relationships, respectively, from time point $t$ to $t+1$:
\[
[W^{\Delta(L)}_{t+1}]_{ij} =
\begin{cases}
[1,\; 1/c] \cdot [C_{ij},\; C_{ij} - 1]^\top, & \text{if } i \text{ or } j \in \Delta n; \\
0, & \text{otherwise,}
\end{cases}
\]
and
\[
W^{\Delta(S)}_{t+1} = \Theta_{t+1} - \Theta_t,
\]
where $[W^{\Delta(L)}_{t+1}]_{ij}$ denotes the incremental label similarity weight between nodes $i$ and $j$, and $[W^{\Delta(S)}_{t+1}]_{ij}$ is the incremental $l$-length path weight between nodes $i$ and $j$. Both of them are “incrementally” calculated by only using the changed parts of the streaming networks.
As a result, we can obtain the new score $q_{t+1}(f^r)$ by adding $q^\Delta_{t+1}(f^r)$ to $q_t(f^r)$, with the new scores of old features being used in the final feature selection process. When the streaming network changes with time, an old feature $f^r$'s changed score, $q^\Delta_{t+1}(f^r)$, can be incrementally calculated by using only the part of the network that changed at $t+1$ compared to time point $t$, which allows SNF to efficiently update feature scores for large-scale dynamic networks.
For streaming features with an infinite feature space, it is infeasible to keep all feature scores for future comparison. So SNF maintains a small feature set, called the candidate feature set, for future comparisons, i.e. $\mathcal{T} = \{f^1, f^2, \ldots, f^m, f^{m+1}, \ldots, f^k\}$, where $q(f^1) \le q(f^2) \le \cdots \le q(f^k)$ and $q(f^k) \le 2\,q(f^m)$. This setting ensures that the discarded features are very unlikely to be selected at the next time point. So SNF always keeps a candidate feature set with a dynamic size $k$ and discards less informative features. For all new features appearing in the new nodes, SNF calculates their feature scores in order to ensure that important new features can be discovered immediately after they emerge in the network.
Algorithm 5 SNF: Streaming network feature selection
Input: (1) the network at time points $t$ and $t+1$: $X_t$ and $X_{t+1}$; (2) the candidate feature set: $\mathcal{T}_t$; (3) the F-Score list of $\mathcal{T}_t$: $H_t$; (4) the size of the selected feature set: $m$; and (5) the new feature set $\mathcal{V}_{t+1}$.
Output: the selected feature set $\mathcal{S}_{t+1}$ and the candidate feature set $\mathcal{T}_{t+1}$.
1: Initialize the score list $H_{t+1}$ and generate the updated Laplacian matrix $L^\Delta_{t+1}$;
2: // calculate the F-Score for new features
3: for $f^r \in \mathcal{V}_{t+1}$ do
4:    $q(f^r) \leftarrow \mathbf{f}_r^\top L_{t+1}\, \mathbf{f}_r$
5:    $H_{t+1} \leftarrow q(f^r) \cup H_{t+1}$
6: end for
7: // update the F-Score for old features
8: for $f^r \in \mathcal{T}_t$ do
9:    $q^\Delta_{t+1}(f^r) \leftarrow (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r)^\top L_t\, (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r) + (\mathbf{f}_{t+1}^r)^\top L^\Delta_{t+1}\, \mathbf{f}_{t+1}^r$
10:   $q_t(f^r) \leftarrow H_t(f^r)$
11:   $q_{t+1}(f^r) \leftarrow q_t(f^r) + q^\Delta_{t+1}(f^r)$
12:   $H_{t+1} \leftarrow q_{t+1}(f^r) \cup H_{t+1}$
13: end for
14: Sort $H_{t+1}$ in ascending order
15: $\mathcal{S}_{t+1} \leftarrow$ top-$m$ of $H_{t+1}$
16: $\mathcal{T}_{t+1} \leftarrow$ top-$k$ of $H_{t+1}$, where $q(f^k) \le 2\,q(f^m)$
Algorithm 5 lists the detailed SNF algorithm, which incrementally compares scores of
new features and old features in T and selects top-m features to form the final feature set.
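A compact Python sketch of this update loop, assuming dense Laplacians, numpy indicator vectors, and dictionary-based score bookkeeping (all names are hypothetical and the data structures are simplified relative to the thesis):

def delta_score(f_t, f_t1, L_t, L_delta):
    # Changed score of an old feature, Eq. (6.14): the first term covers
    # value changes on old nodes (Part a), the second the changed
    # weights (Part b). All arguments are numpy arrays.
    diff = f_t1[: f_t.shape[0]] - f_t          # f^{r'}_{t+1} - f^r_t
    return float(diff @ L_t @ diff + f_t1 @ L_delta @ f_t1)

def snf_step(H_t, old_t, old_t1, new_t1, L_t, L_t1, L_delta, m):
    # One SNF step (a sketch of Algorithm 5): score new features from
    # scratch, update old candidate features incrementally, then keep
    # the selected set S (top-m) and the candidate set T with
    # q(f^k) <= 2 q(f^m).
    H = {f: float(v @ L_t1 @ v) for f, v in new_t1.items()}
    for f, v_t in old_t.items():
        H[f] = H_t[f] + delta_score(v_t, old_t1[f], L_t, L_delta)
    ranked = sorted(H, key=H.get)              # ascending F-Score
    S = ranked[:m]
    T = [f for f in ranked if H[f] <= 2 * H[S[-1]]]
    return S, T, H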
It is worth noting that the proposed SNF method can efficiently handle three types of
changes in streaming networks: (1) Feature distribution changes: for each feature $f^r$, if its distribution changes from $\mathbf{f}_t^r$ to $\mathbf{f}_{t+1}^r$, the first part of Eq. (6.14) is used to calculate its
changed score; (2) Node addition and structure changes: For new nodes and their associ-
ated edge connections, the second part of Eq. (6.14) will capture the topological structure
changes and the node addition information; and (3) Node deletion: For nodes that are re-
moved at t+1, we can set their feature indicators to 0 to indicate that the nodes have empty
node content, and then use (1) to update feature scores.
Time Complexity Analysis: the time complexity of computing $W_t$ is $O(n_t^3\, l)$, so the total time complexity of computing $q_t(f^r)$ in Algorithm 5 is $O(n_t^3\, l)$. In contrast, if the size of $W^\Delta_{t+1}$ is $n^\Delta_{t+1} \times n^\Delta_{t+1}$, the complexity of computing $q^\Delta_{t+1}(f^r)$ in Algorithm 5 is $O([n^\Delta_{t+1}]^3\, l)$. In most cases, $n^\Delta_{t+1} \ll n_t$, so our update method is very efficient in the streaming network setting.
6.3.2 Node Classification on Streaming Networks
Once the most informative $m$ features are identified at time point $t$ (denoted by $\mathcal{S}_t = \{f^1, f^2, \ldots, f^m\}$), the network nodes can be represented using the selected features as
\[
X_t = [x_1, x_2, \ldots, x_{n_t}] \;\Rightarrow\; X_{\mathcal{S}_t} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_m]^\top \in \mathbb{R}^{m \times n_t}.
\]
Node classification aims to provide accurate labels for the unlabeled nodes in the network at time point $t$. If an unlabeled node $u$ is correctly labeled, $u$ should be correctly positioned relative to the other nodes w.r.t. the label and structure similarity defined in Eq. (6.3). So the quality criterion of Eq. (6.3) for each unlabeled node $u$ at a given time point $t$ can be rewritten as follows:
\[
E(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (D_{\mathcal{S}_t} x_i - D_{\mathcal{S}_t} x_j)^2\, [W^{y_u}]_{ij} \qquad (6.16)
\]
where $y_u \in \mathcal{Y}$, and $W^{y_u}$ is the weight matrix generated from Eq. (6.8) by setting the label of $u$ to $y_u$. $D_{\mathcal{S}_t}$ is a diagonal matrix indicating the features that are selected from $\mathcal{F}$ into the feature set $\mathcal{S}_t$.
Because the quality criterion is only affected by the changed part of the weight matrix, and the changes in node labels only affect the label similarity part, we can define the changed weight matrix as:
\[
[W^{\Delta y_u}]_{ij} =
\begin{cases}
1, & \text{if } j = u, y_i = y_u \text{ or } i = u, y_j = y_u; \\
-\frac{1}{c}, & \text{if } j = u, y_i \ne y_u \text{ or } i = u, y_j \ne y_u; \\
0, & \text{if } j \ne u \text{ and } i \ne u.
\end{cases} \qquad (6.17)
\]
So the quality criterion in Eq. (6.16) can be replaced by
\[
E'(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (D_{\mathcal{S}_t} x_i - D_{\mathcal{S}_t} x_j)^2\, [W^{\Delta y_u}]_{ij} \qquad (6.18)
\]
Algorithm 6 SNOC: Streaming Network Node Classification
Input: (1) the network: $X_t$ and $X_{t-1}$; (2) the label list: $\mathcal{Y}_t$; (3) the candidate feature set at time point $t-1$: $\mathcal{T}_{t-1}$; (4) the F-Score list of $\mathcal{T}_{t-1}$: $H_{t-1}$; (5) the size of the selected feature set: $m$; and (6) the new feature set $\mathcal{V}_t$.
Output: the label list for unlabeled data: $\mathcal{Y}^u_t$.
1: $(\mathcal{S}_t, \mathcal{T}_t) \leftarrow \mathrm{SNF}(X_t, X_{t-1}, \mathcal{T}_{t-1}, H_{t-1}, \mathcal{V}_t, m)$
2: Map $X_t$ into $X_{\mathcal{S}_t}$ by using $\mathcal{S}_t$;
3: for each unlabeled node $u$ do
4:    $y_u^* = \arg\min_{y_u \in \mathcal{Y}} \big(\mathrm{tr}([X_{\mathcal{S}_t}]^\top L^{\Delta y_u} X_{\mathcal{S}_t})\big)$
5: end for
Then we can calculate $E'(y_u, \mathcal{S}_t)$ as
\[
\begin{aligned}
E'(y_u, \mathcal{S}_t) &= \frac{1}{2} \sum_{i,j=1}^{n_t} (D_{\mathcal{S}_t} x_i - D_{\mathcal{S}_t} x_j)^2\, [W^{\Delta y_u}]_{ij} \\
&= \mathrm{tr}\big(D_{\mathcal{S}_t}^\top X_t (D^{\Delta y_u} - W^{\Delta y_u}) X_t^\top D_{\mathcal{S}_t}\big) \\
&= \mathrm{tr}\big(D_{\mathcal{S}_t}^\top X_t L^{\Delta y_u} X_t^\top D_{\mathcal{S}_t}\big) \\
&= \mathrm{tr}\big([X_{\mathcal{S}_t}]^\top L^{\Delta y_u} X_{\mathcal{S}_t}\big)
\end{aligned} \qquad (6.19)
\]
where $\mathrm{tr}(\cdot)$ is the trace of a matrix, and $L^{\Delta y_u}$ is the Laplacian matrix of $W^{\Delta y_u}$.

So our target is to select a label for an unlabeled node $u$ that ensures:
\[
\min_{y_u \in \mathcal{Y}} E'(y_u, \mathcal{S}_t) \qquad (6.20)
\]
DEFINITION 23 (SNC) Let $X_{\mathcal{S}_t} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_m]^\top$ represent the mapped network nodes in the selected feature space. Suppose $W^{\Delta y_u}$ is the matrix defined in Eq. (6.17), and $L^{\Delta y_u}$ is the Laplacian matrix defined as $L^{\Delta y_u} = D^{\Delta y_u} - W^{\Delta y_u}$, where $D^{\Delta y_u}$ is a diagonal matrix with $[D^{\Delta y_u}]_{ii} = \sum_j [W^{\Delta y_u}]_{ij}$. We define a labeling criterion, called the streaming network criterion (SNC), for each unlabeled node $u$ as follows:
\[
y_u^* = \arg\min_{y_u \in \mathcal{Y}} \big(\mathrm{tr}([X_{\mathcal{S}_t}]^\top L^{\Delta y_u} X_{\mathcal{S}_t})\big) \qquad (6.21)
\]
Through the SNC criterion, Eq. (6.1) can be achieved by calculating $y_u^*$ for each single unlabeled node. Algorithm 6 lists the detailed process of the proposed streaming network
node classification (SNOC) method, which uses SNF to select a feature space to capture
network changes and then assigns different labels to each unlabeled node by using selected
features as the gauge. The class label of an unlabeled node is the one that results in the
minimal gauging value with respect to the selected features.
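A minimal sketch of this labeling step (Eqs. (6.17) and (6.21)), assuming the convention $y = 0$ for unlabeled nodes, dense numpy matrices, and $X_{\mathcal{S}_t}$ stored as an $m \times n_t$ array; the function names are hypothetical:

import numpy as np

def delta_weight(n, u, labels, y_u, c):
    # W^{Delta y_u} of Eq. (6.17): only entries involving u are nonzero;
    # +1 where a labeled node agrees with the candidate label y_u and
    # -1/c where it disagrees (unlabeled nodes are skipped here, an
    # assumption of this sketch).
    W = np.zeros((n, n))
    for j in range(n):
        if j == u or labels[j] == 0:
            continue
        w = 1.0 if labels[j] == y_u else -1.0 / c
        W[u, j] = W[j, u] = w
    return W

def snc_label(XS, u, labels, classes, c):
    # SNC criterion, Eq. (6.21): choose the label minimizing the trace
    # criterion; with XS stored as (m x n) we evaluate tr(XS L XS^T).
    best, best_val = None, np.inf
    for y in classes:
        W = delta_weight(XS.shape[1], u, labels, y, c)
        L = np.diag(W.sum(axis=1)) - W
        val = np.trace(XS @ L @ XS.T)
        if val < best_val:
            best, best_val = y, val
    return best

XS = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])  # m=2 features, n=3 nodes
print(snc_label(XS, u=2, labels=[1, 2, 0], classes=[1, 2], c=2))  # -> 2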
6.4 EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the efficiency and effective-
ness of SNOC for node classification in static and streaming networks.
6.4.1 Experimental Settings
We validate the performance of SNOC on the following four real-world networks.
Cora1 is a citation network with 2,708 publications (i.e. nodes) classified into one
of seven classes. The citation relationships are captured in 5,429 links. The node
content is described by a 0/1-valued word vector indicating the absence/presence of
the corresponding word from a dictionary of 1,433 unique words.
CiteSeer1 consists of 3,312 scientific publications classified into one of six classes.
The network consists of 4,732 links. Each publication in CiteSeer is described by a
0/1-valued word vector from a dictionary with 3,703 unique words.
PubMed Diabetes1 network consists of 19,717 publications (nodes) from PubMed
database pertaining to three types of diabetes. It has 44,338 links. Each paper is de-
scribed by a TF-IDF weighted word vector from a dictionary containing 500 unique
words.
DBLP2 network contains 2,084,055 papers and 2,244,018 citation relationships. We
separate papers into six classes: DataBases, Artificial Intelligence, Hardware and
Architecture, Applications and Media, System Technology, and others. Each paper
is denoted by a 0/1-valued word vector from a dictionary with 3,000 words.
1 http://linqs.cs.umd.edu/projects//projects/lbc/index.html
2 http://arnetminer.org/citation
To evaluate the performance of SNOC for streaming networks, we first test the algorithm performance on static networks by using three networks (Cora, CiteSeer and PubMed Diabetes). After that, we use the DBLP and PubMed Diabetes networks as our streaming network test bed (the CiteSeer and Cora networks are too small for testing in streaming network settings). DBLP is inherently a streaming network, because publications are continuously updated and the top keywords (node features) are also continuously changing with respect to time. For the DBLP network in the streaming network setting, we
choose 2,000 publications for each year and build a streaming network covering the time
period from 1991 to 2010. In addition, we also use PubMed Diabetes network to simulate
a streaming network with 1,000 random nodes to be included for each time point t (the
experiments include a total of 15 time points).
For most experiments, we randomly label 40% of the nodes in the network and use the remaining nodes as test data (this is a reasonable setting because real-world networks always have more unlabeled nodes than labeled ones). In addition, we also report the algorithm performance with respect to different percentages of training/test nodes (detailed in Fig. 6-6(c)). For the streaming network experiments, the accuracy is tested on the new nodes arriving at each time point. The default size of the selected feature set is m = 100, the default
value of the weight parameter is ξ = 0.7, and the default maximal path length in Eq. (6.6) is set to l = 3.
Baseline Methods: We compare the performance of SNOC with four baselines:
Information Gain+SVM (IG+SVM): This method ignores link structures in the network
and uses Information Gain (IG) to select the top-m features from all nodes (using content
information in the original bag-of-feature representation). LIBSVM [10] is used as the
learning algorithm to train classifiers for node classification.
Link Structure+SVM (LS+SVM): This method ignores label information of labeled nodes
and only uses structure similarity to construct the weight matrix (W) and then calculates
the feature score in a similar way as SNF. LIBSVM is also used as the learning algorithm
to train classifiers for node classification.
Collective Classification (GS+LR): This method refers to the combined classification of
interlinked objects including the correlation between node label and node content.

Figure 6-5: The accuracies on three real-world static networks w.r.t. different numbers of selected features (from 50 to 300).

In our
experiments, we use the collective classification method [61], which uses a simplified version
of Gibbs sampling (GS) as the approximate inference procedures for networked data, with
Logistic Regression (LR) being used as classifiers for node classification.
DYCOS: This is a recently proposed method that combines text content and links for net-
work node classification [2]. It is considered the state-of-the-art classification method in
streaming networks. A random walk approach in conjunction with the content of the net-
work is used for node classification. This results in a new approach to handling variations in content and linkage structures. The Gini index is used to select features in this method.
All experiments are conducted on a cluster machine with 16 GB RAM and an Intel Core i7 3.20 GHz CPU.
6.4.2 Performance on Static Networks
Table 6.1 reports the performance of different methods on three static networks (Cora, Cite-
Seer and PubMed Diabetes). The results show that SNOC outperforms the other four baseline
methods on all three networks with significant performance gain. This is mainly attributed
to SNOC’s integration of network topology structure and node labels to explore features
for node classification. Although DYCOS indeed considers network linkage information
and GS+LR considers the correlation between node labels and node content, they do not
take into account the impact of deep structure information in both the feature selection and classification processes. So their performance is inferior to SNOC. Noticeably, even though
DYCOS takes structure information into account, the actual contributions of label similarity and structure similarity have not been optimized in those methods to achieve the best feature selection results for networked data. This partially explains why the accuracies of DYCOS cannot match IG+SVM on the PubMed data set. In comparison, SNOC considers both labeled and unlabeled nodes, and combines node label similarity and node structure similarity to find effective features. All these designs help SNOC outperform all the other baseline methods.

Figure 6-6: The accuracy on three networks w.r.t. (a) different maximal lengths of path l (from 1 to 5), (b) different values of weight parameter ξ (from 0 to 1), and (c) different percentages of labeled nodes.
Table 6.1: Accuracy Results on Static Networks.

Data sets   Cora              CiteSeer          PubMed
IG+SVM      50.34% ± 1.42%    57.21% ± 1.59%    65.24% ± 1.33%
LS+SVM      27.37% ± 2.85%    39.64% ± 2.66%    43.06% ± 2.75%
DYCOS       53.57% ± 1.24%    64.38% ± 1.29%    64.53% ± 1.86%
GS+LR       55.17% ± 1.09%    65.93% ± 2.37%    72.88% ± 2.05%
SNOC        62.66% ± 1.57%    73.81% ± 1.46%    81.09% ± 2.37%
In Fig. 6-5, we report the algorithm performance with respect to different numbers of
selected features on three networks. Overall, SNOC achieves the highest accuracy gain on
all three networks with different feature sizes. LS+SVM has the lowest accuracies because
network structure alone provides very little useful information (compared to the node con-
tent) for node classification. SNOC and GS+LR have the highest accuracies on Cora and
CiteSeer networks when selecting m = 100 features, and on PubMed Diabetes data set
when m = 50, whereas DYCOS’s accuracies decrease with the increasing of the feature
126
size on all three data sets. The accuracies of all methods become close to each other with
the number of selected features continuously increase. This is because that including more
features may introduce interference and dilute the significance of important node features,
so the benefit of feature selection becomes less significant. Because SNOC balances the label and structure information in the feature space for node classification, it still outperforms the other baseline methods.
In Fig. 6-6(a), we report the accuracies w.r.t. different maximal path lengths used to calculate Eq. (6.6). The results show that the accuracies decrease if the path lengths are too long. This is because, even though the paths between two nodes are relevant to the node structure similarity, if the path length is too long, the similarity may be deteriorated by special paths such as cycles and become inaccurate in capturing the node similarity.
In Fig. 6-6(b), we report the algorithm performance w.r.t. the changing values of weight
parameter ξ. According to the definition in Eq. (6.7), ξ is used to balance the contribution
of network structures and node labels. The results from Fig. 6-6(b) show that node labels
play a more important role than network structures. For the Cora network, the accuracy reaches its peak at ξ = 0.6, while the highest accuracies appear at ξ = 0.7 for both the CiteSeer and PubMed data. This suggests that network structures and node labels have
different contributions to feature selection for networks from different domains. In order to
achieve the best performance, users may need to carefully choose a suitable weight value
for different networks.
In previous experiments, the percentage of labeled nodes is fixed to 40% of the net-
work. In reality, the percentage of labeled nodes in networks may vary significantly, so
in this subsection, we study the performance of all methods on networks with different
percentages of labeled nodes (due to page limitations, we only report the results on Cora
network).
The results in Fig. 6-6(c) show that when the number of labelled nodes in the network
increases, all methods achieve accuracy gains. After the majority of the network nodes
are labeled, all methods except LS+SVM achieve similar accuracies. This is because
labeled nodes provide sufficient content information for classification.

Figure 6-7: The accuracy on streaming networks: (a) accuracy on the DBLP citation network from 1991 to 2010, (b) accuracy on the PubMed Diabetes network for 15 time points, and (c) accuracy on the extended DBLP citation network from 1991 to 2010.

Interestingly, our results show that when the network contains a small percentage of labeled nodes, e.g. 30% or less, SNOC can achieve much more significant accuracy gains compared to other meth-
ods. This observation indicates that SNOC is more suitable for networks with very few
labeled nodes. This is mainly attributed to the fact that SNOC can integrate node labels and
network structures (which also include unlabeled nodes) to find the most effective features to characterize the network node content and topology information. In addition, the similarity gauging process also tries to find optimal node labels for unlabeled nodes to ensure that the distances evaluated in the feature space are consistent with the network structures.
6.4.3 Performance on Streaming Networks
Because only DYCOS and GS+LR are designed for classifying networked data, in the
following, we only compare SNOC with DYCOS and GS+LR on streaming networks.
In Figs. 6-7 (a) and (b), we report accuracies on DBLP and PubMed networks in a
streaming network setting. In addition, Fig. 6-8 further reports the runtime of different
methods. Because GS+LR is designed for static networks, it needs to be rerun at each time
point. Both DYCOS and SNOC can handle streaming networks.
For the DBLP network, the results show that the proposed SNOC outperforms all the other methods in the streaming network setting. An exception occurs in 1997, where the GS+LR method, which is far more time-consuming (as shown in Fig. 6-8), matches SNOC. This shows that a good balance between node content and network structure is very important for node classification. Although GS+LR emphasizes node content information and DYCOS emphasizes
on network structures, both of them, however, fail to capture the changes in streaming net-
works. Meanwhile, the runtime performance in Fig. 6-8 shows that DYCOS is as fast as
SNOC but its accuracy is inferior because DYCOS uses random walks to predict node labels. Because random walks are inherently uncertain and introduce much randomness into the classification process, the node classification results of DYCOS are inferior to
both SNOC and GS+LR. Meanwhile, as the time step t continuously increases, the runtime curve of DYCOS grows much more quickly than SNOC's. This is because SNOC only needs to
consider the changed part of the network for both node classification and feature selection.
Although GS+LR obtains better accuracies compared to DYCOS, it is much more time-
consuming compared to SNOC and DYCOS. This is mainly because GS+LR is an iterative
algorithm designed for static networks, so it needs to be rerun at each time step.
To validate the performance of different methods on streaming networks with all types of changes (including dynamically changing node features and the addition and deletion of nodes and
edges), we allow each node (i.e., paper) to include its reference’s title into the node content.
For example, if a paper pi is cited by a paper pj at a particular time point t, we will include
pj’s title into node pi’s content. By doing so, we can introduce dynamic changing features
to nodes in the network. In addition, we also continuously remove old papers in the network
to maintain papers published within a five-year period. This is will result in node/edge
deletion and feature removal for the whole network. All these settings result in a highly
complicated streaming network setting for node classification. We denote this network
as full streaming DBLP network, and report the results in Fig. 6-7(c). The results show
that SNOC clearly outperforms all other methods in this complicated network setting. More specifically, in year 1996, the accuracies of all methods deteriorate with significant drops. This is mainly because 1996 is the first time that old nodes are removed from the network. Our method achieves the smallest decline slope compared to the other two methods.
Interestingly, when comparing the results in Fig. 6-7(a) and Fig. 6-7(c), we can find
that the average accuracies of SNOC and GS+LR on networks containing all publications
(Fig. 6-7(a)) are higher than the accuracies on networks only containing publications with
a five-year span (Fig. 6-7(c)). Notice that the former contains a much higher node and edge
density in the network, so when the same sets of nodes are given for classification, the rich
topology structures in a dense network will help the algorithm improve the node classification
accuracies. For DYCOS, its average accuracy in Fig. 6-7(a) is 1.5% lower than the average
accuracy in Fig. 6-7(c). Notice that DYCOS uses random walks for node classification. For
dense networks, the random walks will contain many irrelevant paths, which deteriorate
the classification accuracy. As a result, its accuracy on the five-year-span networks is actually
better than the accuracy on the whole networks.
Figure 6-8: The cumulative runtime on the DBLP and PubMed Diabetes networks, corresponding to Fig. 6-7.
6.4.4 Case Study
In Fig. 6-9, we use a case study to demonstrate the performance of the three methods
(SNOC, DYCOS and GS+LR) in handling cases with abrupt network changes. In our experiments, from time points 1 to 3, the network only contains nodes from four classes
(Hardware and Architecture, Applications and Media, System Technology, and others).
From time points 4 to 6, nodes from a new class (DataBases) are included into the network
(including unlabeled nodes). From time points 7 to 9, new nodes from another new class
(Artificial Intelligence) are introduced into the network.
The results in Fig. 6-9 show that, due to the abrupt inclusion of new class nodes, the
accuracies of all methods decrease. When nodes from the new class continuously arrive,
SNOC’s accuracy can quickly recover, because SNF in SNOC can find ideal features to
represent changes in the network and use these features to adjust the node classification.
As a result, SNOC can adapt to the changes in the network for node classification.
Figure 6-9: Case study on DBLP citation network.
In summary, our experiments confirm that, for streaming networks, using features to
capture changes and further classifying nodes by assessing the consistency of node labels
and the network structures can provide effective and efficient solutions for node classifica-
tion.
Chapter 7
Conclusion
7.1 SUMMARY OF THIS THESIS
In this thesis, we have studied mining problems from the view of complex structure data,
where instances (nodes) are not only characterized by the content but are also subject to
dependency relationships. Moreover, with the fast development of information technology, much current real-world data is characterized by dynamic changes. Accordingly, this thesis explores instance correlation in complex structure data and utilizes it to
make mining tasks more accurate and applicable. Our objective is to combine node correla-
tion with node content and utilize them for three different tasks, including (1) graph stream
classification, (2) super-graph classification and clustering, and (3) streaming network node
classification.
More specifically, in Chapter 3, we presented an empirical study to reveal the roles
of sub-graph features for graph classification. Existing research has commonly agreed
that finding discriminative sub-graphs to represent graphs is one of the main challenges for
graph classification. Yet there is no comprehensive study of (1) the genuine relationship between sub-graphs and the classification accuracy, and (2) the actual difference between sub-graphs discovered from an expensive mining process (such as frequent sub-graph mining) and sub-graphs from simple approaches (such as random sub-graphs). In this thesis, we
empirically validated the relationship between sub-graphs and graph classifiers by vary-
ing (1) the sub-graph feature sizes; (2) the size of the sub-graph feature set; (3) different
learning algorithms; and (4) different benchmark datasets. We characterized sub-graphs
discovered from different approaches (including random sub-graphs, frequent sub-graphs,
and frequent sub-graphs selected from Information Gain) by their size (i.e. number of
edges) and by their number (i.e. number of sub-graphs in a set), and validated the perfor-
mance of classifiers trained from these sub-graphs on seven benchmark graph datasets from
three domains. Our study drew a number of important findings, which provide a clear view of the relationship between sub-graphs and graph classification.
In Chapter 4, we proposed to address graph stream classification. We argued that
in graph stream scenarios, the data volumes and the structures of the graphs may con-
stantly change. The existing sub-graph feature-based representation model is not only
inefficient but also ineffective for graph stream classification. This is because the min-
ing of the sub-graph features is time-consuming and the existing occurrence-based graph
representation model will result in significant information loss and will make sub-graph
features ineffective for represent graph data. To solve the problem, we proposed a graph
factorization-based fine-grained representation model, where the main objective is to use
linear combinations of a set of discriminative cliques to represent graphs for learning. The
optimization-oriented factorization approach ensures minimum information loss for graph
representation, and also avoids the expensive sub-graph isomorphism validation process.
Based on this idea, we proposed a novel framework for fast graph stream classification.
Experiments on two real-world graph streams validated the proposed design for effective
graph stream classification.
In Chapter 5, we first formulated a new super-graph classification problem. Due to the
inherent complexity of the structure representation, existing graph classification methods cannot be applied to super-graph classification. In this thesis, we proposed a weighted random
walk kernel which calculates the similarity between two super-graphs by assessing (a) the
similarity between super-nodes of the super-graphs, and (b) the common walks of the super-
graphs. Our key contribution is twofold: (1) a weighted random walk kernel considering
node and structure similarities between graphs; and (2) an effective kernel-based super-
graph classification method with sound theoretical basis.
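For reference, kernels of this family are commonly written over the direct product graph of the two inputs. The following LaTeX form is a generic sketch under that formulation; the exact weighting scheme of the kernel in Chapter 5 may differ:

    K(G_1, G_2) \;=\; \sum_{k=0}^{\infty} \lambda^{k} \, q^{\top} \bigl( S \odot W_{\times} \bigr)^{k} \, p

where W_× is the adjacency matrix of the direct product graph of G_1 and G_2, S applies the pairwise super-node similarities as entry-wise weights (⊙ denotes the Hadamard product), p and q are starting and stopping probability distributions over node pairs, and 0 < λ < 1 is a decay factor that guarantees convergence of the series.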
In Chapter 6, we proposed a novel node classification method for streaming networks. We argued that for networks with continuous changes in structure and node content, features are the most effective tool to capture such changes. Accordingly, we proposed to take both the network topology structure and the node labels into consideration in order to find an optimal subset of features to represent the network. Based on the selected features, a streaming network node classification method, SNOC, is proposed to classify unlabeled nodes by minimizing the combination of the similarity distance in the network and the feature-based distance between nodes. Experiments and comparisons demonstrate that SNOC is able to capture emerging changes and outperforms baseline approaches with significant performance gains. The key innovation of this work compared to existing methods is twofold: (1) a new node classification method for handling streaming networks; and (2) a streaming feature selection method for networked data, in which the selected features capture both node content and network structure dependency.
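A highly simplified analogue of the classification rule can be sketched in Python as follows: an unlabeled node receives the label of the labeled node that minimizes a weighted sum of its network distance and its feature-based distance over the selected feature subset. SNOC itself solves a joint optimization, so the nearest-neighbor form, the trade-off parameter alpha, and the input layout here are illustrative assumptions only.

    import numpy as np

    def classify_node(u, labeled_nodes, labels, net_dist, feats, alpha=0.5):
        # net_dist[u, v]: structural distance in the network (e.g. shortest path);
        # feats[v]: node v restricted to the selected feature subset.
        def combined_distance(v):
            return (alpha * net_dist[u, v]
                    + (1.0 - alpha) * np.linalg.norm(feats[u] - feats[v]))
        nearest = min(labeled_nodes, key=combined_distance)
        return labels[nearest]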
7.2 FUTURE WORK
Even though we have proposed several mining methods for complex structure data, applying mining algorithms to very large-scale problems and dynamic settings still poses challenges: (1) how to find an effective representation for high-dimensional, large-scale complex structure data so that it fits in memory; (2) how to capture dynamic changes in complex structure data so as to adjust the mining results; and (3) how to balance manifold information to achieve effective performance. To the best of our knowledge, no existing work combines large-scale learning with dynamic learning. In our future work, we will focus on designing mining algorithms and approaches that are faster, more data-efficient, and less demanding in computational resources, in order to achieve scalable algorithms for large-scale problems.
More specifically, we will extend the super-graph classification method to the clustering problem on huge super-graphs. To discover clusters from a large network/graph, existing methods follow three common approaches: (1) structure-based clustering, using node connectivity only [54, 79, 64]; (2) attribute-based clustering, using node attribute similarity [73, 86]; and (3) structural and attribute clustering, combining node attribute and node connectivity similarities [14, 80, 89]. All the above methods are, however, inapplicable to super-graph clustering, mainly because they cannot take the internal structures inside super-nodes into account. Indeed, super-nodes may share overlapping/intersecting structures that help assess the similarity between super-nodes, and the inter-connected structures between super-nodes also provide useful structural information for clustering. The structure dependencies within and between super-nodes therefore require a clustering algorithm that considers both the internal and the external structures of nodes for super-graph clustering.
To cluster a super-graph, the main challenge is to properly calculate the similarities between super-nodes by considering both the internal structure inside each super-node and the inter-connectivity between super-nodes. The complex super-node structure, in which each node is itself another graph, makes this a very challenging problem. These challenges motivate our research into combining structural and content similarity between super-nodes for clustering, as sketched below.
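One plausible way to blend the two signals, offered here as an assumption rather than the method we will ultimately develop, is a weighted combination of internal structural overlap and super-node content similarity:

    import numpy as np

    def supernode_similarity(edges1, edges2, attrs1, attrs2, beta=0.5):
        # Structural signal: Jaccard overlap of the two internal edge sets,
        # a simple stand-in for shared/intersected internal sub-structures.
        union = len(edges1 | edges2)
        s_struct = len(edges1 & edges2) / union if union else 0.0
        # Content signal: cosine similarity of super-node attribute vectors.
        a1, a2 = np.asarray(attrs1, float), np.asarray(attrs2, float)
        denom = np.linalg.norm(a1) * np.linalg.norm(a2)
        s_content = float(a1 @ a2) / denom if denom else 0.0
        # beta balances structure against content.
        return beta * s_struct + (1.0 - beta) * s_content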
In the future, we will also focus on finding cluster structures for networked instances and discovering representative features for each cluster. This constitutes a special co-clustering task that is useful for many real-world applications, such as the automatic categorization of scientific publications and the identification of representative keywords for each cluster. To date, although co-clustering has been commonly used to find clusters of both instances and features, all existing methods focus on instance-feature relationships, without realizing that the topology structures between instances provide very valuable information to help boost co-clustering performance. We will try to propose a co-clustering method that ensures the final cluster structures are consistent, with minimum error, across all three information sources: instances, features, and the network topology.
Bibliography
[1] Charu C. Aggarwal. On classification of graph streams. In Proceedings of the Eleventh SIAM International Conference on Data Mining (SDM’11), pages 652–663, 2011.
[2] Charu C. Aggarwal and Nan Li. On node classification in dynamic content-based
networks. In SDM, pages 355–366, 2011.
[3] Charu C. Aggarwal and Haixun Wang. Managing and mining graph data. Springer,
2010.
[4] Ralitsa Angelova and Gerhard Weikum. Graph-based text classification: learn from
your neighbors. In ACM SIGIR, pages 485–492, 2006.
[5] Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan,
Alex J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels.
Bioinformatics, 21(1):47–56, 2005.
[6] Horst Bunke. Recent developments in graph matching. In Proceedings of the 15th International Conference on Pattern Recognition, pages 117–124, 2000.
[7] Horst Bunke and Kaspar Riesen. Graph classification based on dissimilarity space
embedding. In Structural, Syntactic, and Statistical Pattern Recognition, 2008.
[8] Yandong Cai, Nick Cercone, and Jiawei Han. An attribute-oriented approach for
learning classification rules from relational databases. In ICDE, pages 281–288, 1990.
[9] Jérôme Callut, Kevin Françoisse, Marco Saerens, and Pierre Dupont. Classification
in graphs using discriminative random walks. In MLG, 2008.
[10] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
TIST, 2(3), 2011.
[11] Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, and Jianyong Wang.
Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 323–334, 2002.
[12] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent
pattern analysis for effective classification. In ICDE, 2007.
137
[13] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent
pattern-based graph classification. In Link Mining: Models, Algorithms, and Applications, 2010.
[14] Hong Cheng, Yang Zhou, and Jeffrey Xu Yu. Clustering large attributed graphs: A
balance between structural and attribute similarities. ACM TKDD, 5(2), 2011.
[15] Lianhua Chi, Bin Li, and Xingquan Zhu. Fast graph stream classification using dis-
criminative clique hashing. In Advances in Knowledge Discovery and Data Mining (PAKDD’13), pages 225–236, 2013.
[16] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
1995.
[17] Andrew D. J. Cross, Richard C. Wilson, and Edwin R. Hancock. Inexact graph match-
ing using genetic search. Pattern Recognition, 30(6):953–970, 1997.
[18] Thiago Henrique Cupertino and Liang Zhao. Bias-guided random walk for network-
based data classification. In Advances in Neural Networks, pages 375–384, 2013.
[19] Hans Dietmar Gröger. On the randomized complexity of monotone graph properties.
Acta Cybernetica, 10(3):119–127, 1992.
[20] Paul D. Dobson and Andrew J. Doig. Distinguishing enzyme structures from non-
enzymes without alignments. Journal of molecular biology, 2003.
[21] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’00), pages 71–80, 2000.
[22] Ehud Gudes, Solomon Eyal Shimony, and Natalia Vanetik. Discovering frequent graph patterns using disjoint paths. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[23] Terrence S. Furey, Nello Cristianini, Nigel Duffy, David W. Bednarski, Michèl
Schummer, and David Haussler. Support vector machine classification and vali-
dation of cancer tissue samples using microarray expression data. Bioinformatics,
16(10):906–914, 2000.
[24] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit dis-
tance. Pattern Analysis and applications, 13(1):113–129, 2010.
[25] Thomas Gärtner, Peter A. Flach, and Stefan Wrobel. On graph kernels: Hardness
results and efficient alternatives. In COLT, pages 129–143, 2003.
[26] Ray Reagans and Bill McEvily. Network structure and knowledge transfer: The effects of cohesion and range. Administrative Science Quarterly, 48(2):240–267, 2003.
[27] Gregory Gutin, Anders Yeo, and Alexey Zverovich. Traveling salesman should not be
greedy: Domination analysis of greedy-type heuristics for the TSP. Discrete Applied Mathematics, 2002.
[28] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: Cur-
rent status and future directions. Data Mining and Knowledge Discovery, 2007.
[29] Johan Himberg, Kalle Korpiaho, Heikki Mannila, and Johanna Tikanmäki. Time
series segmentation for context recognition in mobile devices. In Proceedings of the IEEE International Conference on Data Mining, pages 203–210, 2001.
[30] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. Marginalized kernels between
labeled graphs. In ICML, 2003.
[31] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for pre-
dictive graph mining. In ACM SIGKDD, 2004.
[32] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the
presence of isomorphism. In ICDM, 2003.
[33] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm
for mining frequent substructures from graph data. In PKDD, 2000.
[34] David Jensen, Jennifer Neville, and Brian Gallagher. Why collective inference im-
proves relational classification. In SIGKDD, 2004.
[35] Ning Jin, Calvin Young, and Wei Wang. GAIA: Graph classification using evolution-
ary computation. In ACM SIGKDD, 2010.
[36] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[37] Hisashi Kashima and Akihiro Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, volume 2002, 2002.
[38] Kaspar Riesen and Horst Bunke. Cluster ensembles based on vector space embed-
dings of graphs. Multiple Classifier Systems, 2009.
[39] Kaspar Riesen and Horst Bunke. Graph classification by means of Lipschitz embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2009.
[40] Kaspar Riesen and Horst Bunke. Graph Classification and Clustering Based on Vector Space Embedding. World Scientific Publishing Co., Inc., 2010.
[41] Xiangnan Kong, Wei Fan, and Philip S. Yu. Dual active feature and sample selection
for graph classification. In ACM SIGKDD, 2011.
[42] Xiangnan Kong and Philip S. Yu. Semi-supervised feature selection for graph clas-
sification. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 793–802, 2010.
[43] Taku Kudo, Eisaku Maeda, and Yuji Matsumoto. An application of boosting to graph
classification. Advances in neural information processing systems, 2004.
[44] Solomon Kullback. Information theory and statistics. Courier Dover Publications,
1968.
[45] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In ICDM,
2001.
[46] Bin Li, Xingquan Zhu, Lianhua Chi, and Chengqi Zhang. Nested subtree hash kernels
for large-scale graph classification over streams. In 12th International Conference on Data Mining (ICDM’12), pages 399–408, 2012.
[47] Geng Li, Murat Semerci, Bülent Yener, and Mohammed J. Zaki. Graph classification
via topological and label attributes. In 9th Workshop on Mining and Learning with Graphs (with SIGKDD), 2011.
[48] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic represen-
tation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 2–11, 2003.
[49] Qing Lu and Lise Getoor. Link-based classification. In ICML, pages 496–503, 2003.
[50] Xiangfeng Luo, Zheng Xu, Jie Yu, and Xue Chen. Building association link network
for semantic link on web resources. IEEE Transactions on Automation Science and Engineering, 8(3):482–494, 2011.
[51] J. McAuley and J. Leskovec. Image labelling on a network: using social-network
metadata for image classification. In ECCV, volume 4, pages 828–841, 2012.
[52] Kenrick Mock. An experimental framework for email categorization and manage-
ment. In SIGIR, pages 392–393, 2001.
[53] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh. Whole-
proteome prediction of protein function via graph-theoretic analysis of interaction
maps. Bioinformatics, 21(Suppl.1):i302–i310, 2005.
[54] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in
networks. Phys. Rev. E, 69(2):026113, 2004.
[55] H. J. Nussbaumer. Fast Fourier transform and convolution algorithms. Springer Series in Information Sciences, 2, 1982.
[56] Sayan Ranu and Ambuj K. Singh. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In Proceedings of the 2009 International Conference on Data Engineering, pages 844–855, Shanghai, China, March 2009.
[57] John W. Raymond, Eleanor J. Gardiner, and Peter Willett. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6):3–35, 2002.
[58] Hiroto Saigo, Nicole Krämer, and Koji Tsuda. Partial least squares regression for
graph mining. In Proceedings of the 14th ACM SIGKDD international conference onKnowledge discovery and data mining, pages 578–586, 2008.
[59] Hiroto Saigo, Sebastian Nowozin, Tadashi Kadowaki, Taku Kudo, and Koji Tsuda.
gBoost: a mathematical programming approach to graph classification and regression.
Machine Learning, 75(1):69–89, 2009.
[60] Bernhard Schölkopf and Alexander J. Smola. Learning with kernels. MIT Press,
2002.
[61] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, and Lise Getoor. Collective classifi-
cation in network data. In Encyclopedia of Machine Learning, 2010.
[62] D. Serre. Matrices: Theory and Applications. Springer, New York, 2002.
[63] Nino Shervashidze and Karsten M. Borgwardt. Fast subtree kernels on graphs. Advances in Neural Information Processing Systems, 22:1660–1668, 2009.
[64] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[65] Aaron Smalter, Jun Huan, Y. Jia, and Gerald Lushington. GPD: a graph pattern dif-
fusion kernel for accurate graph classification with applications in cheminformatics.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(2):197–
207, 2010.
[66] R. R. Sokal and F. J. Rohlf. Biometry: the principles and practice of statistics in biological research. W.H. Freeman & Co Ltd, New York, 1981.
[67] Srinath Srinivasa and Sujit Kumar. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proceedings of the 29th international conference on Very large data bases, volume 29, pages 975–986, 2003.
[68] Jacopo Staiano, Bruno Lepri, Nadav Aharony, Fabio Pianesi, Nicu Sebe, and Alex
Pentland. Friends don't lie: inferring personality traits from social network structure.
In Ubicomp, pages 321–330, 2012.
[69] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’01), pages 377–382,
2001.
[70] Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Detecting spammers
on social networks. In ACSAC, pages 1–9, 2010.
[71] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in
dynamic multi-mode networks. In ACM SIGKDD, 2008.
[72] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in
dynamic multi-mode networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’08), pages 677–
685, 2008.
[73] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for
graph summarisation. In ACM SIGMOD, pages 567–580, 2008.
[74] Roger Ming Hieng Ting and James Bailey. Mining Minimal Contrast Subgraph Pat-terns. University of Melbourne, Department of Computer Science and Software En-
gineering, 2007.
[75] N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from
semistructured data. In Proceedings of the 2002 IEEE International Conference on Data Mining, pages 458–465, 2003.
[76] Joshua T. Vogelstein, William R. Gray, R. Jacob Vogelstein, and Carey E. Priebe.
Graph classification using signal-subgraphs: Applications in statistical connectomics.
2011.
[77] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data
streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’03), pages
226–235, 2003.
[78] Xindong Wu, Kui Yu, Wei Ding, Hao Wang, and Xingquan Zhu. Online feature se-
lection with streaming features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1178–1192, 2013.
[79] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. SCAN: a
structural clustering algorithm for networks. In KDD, pages 824–833, 2007.
[80] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based
approach to attributed graph clustering. In Proc. of ACM SIGMOD, pages 505–516,
2012.
[81] Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu. Mining significant graph
patterns by leap search. In ACM SIGMOD, 2008.
[82] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In
ICDM, 2003.
[83] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based
approach. In ACM SIGMOD, 2004.
[84] Xifeng Yan, Philip S. Yu, and Jiawei Han. Substructure similarity search in graph
databases. In ACM SIGMOD, 2005.
[85] Eng-Hui Yap, Tyler Rosche, Steve Almo, and Andras Fiser. Functional clustering
of immunoglobulin superfamily proteins with protein-protein interaction information
calibrated hidden Markov model sequence profiles. 426:945–961, 2013.
[86] Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Cross-relational clustering with user’s
guidance. In ACM SIGKDD, pages 344–353, 2005.
[87] Yuchen Zhao, Xiangnan Kong, and Philip S. Yu. Positive and unlabeled learning for
graph classification. In IEEE 11th International Conference on Data Mining, pages
962–971, 2011.
[88] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis.
In SDM, pages 641–646, 2007.
[89] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on struc-
tural/attribute similarities. In Proceedings of the VLDB Endowment, volume 2, pages 718–729, 2009.
[90] Yunyue Zhu and Dennis Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th international conference on Very Large Data Bases, pages 358–369, 2002.