Real-Time Analytics for Complex Structure Data
Ting Guo
A Thesis submitted for the degree of Doctor of Philosophy
Faculty of Engineering and Information Technology
University of Technology, Sydney
2015
Certificate of Authorship and Originality
I certify that the work in this thesis has not previously been submitted for a degree nor
has it been submitted as part of requirements for a degree except as fully acknowledged
within the text.
I also certify that the thesis has been written by me. Any help that I have received in my
research work and the preparation of the thesis itself has been acknowledged. In addition,
I certify that all information sources and literature used are indicated in the thesis.
Signature of Student:
Date:
Acknowledgments
On having completed this thesis, I am especially thankful to my supervisor Prof. Chengqi
Zhang and co-supervisor Prof. Xingquan Zhu, who led me into a once-unfamiliar area of
academic research, trusted me, and gave me as much freedom as possible to pursue my own
research interests. Prof. Zhu has taught me how to think and study independently and how
to solve difficult scientific problems in flexible but rigorous ways. He has sacrificed much
of his precious time to develop my academic research skills. When I felt lost and anxious
about my future, he always gave me the confidence and motivation to keep going and strive
to do better. Prof. Zhang has also given me great help and support in life.
I am thankful to the group members I met at the University of Technology, Sydney,
including Shirui Pan, Lianhua Chi, Jia Wu, and many others. I learned a lot from these
smart people, and I was always inspired by the interesting and in-depth discussions with
them. I enjoyed the wonderful atmosphere we shared, in both academic research and daily
life.
I am incredibly grateful to my mother and father for their generosity and encourage-
ment. This thesis could not have been completed without their constant support and under-
standing. I am also thankful to my friends who have accompanied me, though not always
at my side, through the arduous journey of these three years.
Contents
1 Introduction 19
1.1 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2 CONTRIBUTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3 PUBLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.4 THESIS STRUCTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Literature Review 31
2.1 PRELIMINARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 GRAPH CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 FREQUENT SUB-GRAPH MINING (FSM) . . . . . . . . . . . . . . . . 34
2.4 SUB-GRAPH FEATURE SELECTION . . . . . . . . . . . . . . . . . . 35
2.5 DATA STREAM MINING . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 REAL-TIME ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7 ROADMAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Understanding the Roles of Sub-graph Features for Graph Classification: An
Empirical Study Perspective 39
3.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 PROBLEM FORMULATION . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Graph and Sub-graph . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Frequent Sub-graph Mining . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 Graph Classification . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 EXPERIMENTAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Benchmark Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Sub-graph Features . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 50
4 Graph Hashing and Factorization for Fast Graph Stream Classification 59
4.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 PROBLEM DEFINITION . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 GRAPH FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Factorization Model . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 FAST GRAPH STREAM CLASSIFICATION . . . . . . . . . . . . . . . . 68
4.4.1 Overall Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.2 Graph Clique Mining . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Clique Set Matrix and Graph Factorization . . . . . . . . . . . . . 72
Discriminative Frequent Cliques . . . . . . . . . . . . . . . . . . . 72
Feature Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.4 Graph Stream Classification . . . . . . . . . . . . . . . . . . . . . 75
4.5 EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.1 Benchmark Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 79
Graph Stream Classification Accuracy . . . . . . . . . . . . . . . . 79
Graph Stream Classification Efficiency . . . . . . . . . . . . . . . . 83
5 Super-graph based Classification 85
5.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 PROBLEM DEFINITION . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 OVERALL FRAMEWORK . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 WEIGHTED RANDOM WALK KERNEL . . . . . . . . . . . . . . . . . 89
5.4.1 Kernel on Single-attribute Graphs . . . . . . . . . . . . . . . . . . 91
5.4.2 Kernel on Super-Graphs . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 SUPER-GRAPH CLASSIFICATION . . . . . . . . . . . . . . . . . . . . 96
5.6 THEORETICAL STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.7 EXPERIMENTS AND ANALYSIS . . . . . . . . . . . . . . . . . . . . . 98
5.7.1 Benchmark Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 Streaming Network Node Classification 105
6.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 PROBLEM DEFINITION AND FRAMEWORK . . . . . . . . . . . . . . 109
6.3 THE PROPOSED METHOD . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.1 Streaming Network Feature Selection . . . . . . . . . . . . . . . . 113
Feature Selection on a Static Network . . . . . . . . . . . . . . . . 113
Feature Selection on Streaming Networks . . . . . . . . . . . . . . 117
6.3.2 Node Classification on Streaming Networks . . . . . . . . . . . . . 121
6.4 EXPERIMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.2 Performance on Static Networks . . . . . . . . . . . . . . . . . . . 125
6.4.3 Performance on Streaming Networks . . . . . . . . . . . . . . . . 128
6.4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7 Conclusion 133
7.1 SUMMARY OF THIS THESIS . . . . . . . . . . . . . . . . . . . . . . . 133
7.2 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
List of Figures
2-1 Graph examples collected from different domains. . . . . . . . . . . . . . . 33
2-2 The overall roadmap of this thesis. . . . . . . . . . . . . . . . . . . . . . . 38
3-1 An example of sub-graph pattern representation. The left panel shows two
graphs, G1 and G2, and the right panel gives two indicator vectors indicating
whether a sub-graph exists in each graph. . . . . . . . . . . . . . . . . . . 40
3-2 The runtime of frequent sub-graph pattern mining with respect to the in-
creasing number of edges of sub-graphs. . . . . . . . . . . . . . . . . . . . 42
3-3 A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a). . . . . 44
3-4 An example of graph isomorphism. . . . . . . . . . . . . . . . . . . . . . . 45
3-5 Graph representation for a paper (ID17890) in DBLP. The node in red is the
main paper, nodes in black ellipses are citations, and nodes in black boxes
are keywords. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-6 Classification accuracy on five NCI chemical compound datasets with re-
spect to different sizes of sub-graph features (using Support Vector Ma-
chines: Lib-SVM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3-7 Classification accuracy on D&D protein dataset and DBLP citation dataset
with respect to different sizes of sub-graph features (using Support Vector
Machines: Lib-SVM). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3-8 Classification accuracy on one NCI chemical compound dataset, D&D pro-
tein dataset, and DBLP citation dataset with respect to different sizes of
sub-graph features (using Nearest Neighbours: NN). . . . . . . . . . . . . 54
4-1 Coarse-grained vs. fine-grained representation. . . . . . . . . . . . . . . . 60
4-2 An example of graph factorization. . . . . . . . . . . . . . . . . . . . . . . 64
4-3 The framework of FGSC for graph stream classification. . . . . . . . . . . 69
4-4 An example of clique mining in a compressed graph. . . . . . . . . . . . . 71
4-5 An example of “in-memory” Clique-class table Γ. . . . . . . . . . . . . . . 74
4-6 Graph representation for a paper (ID17890) in DBLP. . . . . . . . . . . . . 77
4-7 Accuracy w.r.t. different chunk sizes on DBLP Stream. The number of
features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800;
(c) 600. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4-8 Accuracy w.r.t. different numbers of features on DBLP Stream with each
chunk containing 1000 graphs. The number of features selected in each
chunk is: (a) 307; (b) 142; (c) 62. . . . . . . . . . . . . . . . . . . . . . 80
4-9 Accuracy w.r.t. different classification methods on DBLP Stream with each
chunk containing 1000 graphs, and the number of features in each chunk
is 142. The classification methods are: (a) NN; (b) SMO; (c) NaiveBayes. . 80
4-10 Accuracy w.r.t. different chunk sizes on IBM Stream. The number of fea-
tures in each chunk is 75. The batch sizes vary as: (a) 500; (b) 400; (c)
300. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4-11 Accuracy w.r.t. different numbers of features on IBM Stream with each
chunk containing 400 graphs. The number of features selected in each
chunk is: (a) 148; (b) 75; (c) 43. . . . . . . . . . . . . . . . . . . . . . . 81
4-12 Accuracy w.r.t. different classification methods on IBM Stream with each
chunk containing 400 graphs, and the number of features in each chunk
is 75. The classification methods are: (a) NN; (b) SMO; (c) NaiveBayes. . . 81
4-13 Accumulated system runtime using the NN classifier, where |D| = 1000,
|m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively.
(a) Results on DBLP stream; (b) Results on IBM stream. . . . . . . . . . . 83
5-1 (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph. 86
5-2 A conceptual view of a protein interaction network using super-graph rep-
resentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5-3 WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1,
g2, g3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5-4 An example of using super-graph representation for scientific publications. . 99
5-5 Super-graph and comparison graph representations. . . . . . . . . . . . . . 100
5-6 Classification accuracy on DBLP and Beer Review datasets w.r.t. different
classification methods (NB, DT, SVM, and NN). . . . . . . . . . . . . . . 102
5-7 Classification accuracy on Beer Review dataset w.r.t. different datasets and
classification methods (NB, DT, SVM, and NN). . . . . . . . . . . . . . . 103
5-8 The performance w.r.t. different edge-cutting thresholds on DBLP and Beer
Review datasets by using WRWK method. . . . . . . . . . . . . . . . . . 104
6-1 An example of streaming networks, where each color bar denotes a feature. 106
6-2 An example of using feature selection to capture changes in a streaming
network (keywords inside each node denote node content). . . . . . . . . . 108
6-3 The framework of the proposed streaming network node classification (SNOC)
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6-4 An example of using feature selection to capture structure similarity. . . . . 115
6-5 The accuracies on three real-world static networks w.r.t. different numbers
of selected features (from 50 to 300). . . . . . . . . . . . . . . . . . . . . . 125
6-6 The accuracy on three networks w.r.t. (a) different maximal lengths of path
l (from 1 to 5), (b) different values of weight parameter ξ (from 0 to 1), and
(c) different percentages of labeled nodes. . . . . . . . . . . . . . . . . . . 126
6-7 The accuracy on streaming networks: (a) accuracy on DBLP citation net-
work from 1991 to 2010, (b) accuracy on PubMed Diabetes network for
15 time points, and (c) accuracy on extended DBLP citation network from
1991 to 2010. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6-8 The cumulative runtime on DBLP and PubMed Diabetes networks corre-
sponding to Fig. 6-5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6-9 Case study on DBLP citation network. . . . . . . . . . . . . . . . . . . . . 131
List of Tables
3.1 Comparison of the advantages and disadvantages of vector representation
vs. graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 NCI datasets used in experiments . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 DBLP dataset used in experiments . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Number of sub-graphs with respect to different sizes (i.e. number of edges) 49
4.1 DBLP dataset used in experiments. . . . . . . . . . . . . . . . . . . . . . . 76
6.1 Accuracy Results on Static Network. . . . . . . . . . . . . . . . . . . . . . 126
Abstract
The advancement of data acquisition and analysis technology has resulted in many real-
world data being dynamic and containing rich content and structured information. More
specifically, with the fast development of information technology, many real-world data
are featured with dynamic changes, such as new instances, new nodes and edges, and
modifications to the node content. Different from traditional data, which are represented
as feature vectors, data with complex relationships are often represented as graphs to denote
the content of the data entries and their structural relationships, where instances (nodes)
are not only characterized by their content but are also subject to dependency relationships.
In addition, real-time availability is one of the outstanding features of today's data. Real-
time analytics is dynamic analysis and reporting based on data entered into a system before
the actual time of use; it emphasizes deriving immediate knowledge from dynamic data
sources, such as data streams, so knowledge discovery and pattern mining now face complex,
dynamic data sources. However, how to combine structure information and node content
information for accurate and real-time data mining is still a big challenge. Accordingly,
this thesis focuses on real-time analytics for complex structure data. We explore instance
correlation in complex structure data and utilize it to make mining tasks more accurate
and applicable. To be specific, our objective is to combine node correlation with node
content and utilize them for three different tasks, including (1) graph stream classification,
(2) super-graph classification and clustering, and (3) streaming network node classification.
Understanding the roles of structured patterns for graph classification: the thesis first
reviews existing works on data mining from a complex-structure perspective. Then we
propose a graph factorization-based fine-grained representation model, whose main objec-
tive is to use linear combinations of a set of discriminative cliques to represent graphs
for learning. The optimization-oriented factorization approach ensures minimum informa-
tion loss for graph representation, and also avoids the expensive sub-graph isomorphism
validation process. Based on this idea, we propose a novel framework for fast graph stream
classification.
A new structure data classification algorithm: the second method introduces a new
super-graph classification and clustering problem. Due to the inherent complexity of the
structure representation, existing graph classification methods cannot be applied to super-
graph classification. In this thesis, we propose a weighted random walk kernel which cal-
culates the similarity between two super-graphs by assessing (a) the similarity between
super-nodes of the super-graphs, and (b) the common walks of the super-graphs. Our key
contributions are: (1) a new super-node and super-graph structure to enrich existing graph
representations for real-world applications; (2) a weighted random walk kernel considering
node and structure similarities between graphs; (3) a mixed similarity considering struc-
tured content inside super-nodes and structural dependency between super-nodes; and (4)
an effective kernel-based super-graph classification method with a sound theoretical basis.
Empirical studies show that the proposed methods significantly outperform the state-of-
the-art methods.
Real-time analytics framework for dynamic complex structure data: for streaming net-
works, the essential challenge is to properly capture the dynamic evolution of the node
content and node interactions in order to support node classification. While streaming net-
works are dynamically evolving, for a short temporal period, a subset of salient features are
essentially tied to the network content and structures, and therefore can be used to charac-
terize the network for classification. To achieve this goal, we propose to carry out streaming
network feature selection (SNF) on the network, and use the selected features as a gauge to
classify unlabeled nodes. A Laplacian-based quality criterion is proposed to guide the node
classification, where the Laplacian matrix is generated from node labels and network
topology structures. Node classification is achieved by finding the class label that results in
the minimal gauging value with respect to the selected features. By frequently updating the
features selected from the network, node classification can quickly adapt to changes in
the network for maximal performance gain. Experiments and comparisons on real-world
networks demonstrate that SNOC is able to capture dynamics in the network structures and
node content, and outperforms baseline approaches with significant performance gain.
Chapter 1
Introduction
1.1 MOTIVATION
The advancement of data acquisition and analysis technology has resulted in many applica-
tions involving complex structure data; examples include Cheminformatics [65], Bioinfor-
matics [53], and Social Network Analysis (e.g. DBLP) [3]. Different from traditional data,
which are represented as feature vectors, data with structure relationships are often repre-
sented as graphs to preserve the content of the data entries and their relationships, where
instances (nodes) are not only characterized by the content but are also subject to depen-
dency relationships. For example, each node in a social network can denote one person and
links between nodes can represent their social interactions.
In reality, changes are essential components in real-world structure data, mainly be-
cause user participation, interactions, and responses to external factors continuously intro-
duce new nodes and edges to the data. In addition, a user may add/delete/modify online
posts, which naturally result in changes in the node content. As a result, data are inherently
dynamic. These dynamic changes may significantly influence the mining results. For ex-
ample, changes in users’ interest fields may result in the changes of their social groups or
friend circles. So real-time analytics on structure data can help us understand and capture
such changes and therefore improve mining performance.
Graph classification concerns the learning of a discriminative classifier, from training
data containing structure information, to classify previously unseen graph samples into spe-
cific categories, where the main challenge is to explore structure information in the training
data to build classifiers. One of the most common graph classification approaches is to
use sub-graph features to convert graphs into instance-feature representations, so that generic
learning algorithms can be applied for classification. Finding good sub-graph features is
therefore an important task for this type of learning approach. Common heuristics are to
find high-frequency sub-graphs or to apply feature selection measures, such as information
gain or the Gini index, to refine frequent sub-graphs. While all these methods have been
shown to be effective in the literature, due to the inherent complexity of graph data, the
process of finding sub-graphs is a non-trivial task and is always the most computationally
expensive procedure in the graph classification process. In addition, the genuine discrim-
inative power of sub-graph features has never been comprehensively studied. These obser-
vations raise the following concerns:
• The mining of sub-graphs involves graph matching operations, which are computa-
tionally demanding. Indeed, just testing whether a graph is a sub-graph of another
graph is an NP-complete problem [74]. It is very time-consuming to match patterns
in graph datasets, especially for sub-graphs with a large number of edges.
• We are interested in the discriminative power of sub-graphs. In other words, we
want to know how much the classification accuracy is affected by different properties
of the selected sub-graphs, such as the number of nodes, the number of edges, and
the size of the selected sub-graph set. To achieve this goal, a typical objective func-
tion, information gain (IG) [36], is used to test the discriminative power of attributes
(sub-graphs); a minimal sketch of this criterion is given after this list. Experiments
show that the objective function is neither monotonic nor anti-monotonic with respect
to the size of the sub-graphs [81]. In addition, the internal structural correlation
between sub-graphs can also impact the classification result. For instance, sub-graphs
with similar structures tend to have similar objective scores, whereas using highly
correlated sub-graphs for classification should be avoided, because it introduces
redundant features with high dependency, which may deteriorate the classification
accuracy. This fact can help to select better sub-graphs to represent graph datasets.
• Most existing graph classification methods are computationally expensive, so we
wonder whether there is an inexpensive way to achieve graph classification. More
specifically, if we use random sub-graphs to represent a graph dataset, how good will
the classification accuracy be compared to approaches that use sophisticated sub-
graph mining algorithms?
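To make the information gain criterion from the second concern concrete, here is a minimal
sketch (Python with hypothetical toy data, not code from the thesis) that computes the IG of
a single 0/1 sub-graph occurrence feature against binary class labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG of a binary sub-graph occurrence feature w.r.t. class labels.

    feature[i] = 1 if the candidate sub-graph occurs in graph i, else 0.
    """
    gain = entropy(labels)
    for value in (0, 1):
        mask = feature == value
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy data: occurrence of one sub-graph across six labeled graphs.
feature = np.array([1, 1, 1, 0, 0, 0])
labels = np.array([1, 1, 0, 0, 0, 1])
print(information_gain(feature, labels))  # higher = more discriminative
```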
Motivated by the above concerns, we carry out empirical studies on the roles of sub-
graph features for graph classification. We conduct experiments on four real-world graph
classification tasks, using three types of sub-graph features, including frequent sub-graphs,
frequent sub-graphs selected by information gain, and random sub-graphs, and using two
types of learning algorithms, namely Support Vector Machines and Nearest Neighbour. Our
experiments show that (1) the discriminative power of sub-graphs varies by their sizes;
(2) random sub-graphs have reasonably good performance; (3) the number of sub-graphs
is important to ensure good performance; (4) increasing the number of sub-graphs reduces
the difference between classifiers built from different sub-graphs; and (5) performance with
respect to different classifiers shows a similar trend. Our studies provide practical guidance
for designing effective sub-graph based graph classification methods.
From the empirical studies, we observe that, unlike traditional instance-feature
representations, graphs do not have explicit features, so an essential step in building a graph
classification model is to explore graph features to represent graph data in an instance-
feature space for effective learning [57]. The majority of existing graph classification mod-
els employ an occurrence-based feature representation model, in which a set of sub-graph
features is selected to represent the graph data by using the occurrence of the sub-graph in
the graph (either 0/1 occurrence or actual number of occurrences) as the feature value. In
this thesis, we refer to such an occurrence based representation model as coarse-grained
representation model.
Existing sub-graph based graph classification methods (coarse-grained representation
models) have a number of disadvantages, especially for graph stream classification. Com-
putational burden for isomorphism validation: because sub-graphs are selected as features
to represent the graph, the expensive sub-graph mining and isomorphism validation process
cannot be avoided (graph isomorphism is proven to be NP-hard) [19]. After the sub-graph
features have been selected, the isomorphism validation process has to be applied again to
map the graphs into the vector space. Moreover, because general learning methods prefer
low feature dimensions, a coarse-grained feature representation model has to limit the
number of sub-graphs used to represent the graph data. Severe and unbounded informa-
tion loss: because the coarse-grained representation does not characterize the degree of
closeness between a sub-graph pattern and the graph, it suffers information loss in repre-
senting the graph. Furthermore, for a graph stream with continuously growing volumes (i.e.
an increasing number of graphs) and changing structures (e.g. new nodes may appear in
incoming graphs), the disadvantage of the existing coarse-grained graph feature model is
even greater. This is because (1) stream volumes continuously increase, making it compu-
tationally expensive to explore sub-graph features, and (2) changes in the graph stream
(such as new structures) make the selected sub-graph features incapable of representing
new graphs. So a fine-grained graph feature model is needed to improve the classification
performance.
To build a novel fine-grained graph feature model, our main idea is to bypass the
expensive sub-graph mining process and use linear combinations of graph cliques to rep-
resent graphs. The linear combination ensures that the information loss in transforming
graphs into vectors is minimal. Compared to the traditional coarse-grained representation
model, a clear advantage of our method is that the fine-grained graph representation enjoys
a rigorous theoretical foundation that ensures minimal information loss for graph represen-
tation, and also avoids the expensive sub-graph isomorphism validation process, which is
proven to be NP-hard. To achieve this goal, our algorithm relies on two important steps to
extract the fine-grained graph representation: (1) finding a set of frequent graph cliques as
the base; and (2) using graph factorization to calculate a linear combination of graph cliques
that best represents a graph. After applying our fine-grained feature mapping model to
represent graphs, we can use any learning algorithm, such as Nearest Neighbour or Support
Vector Machines, for graph stream classification. Our method offers a number of advan-
tages, including a fast feature mining process, more precise graph representation, and better
performance for graph stream classification. Experiments on two real-world network graph
datasets demonstrate that our method outperforms state-of-the-art approaches in both clas-
sification accuracy and runtime efficiency.
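As an illustration of this idea, the following sketch assumes graphs and cliques are encoded
as binary edge-indicator vectors over a shared edge vocabulary, and uses non-negative least
squares in place of the full factorization model developed in Chapter 4 (all numbers are
hypothetical):

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical edge vocabulary of size 6; each column of C is the
# edge-indicator vector of one frequent clique (the "base").
C = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)

# Edge-indicator vector of the incoming graph to be represented.
g = np.array([1, 1, 1, 0, 1, 1], dtype=float)

# Fine-grained representation: coefficients w minimizing ||C w - g||_2,
# i.e. the clique combination that stays closest to the original graph.
w, residual = nnls(C, g)
print(w)         # the fine-grained feature vector used for learning
print(residual)  # the information loss of the representation
```

The residual makes the information loss of the mapping explicit, which is exactly the
quantity the coarse-grained 0/1 occurrence model leaves unbounded.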
Even though graph representation indeed preserves structure information to improve
classification accuracy, all existing frameworks that use graphs to represent objects rely
on one of two approaches to describe node content: (1) node as a single attribute: each
node has only one attribute (a single-attribute node). A clear drawback of this represen-
tation is that a single attribute cannot precisely describe the node content [40]. This repre-
sentation is commonly referred to as a single-attribute graph (Fig. 5-1 (A)). (2) node as a
set of attributes: a set of independent attributes describes the node content (Fig. 5-1
(B)). This representation is commonly referred to as an attributed graph [8, 14, 80]. How-
ever, with the development of information technology, data are becoming more and
more complex. Indeed, in many applications, the attributes/properties used to describe the
node content may themselves be subject to dependency structures. For example, in a citation
network each node represents one paper and edges denote citation relationships. It is insuf-
ficient to use one or multiple independent attributes to describe the detailed information of
a paper. Instead, we can represent the content of each paper as a graph, with nodes denoting
keywords and edges representing contextual correlations between keywords (e.g. co-occur-
rence of keywords in different sentences or paragraphs). As a result, each paper (a super-
node) and all references cited in the paper can form a super-graph, with each edge between
papers denoting their citation relationship.
To build learning models for super-graphs, the main challenge is to properly calculate
the distance between two super-graphs:
• Similarity between two super-nodes: because each super-node is a graph, the over-
lapped/intersected graph structure between two super-nodes reveals the similarity
between the two super-nodes, as well as the relationship between the two super-graphs.
The traditional hard-node-matching mechanism is unsuitable for super-graphs, which
require soft node matching.
• Similarity between two super-graphs: the complex structure of super-graphs re-
quires that the similarity measure consider not only the structure similarity, but also
the super-node similarity between two super-graphs. This cannot be achieved with-
out combining node matching and graph matching as a whole to assess the similarity
between super-graphs.
The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK)
for super-graphs. In this thesis, we generate a new product graph from two super-graphs
and then use weighted random walks on the product graph to calculate the similarity between
the super-graphs. A weighted random walk denotes a walk starting from a random weighted
node and following succeeding weighted nodes and edges in a random manner. The weight
of a node in the product graph denotes the similarity of the two super-nodes it joins. Given
a set of labeled super-graphs, we can use the weighted product graph to establish walk-based
relationships between two super-graphs and calculate their similarities. After that, we can
obtain the kernel matrix for super-graph classification.
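The sketch below illustrates the underlying walk-counting machinery under simplifying
assumptions: W is the adjacency matrix of the product graph, each product-graph node
carries a weight encoding the similarity of the two super-nodes it joins, and all walks are
summed with a geometric decay factor. It shows the generic weighted random-walk kernel
computation, not the exact WRWK formulation of Chapter 5:

```python
import numpy as np

def weighted_walk_kernel(W, node_weights, lam=0.1):
    """Geometric random-walk kernel on a (product) graph.

    W            : (n, n) adjacency matrix of the product graph
    node_weights : (n,) weights of product-graph nodes; for super-graphs,
                   the similarity of the two super-nodes forming each node
    lam          : decay factor; lam * spectral_radius(W) < 1 is required
                   so that the walk series sum_k lam^k W^k converges
    """
    n = W.shape[0]
    p = node_weights / node_weights.sum()      # weighted start distribution
    # Closed form of (sum_{k>=0} lam^k W^k) @ p via a linear solve.
    x = np.linalg.solve(np.eye(n) - lam * W, p)
    return float(node_weights @ x)             # weighted stop distribution

# Toy product graph with three nodes (hypothetical numbers).
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
weights = np.array([0.9, 0.5, 0.7])            # super-node similarities
print(weighted_walk_kernel(W, weights))
```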
Recent years have also witnessed an increasing number of applications involving net-
worked data, where instances are not only characterized by the content but are also subject
to dependency relationships. The mixed node content and structure information raise many
unique data mining tasks, such as network node classification [3]. In reality, changes are
essential components in real-world networks, mainly because user participation, interac-
tions, and responses to external factors continuously introduce new nodes and edges to the
network. In addition, a user may add/delete/modify online posts, which naturally result in
changes in the node content. As a result, the networks are inherently dynamic. Accurate
node classification in a streaming network setting is therefore much more challenging than
in static networks. In summary, node classification in streaming networks has at least three
major challenges:
• Streaming network structures: Network structures encode rich information about
node interactions inside the network, which should be considered for node classifi-
cation. In streaming networks, structures are constantly changing, so node classi-
fication needs to rapidly capture and adapt to such changes for maximal accuracy
gain.
• Streaming node features: for each node in a streaming network, the content may
constantly evolve (e.g. user posts or profile updates). As a result, the feature space
used to denote the node content is dynamically changing, resulting in streaming fea-
tures [78] with an infinite feature space. To capture changes, a feature selection method
should promptly select the most effective features to ensure that node classification can
quickly adapt to the new network.
• Unlimited network node space: the node volume of a streaming network is dynam-
ically increasing, resulting in an unlimited network node space and nodes that have
never appeared in the network before. Node classification needs to scale to the dy-
namically increasing node volume and incrementally update models discovered from
historical data to accurately classify new nodes.
To achieve high node classification accuracy, a fundamental issue is to properly char-
acterize such changes. One possible solution is to use features to capture changes in
streaming networks for node classification. For streaming networks, changes are intro-
duced through two major channels: (1) node content; and (2) topology structures. Because,
in a networked world, nodes close to each other in the network structure space tend to
share common content information [26], we can use the selected features to design a “similar-
ity gauging” procedure that assesses the consistency of the network node content and structures
to determine the labels of unlabeled nodes. A smaller gauging value indicates that the node
content and structures have a better alignment with the node label. The gauging-based
classification is thus carried out such that, for an unlabeled node, its label is the class which
results in the minimal gauging value with respect to the identified features. By updating
the selected features, the node classification can automatically adapt to changes in the
streaming network for maximal accuracy gain. Accordingly, our research proposes a
node classification method for networked data, and then a streaming feature selection
approach for dynamically drifting networks.
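A minimal sketch of the gauging idea follows, assuming a one-vs-rest label indicator and the
unnormalized graph Laplacian (the actual criterion also incorporates the selected features;
the network and numbers here are hypothetical):

```python
import numpy as np

def classify_node(A, labels, node, classes):
    """Pick the label for `node` that minimizes the Laplacian gauge x^T L x.

    A      : (n, n) symmetric adjacency matrix of the network
    labels : length-n array of class labels; -1 marks unlabeled nodes
    """
    L = np.diag(A.sum(axis=1)) - A       # unnormalized graph Laplacian
    best, best_score = None, np.inf
    for c in classes:
        x = (labels == c).astype(float)  # one-vs-rest indicator vector
        x[node] = 1.0                    # hypothesize label c for the node
        score = x @ L @ x                # sum of (x_i - x_j)^2 over edges:
        if score < best_score:           #   small = consistent with topology
            best, best_score = c, score
    return best

# Toy network: node 4 is unlabeled and links to nodes 2 and 3 (class 1).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1, -1])
print(classify_node(A, labels, node=4, classes=[0, 1]))  # -> 1
```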
In summary, taking complex structure information into account helps improve min-
ing performance on structure data. Moreover, exploiting changes in a dynamic setting is
a promising way to further improve mining accuracy. In this thesis, we investigate
different types of complex structure data and propose several methods to solve the above
challenges.
1.2 CONTRIBUTIONS
This thesis focuses on exploring and utilizing structure information to solve some important
problems in data mining. We list our contributions to each of them below:
• Exploring the roles of sub-graph features for graph classification: we carry out
empirical studies on four real-world graph classification tasks, using three types
of sub-graph features, including frequent sub-graphs, frequent sub-graphs selected by
information gain, and random sub-graphs, and using two types of learning algorithms,
namely Support Vector Machines and Nearest Neighbour. Our experiments show that
(1) the discriminative power of sub-graphs varies by their sizes; (2) random sub-graphs
have reasonably good performance; (3) the number of sub-graphs is important to ensure
good performance; (4) increasing the number of sub-graphs reduces the difference
between classifiers built from different sub-graphs; and (5) performance with respect
to different classifiers shows a similar trend. Our studies provide practical guidance
for designing effective sub-graph based graph classification methods.
• Exploring clique information and using a factorization method for fast graph
stream classification: we propose a fine-grained graph factorization framework for
efficient graph stream classification. Being fine-grained, our mapping framework
relies on a set of discriminative frequent cliques, instead of important sub-graph
patterns, to represent graphs. Such a fine-grained representation ensures that the
final instance-feature representation is sufficiently close to the original graph, with
a theoretical guarantee. Our algorithm relies on two important steps to extract the
fine-grained graph representation: (1) finding a set of frequent graph cliques as the
base; and (2) using graph factorization to calculate a linear combination of the graph
cliques that best represents a graph. Compared to the traditional coarse-grained rep-
resentation model, a clear advantage of our method is that the fine-grained graph
representation enjoys a rigorous theoretical foundation that ensures minimal infor-
mation loss for graph representation, and also avoids the expensive sub-graph iso-
morphism validation process.
• Exploring inner-structure information for super-graph classification: in this the-
sis, we introduce a special type of graph, where the content of a node can itself be
represented as a graph, as a “super-graph”. Likewise, we refer to a node whose
content is represented as a graph as a “super-node”. To build learning models for
super-graphs, the main challenge is to properly calculate the distance between two
super-graphs.
(1) Similarity between two super-nodes: because each super-node is a graph, the
overlapped/intersected graph structure between two super-nodes reveals the similar-
ity between the two super-nodes, as well as the relationship between the two super-
graphs. The traditional hard-node-matching mechanism is unsuitable for super-graphs,
which require soft node matching.
(2) Similarity between two super-graphs: the complex structure of super-graphs
requires that the similarity measure consider not only the structure similarity, but
also the node similarity between two super-graphs. This cannot be achieved with-
out combining node matching and graph matching as a whole to assess the similarity
between super-graphs.
These challenges motivate the proposed Weighted Random Walk Kernel for super-
graphs. We propose a weighted random walk kernel which calculates the similarity
between two super-graphs by assessing (a) the similarity between super-nodes of the
super-graphs, and (b) the common walks of the super-graphs. Our key contribution
is twofold: (1) a weighted random walk kernel considering node and structure simi-
larities between graphs; and (2) an effective kernel-based super-graph classification
method with a sound theoretical basis.
• Exploring manifold information for streaming network node classification: we
propose a novel node classification method for streaming networks. Our method
takes network structure and node labels into consideration to find an optimal subset
of features to represent the network. Based on the selected features, a streaming
network node classification method, SNOC, is proposed to classify unlabeled nodes
by minimizing the similarity distance in the network and the feature-based distance
between nodes. The main contribution, compared to existing works, is twofold:
(1) Streaming network node classification: we propose a new streaming network
node classification (SNOC) method that takes node content and structure similarity
into consideration to find important features to model changes in the network for
node classification. This method is not only more accurate than existing node clas-
sification approaches, but also effective in capturing changes in networks for node
classification.
(2) Streaming network feature selection: we introduce a novel streaming net-
work feature selection framework, SNF, for streaming networks. To ensure feature
evaluation can promptly adapt to changes in the network, SNF incrementally updates
the evaluation score of an existing feature by accumulating changes in the network.
This allows our method to effectively handle streaming networks with changing fea-
ture spaces and feature distributions for better runtime and performance gains.
1.3 PUBLICATIONS
• Ting Guo, Zhanshan Li, and Xingquan Zhu. Large Scale Diagnosis Using Associ-
ations between System Outputs and Components. Proceedings of the Twenty-Fifth
AAAI Conference on Artificial Intelligence, 2011, pp.1786-1787.
• Ting Guo and Xingquan Zhu. Understanding the Roles of Sub-graph Features for
Graph Classification: An Empirical Study Perspective. Proceedings of the 22nd ACM
International Conference on Information & Knowledge Management, 2013, pp. 817-
822.
• Ting Guo, Lianhua Chi, and Xingquan Zhu. Graph Hashing and Factorization for Fast
Graph Stream Classification. Proceedings of the 22nd ACM International Conference
on Information & Knowledge Management, 2013, pp. 1607-1612.
• Ting Guo and Xingquan Zhu. Super-Graph Classification. Proceedings of Advances
in Knowledge Discovery and Data Mining, 2014, pp. 323-336.
• Ting Guo, Xingquan Zhu, Jian Pei and Chengqi Zhang. SNOC: Streaming Network
Node Classification. Proceedings of the 14th IEEE International Conference on Data
Mining, 2014.
1.4 THESIS STRUCTURE
The rest of this thesis is organized as follows:
Chapter 2: This chapter surveys existing works on mining structure data. It summarizes
major approaches in the field, along with their technical strengths/weaknesses.
Chapter 3: In this chapter, we carry out empirical studies on four real-world graph clas-
sification tasks, by using three types of sub-graph features, including frequent sub-graphs,
frequent sub-graph selected by using information gain, and random sub-graphs. The two
types of learning algorithms include Support Vector Machines and Nearest Neighbour. Our
studies provide a practical guidance for designing effective sub-graph based graph classifi-
cation methods.
Chapter 4: We propose a fine-grained graph factorization approach for Fast Graph
Stream Classification (FGSC). A graph clique mining procedure is given, and the factor-
ization algorithm is built on the clique mining result. Experiments demonstrate that the
proposed method outperforms state-of-the-art approaches in both classification accuracy
and efficiency.
Chapter 5: In this chapter, we formulate a new super-graph classification task where
each node of the super-graph may contain a graph. To support super-graph classification,
we propose a Weighted Random Walk Kernel (WRWK) with sound theoretical properties,
including bounded similarity. Experiments confirm that our method significantly outper-
forms baseline approaches.
Chapter 6: This chapter introduces a new classification method for streaming networks,
namely streaming network node classification (SNOC). It provides algorithm details, theo-
retical proofs, time complexity analysis, and experiments.
Chapter 7: This chapter concludes this thesis and outlines directions for future work.
Chapter 2
Literature Review
This literature review provides an in-depth study of existing data mining methods
for complex structure data. Our main objective is to (1) summarize and categorize graph
classification methods, graph clustering methods, and their extensions to streaming structure
data; and (2) compare and analyze the strengths and deficiencies of existing approaches.
First, we introduce some basic definitions.
2.1 PRELIMINARY
DEFINITION 1 Graph: A graph G is a set of vertices (nodes) v connected by edges (links)
e; thus G = (v, e).
DEFINITION 2 Vertex (Node): A node v is a terminal point or an intersection point of a
graph. It is the abstraction of a location such as a city, an administrative division, a road
intersection or a transport terminal (stations, terminuses, harbors and airports).
DEFINITION 3 Edge (Link): An edge e is a link between two nodes. The link (i, j) is
of initial extremity i and of terminal extremity j. A link is the abstraction of a transport
infrastructure supporting movements between nodes. It has a direction that is commonly
represented as an arrow. When an arrow is not used, it is assumed the link is bi-directional.
DEFINITION 4 Sub-Graph: A sub-graph G′ = (v′, e′) of a graph G = (v, e) is a graph
whose vertices and edges are subsets of those of G (v′ ⊆ v, e′ ⊆ e). Unless the global
transport system is considered as a whole, every transport network is in theory a sub-graph
of another. For instance, the road transportation network of a city is a sub-graph of a
regional transportation network, which is itself a sub-graph of a national transportation
network.
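As a concrete anchor for these definitions, the following sketch (hypothetical Python, not
code from the thesis) stores a labeled graph as a node-label map plus an undirected edge
set, and extracts the sub-graph induced by a vertex subset:

```python
class Graph:
    """G = (v, e): labeled nodes and undirected edges."""

    def __init__(self):
        self.nodes = {}      # node id -> label
        self.edges = set()   # frozenset({i, j}) for an undirected edge

    def add_node(self, i, label):
        self.nodes[i] = label

    def add_edge(self, i, j):
        self.edges.add(frozenset((i, j)))

    def induced_subgraph(self, vertex_subset):
        """Sub-graph G' = (v', e') induced by a vertex subset v' of v."""
        sub = Graph()
        for i in vertex_subset:
            sub.add_node(i, self.nodes[i])
        for edge in self.edges:
            if edge <= set(vertex_subset):
                i, j = tuple(edge)
                sub.add_edge(i, j)
        return sub

g = Graph()
for i, lab in enumerate("CCOH"):
    g.add_node(i, lab)           # e.g. atoms of a chemical compound
for i, j in [(0, 1), (1, 2), (1, 3)]:
    g.add_edge(i, j)             # e.g. chemical bonds
print(g.induced_subgraph([0, 1, 2]).edges)
```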
2.2 GRAPH CLASSIFICATION
Given a collection of training samples, each of which is suitably labeled with a class label,
classification task is to build a learning model to automatically assign a previously unseen
example into a specific category (or class). For data with dependency structure, such as
the ones shown in Figure 2-1¹, building a learning model to classify them into respective
categories has proven to be a challenging task. This is mainly attributed to the fact that graph
data normally do not have features immediately available to support learning, whereas
nearly all existing supervised learning methods require training data to be represented
in a tabular instance-feature format.
To support graph classification, one of the fundamental challenges is to charac-
terize graphs in tabular feature formats, so that generic supervised learning algorithms
can be used to derive learning models from graphs. The three most popular approaches
are (1) sub-graph feature based methods; (2) global structure based methods, such as graph
edit distance and graph embedding; and (3) graph kernel methods.
Figure 2-1: Graph examples collected from different domains.¹

¹In Fig. 2-1, the upper ones are the original representations and the bottom ones are the
corresponding transferred graphs: (a) protein data (each node denotes a local amino acid
region with special secondary structure and each edge denotes the nearest neighbour in
space); (b) a graph collected from chemical compound data (the 3D structure can be trans-
ferred into a graph by using molecules connected with chemical bonds); (c) a graph collected
from the DBLP citation dataset, where each node denotes a paper ID or a keyword and each
edge denotes the citation relationship between papers or keywords appearing in the paper's
title. The label of the graph (y) can be used for training a learning model for graph clas-
sification.

Sub-graph feature based methods use substructure patterns (sub-graphs) as features to
represent each graph as a feature vector, so that a training graph set can be converted into a
generic training instance set for learning. Sub-graph based approaches are useful in many
graph-related tasks, including discriminating different groups of graphs, classifying and
clustering graphs, and building graph indices in vector spaces. An inherent advantage of
embedding graphs into vector space is that it makes existing algorithmic tools developed
for feature based object representations available for graph structure data. For sub-graph
feature based methods, one key challenge is how to select discriminative sub-graphs for
mapping graphs into the vector space, which can help improve the classification accuracy.
To achieve this goal, several objective functions are used to test the discriminative power of
attributes (sub-graphs), such as frequency, Information Gain (IG) [36], Fisher Score [23],
and Laplacian Score [42]. Because sub-graph features can help convert each single graph
into a vector representation, graph classification can be achieved by using popular learning
algorithms such as Decision Trees [36] and Support Vector Machines (SVM) [16].
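To illustrate this pipeline end to end, the following sketch assumes the sub-graph occurrence
matrix has already been computed (toy hypothetical numbers); once graphs are embedded in
vector space, any generic learner applies:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical occurrence matrix: X[i, k] = 1 iff sub-graph feature k
# occurs in training graph i (the output of the mapping step above).
X = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 0]])
y = np.array([1, 1, 0, 0])   # graph class labels

# Train a generic classifier on the vectorized graphs.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[1, 0, 1, 0]]))   # classify an unseen graph's vector
```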
Global structure based methods apply inexact graph matching, in which error cor-
rection is made part of the matching process. Central to this approach is measuring the
similarity of pairwise graphs, which can be done in many ways. One measure, which
has garnered particular interest because it is tolerant to noise and distortion, is the graph
edit distance (GED), defined as the cost of the least expensive sequence of edit operations
needed to transform one graph into another [24]. GED algorithms are influenced
considerably by the cost functions associated with edit operations: the GED between pair-
wise graphs changes as the cost functions change, and its validity depends on the
rationality of the cost function definition [6, 17].
Graph kernel methods use kernels to calculate the similarity between graphs. Various
graph kernels, such as subtree kernels [63], shortest-path kernels [5], joint ker-
nels [30], and cyclic pattern kernels [31], have been proposed. All these methods have been
shown to be effective, but they incur high computational overhead. Geng Li et al. propose an
alternative approach based on feature vectors constructed from different global topologi-
cal attributes [47]. A dissimilarity space embedding graph kernel based on graph edit
distance has also been proposed [38], where the edit distance can be calculated by the
Lipschitz embedding method [39]. Experiments show that classification accuracy can be
statistically significantly enhanced by using graph edit distance with prototype reduction
and dimensionality reduction [7]. Vogelstein et al. develop a set of signal-subgraph esti-
mators for graph classification [76]; these estimators can be considered local sparse and
low-rank matrix decompositions.
2.3 FREQUENT SUB-GRAPH MINING (FSM)
Sub-graph patterns have been proved a good representation for closing the gap between
structural and statistical pattern classification. Selecting features, in the form of sub-graph
patterns, from graph data is a well established area in graph data mining, where early
methods often use Apriori-based Graph Mining (AGM) to identify frequent induced sub-
graphs [33]. Evaluation of AGM on chemical carcinogenesis data demonstrated that it is
more efficient than an inductive logic programming based approach combined with a level-
wise search. Meanwhile, Kuramochi and Karypis develop an FSG [45] method which uses
Breath First Search (BFS) strategy to grow candidates whereby pairs of identified frequent
k sub-graphs are joined to generate (k + 1) sub-graphs. FSG uses a canonical labeling
method for graph comparison and calculates the support of the patterns using a vertical
transaction list data representation. Experiments show that FSG is inefficient when graphs
contain many vertexes and edges that have identical labels because the join operation used
by FSG allows multiple automorphism of single or multiple cores. In summary, AGM and
FSG are all Apriori-based frequent sub-graph mining algorithms.
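The join-and-prune control flow shared by AGM and FSG can be sketched as follows; as a
simplification, graphs are treated as sets of labeled edges and candidates as edge subsets,
which sidesteps the connectivity and isomorphism checks that real FSM algorithms must
perform:

```python
from itertools import combinations

def frequent_edge_sets(graphs, min_support, max_size=3):
    """Level-wise (Apriori-style) mining of frequent edge sets."""
    def support(pattern):
        return sum(pattern <= g for g in graphs)

    # Level 1: frequent single edges.
    edges = {e for g in graphs for e in g}
    level = [frozenset([e]) for e in edges
             if support(frozenset([e])) >= min_support]
    frequent = list(level)
    for size in range(2, max_size + 1):
        # Join step: merge patterns whose union has exactly `size` edges.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        # Prune step: keep only candidates meeting the support threshold.
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)
        if not level:
            break
    return frequent

g1 = {("A", "B"), ("B", "C"), ("C", "D")}
g2 = {("A", "B"), ("B", "C")}
g3 = {("A", "B"), ("C", "D")}
print(frequent_edge_sets([g1, g2, g3], min_support=2))
```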
In Apriori-based frequent sub-graph mining algorithms, joining two size-k frequent
sub-graphs to generate size-(k + 1) graph candidates is time-consuming and computation-
ally ineffective [28]. To improve efficiency, several pattern-growth-based graph mining
approaches have been proposed. FFSM [32] extends graphs from a single sub-graph
directly: for each discovered graph G, FFSM recursively adds new edges until all frequent
super-graphs of G are discovered, and the recursion stops when no more frequent graphs
can be generated. Because a pattern-growth algorithm extends a frequent graph pattern
by adding a new edge with every possible label, there is a potential risk that the same
graph pattern may be generated more than once [28]. The gSpan algorithm [82] solves
this problem by using a right-most extension technique. It uses Depth-First Search (DFS)
lexicographic ordering to construct a tree-like lattice over all possible graph patterns. The
search tree is traversed in a DFS manner, and all graph patterns with non-minimal DFS
codes are pruned so that redundant candidate generation is avoided. gSpan is arguably
the most frequently cited FSM algorithm.
2.4 SUB-GRAPH FEATURE SELECTION
To find discriminative sub-graph features for graph classification, common methods
first discover a complete set of sub-graph patterns by using a minimum support threshold
or other parameters such as a sub-graph size limit [13], and then use a feature selection
criterion (such as Information Gain [12, 36]) to select a small set of discriminative sub-
graphs as features. Such a two-step sub-graph feature selection approach is considered
ineffective, and several methods have been proposed to directly generate a compact set of
discriminative sub-graphs during the frequent sub-graph mining process. Kudo et al. [43, 59]
propose a boosting based method, where a boosting algorithm repeatedly constructs multi-
ple weak classifiers on weighted training graphs and each weak classifier is, in fact, a single
sub-graph. Yan et al. propose to mine significant sub-graphs by using biased search, which
exploits the correlation between structural similarity and significance similarity [81]. Ranu
and Singh propose a scalable method, GraphSig, to mine significant sub-graphs based on
a feature vector representation of graphs [56]. GraphSig uses domain knowledge to select
a meaningful feature set, and prior probabilities of features are used to evaluate the signifi-
cance of sub-graphs in the feature space. This strategy can use existing frequent sub-graph
mining techniques to mine significant patterns in a scalable manner. Saigo et al. propose
an iterative sub-graph mining method based on partial least squares regression, named gPLS
[58]. This method is efficient because the weight vector is updated with elementary matrix
calculations. To support real-time graph queries, some researchers index a small set of
sub-graphs to enable scalable querying of graph databases [57, 67, 83, 84]. Sun et al. intro-
duce a graph search method for sub-graph queries based on sub-graph frequencies. In
addition, evolutionary computation has also been introduced into discriminative sub-graph
mining (GAIA) [35].
The above sub-graph feature selection methods require the graph data to be fully la-
beled, whereas labeling graph data may incur expensive costs. An alternative solution is
to combine both labeled and unlabeled graphs for sub-graph feature selection. Kong
et al. integrate active learning theory into feature and sample selection for graph classifi-
cation [41]. They maximize the dependency between sub-graph patterns and graph labels
by using an active learning framework and search for the optimal graph to query for a label.
In addition, a feature evaluation criterion, gSemi, has been derived to estimate the significance
of sub-graphs based on both labeled and unlabeled graphs [42], combining semi-supervised
feature selection and sub-graph feature mining. For large graphs, a technique called
D-walks (discriminative random walks) has been developed for semi-supervised classifica-
tion [9]; the class of an unlabeled node can be predicted by maximizing its betweenness
score with respect to labeled data. For applications without negative graph samples, PU
learning for graph classification has been developed, focusing on selecting useful sub-graph
features based on positive and unlabeled graph data only [87].
2.5 DATA STREAM MINING
Our problem is also related to data stream mining. An initial study of data stream mining
was proposed in [21], in which incremental learning was used to tackle the increasing
volume of the data stream. Ensemble learning-based methods [69, 77] have also been used
to address concept drift and data distribution changes in data streams. All these frameworks
only work for generic data with instance-feature representations, and no features are
immediately available for graph streams for learning and classification.
Several studies have investigated the graph stream classification problem. To the best
of our knowledge, only a few works [1, 15, 46] are directly related to our problem. These
methods apply hashing techniques to sketch the graph stream to save computational cost
and control the size of the subgraph-pattern set. In [1], Aggarwal proposes a 2-D random
edge hashing scheme to construct an “in-memory” summary for sequentially presented
graphs and uses a simple heuristic to select a set of most discriminative frequent patterns
for building a rule-based classifier. Although this method has demonstrated promising performance on graph stream classification, it has two inherent limitations: (1) the selected sub-graphs may contain disconnected edges, which have less discriminative capability than connected sub-graph patterns because they lack structural meaning; and (2) a frequent pattern mining process must be performed on a summary table comprising massive transactions, resulting in high computational cost. In [15], a clique hashing strategy is used to improve computational efficiency for graph stream tasks, but the authors still use the traditional coarse-grained graph representation, which causes severe information loss in the mapping process and further decreases classification accuracy. In [46], the authors propose a hash kernel to project arbitrary graphs onto a compatible feature space for similarity computation, but this technique can only be applied to node-attributed graphs.
2.6 REAL-TIME ANALYSIS
As mentioned above, real-time analysis aims to capture the changes in each time period and use them to improve mining performance. Zhu and Shasha have proposed techniques, based on the discrete Fourier transform, to compute statistical measures over time-series data streams [90]. Their system, called StatStream, is able to compute approximate, error-bounded correlations and inner products over an arbitrarily chosen sliding window. Lin et al. have proposed the use of a symbolic representation of time-series data streams [48]. This representation allows dimensionality/numerosity reduction, and they have demonstrated its applicability to clustering, classification, indexing, and anomaly detection. Chen et al. have proposed the application of so-called regression cubes to data streams [11]. Owing to the success of OLAP technology in applications over static stored data, it has been proposed to use multidimensional regression analysis to create a compact cube that can answer aggregate queries over the incoming streams. This research has been extended for adoption in an ongoing project, Mining Alarming Incidents in Data Streams (MAIDS). Himberg et al. have presented and analyzed randomized variations of segmenting time-series data streams generated onboard mobile phone sensors [29]. One application of clustering time series discussed in that work is changing the user interface of the mobile phone screen according to the user context. This study shows that Global Iterative Replacement provides an approximately optimal solution with high runtime efficiency.
2.7 ROADMAP
The overall roadmap of this thesis is given in Fig. 2-2.
Figure 2-2: The overall roadmap of this thesis.
Chapter 3
Understanding the Roles of Sub-graph
Features for Graph Classification: An
Empirical Study Perspective
3.1 INTRODUCTION
The advancement of data acquisition and analysis technology has resulted in many appli-
cations involving complex structure data. Examples include Cheminformatics, Bioinfor-
matics, and Social Network Analysis. Different from traditional data that are represented
as feature vectors, the new data are often represented as graphs to denote the content of
the data entries and their structural relationships. This has resulted in an increasing interest
in graph classification, which tries to learn a classification model from a number of labeled graphs to separate previously unseen graphs into different categories. This problem can, in general, be divided into two large branches: (1) classification of nodes (or edges) in a single large graph, such as a social network; and (2) classification of a set of small graphs, as in drug activity prediction. This experimental study focuses on the latter, i.e., the classification of small graphs.
Although graph representation has some inherent advantages, for example, that the number of nodes and edges can vary to best capture the complex relationships between objects, the disadvantage is also obvious: a lack of learning algorithms able to support graph-structured data.

Figure 3-1: An example of sub-graph pattern representation. The left panel shows two graphs, G1 and G2, and the right panel gives the two indicator vectors showing whether a sub-graph exists in the graphs.

Finding effective representations for graph data is therefore a major challenge for graph classification. One possible solution is to define some graph kernels [37]
that use graph structures, such as topology, paths, and edge labels of the graphs, to calculate
distance between a pair of graphs, and then use the distance to formulate a generic learning
algorithm. The second major solution is using substructure patterns (i.e. sub-graphs) as
features to represent each graph as a feature vector. So a training graph set can be con-
verted into a generic training instance set for learning. Fig. 3-1 shows a sub-graph pattern
representation example. More specifically, sub-graphs g1, g2, and g3 are used to convert graphs G1 and G2 into feature vectors. Depending on whether g1, g2, and g3 appear in each respective graph, say G1, a binary feature vector X1 = [1, 1, 0, ...]T is created to represent G1. Sub-graph-based approaches are useful in many graph-related tasks, including discriminating different groups of graphs, classifying and clustering graphs, and building graph indices in vector spaces. An inherent advantage of embedding graphs into a vector space is that it makes existing algorithmic tools developed for feature-based object representations available for graph-structured data.
In this chapter, we mainly focus on sub-graph pattern mining based graph classifica-
tion. Frequent Sub-graph Mining (FSM) has emerged as a key sub-graph mining problem,
mainly because sub-graph patterns are meaningful tokens to effectively map graphs into
vector space for clustering or classification. Given a graph dataset D = {G0, G1, ..., Gn}, let Supg denote the number of graphs in D that include g as a sub-graph. The objective of FSM is to find, from D, all frequent sub-graphs whose number of occurrences is at least a specified threshold minsup (i.e., Supg ≥ minsup). Existing
FSM methods can be roughly divided into two categories: (i) Apriori-based approaches,
and (ii) pattern growth-based approaches. The Apriori-based approaches carry out pattern
mining in a generate-and-test manner by using a Breadth First Search (BFS) strategy to
explore the sub-graph lattice of the given database. Three well-established Apriori-based FSM algorithms are AGM, FSG, and DPMine [33, 22, 45, 75]. Pattern growth-based
approaches adopt a Depth First Search (DFS) strategy where, for each discovered sub-graph
g, the sub-graph is extended recursively until all frequent super-graphs of g are discovered.
FSM algorithms that adopt a DFS strategy tend to need less memory because they traverse
the lattice of all possible frequent sub-graphs in a DFS manner. Two well-known example
algorithms are gSpan and FFSM [32, 82].
Given a set of frequent sub-graphs g1, g2, ..., gd, a graph Gx can then be represented as
a feature vector X = [x1, x2, ..., xd]T , where xi = 1 if gi ⊆ Gx; otherwise, xi = 0 [81].
Existing research has demonstrated that such vectorization helps to build efficient indices
to support fast graph search [83]. In addition, this representation can also help to preserve
some basic structural information for graph data analysis.
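For illustration, the following is a minimal Python sketch of this vectorization (not from the original text); is_subgraph is a hypothetical helper standing in for a sub-graph isomorphism test, such as the one sketched after Definition 9.

import numpy as np

# Map a graph G to a 0/1 vector over a list of sub-graph features.
# `is_subgraph(g, G)` is a hypothetical containment test supplied by the caller.
def to_feature_vector(G, subgraph_features, is_subgraph):
    return np.array([1 if is_subgraph(g, G) else 0 for g in subgraph_features])

# e.g. X1 = to_feature_vector(G1, [g1, g2, g3], is_subgraph)  ->  [1, 1, 0, ...]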
Several reasons motivate the proposed empirical studies on frequent sub-graphs for
graph classification. Firstly, the mining of the sub-graphs involves graph matching op-
erations that are computationally demanding. Indeed, just testing whether a graph is a
sub-graph of another graph is an NP-complete problem [74]. In Fig. 3-2, we report the
sub-graph mining runtime with respect to an increasing number of edges of the sub-graphs
for a given graph set. The results show that it is very time-consuming to match patterns in
graph datasets, especially for sub-graphs with a large number of edges. Secondly, we are interested in the discriminative power of sub-graphs. In other words, we want to know how much the classification accuracy is affected by different properties of the selected sub-graphs, such as the number of nodes, the number of edges, and the size of the selected sub-graph set.

Figure 3-2: The runtime of frequent sub-graph pattern mining with respect to the increasing number of edges of sub-graphs.
To achieve this goal, a typical objective function, information gain (IG) [36], is used to test the discriminative power of attributes (sub-graphs). Experiments show that the objective function is neither monotonic nor anti-monotonic with respect to the size of the sub-graphs [81]. In addition, the internal structural correlation between sub-graphs can also impact the classification result. For instance, sub-graphs with similar structures tend to have similar objective scores; using highly correlated sub-graphs for classification should nevertheless be avoided, because it introduces redundant features with high dependency that may deteriorate the classification accuracy. This fact can help to select better sub-graphs to represent graph datasets. Thirdly, because most existing graph classification methods are computationally expensive, we wonder whether there is an inexpensive way to achieve graph classification. More specifically, if we use some random sub-graphs to represent a graph dataset, how good will the classification accuracy be, compared to the approaches that use sophisticated sub-graph mining algorithms?
In this chapter, we present an empirical comparison of graph classification based on
different sets of sub-graph features, including frequent sub-graphs, random sub-graphs,
and sub-graphs selected using the information gain measure. The comparisons are carried out by using different numbers of sub-graphs, and sub-graphs with different numbers of edges. Experiments
are performed on seven real-world graph datasets from three domains, by using two popular
learning algorithms including support vector machines and nearest neighbour. The studies
show that using random sub-graph features has a reasonably good performance compared
to its expensive peers which use frequent sub-graphs. Experiments also show that sub-
graphs with many edges are not necessarily good candidates for graph classification.
The remainder of the chapter is organized as follows. Section 3.2 formulates the prob-
lem and defines some important notations. Experimental studies are reported in Section 3.3.
3.2 PROBLEM FORMULATION
In this section, we introduce several basic concepts related to graph representation, frequent
sub-graph mining, and graph classification.
3.2.1 Graph and Sub-graph
DEFINITION 5 Graph: Let LV and LE be finite label sets for nodes and edges, respectively. A graph is a tuple g = (V, E, μ, ν), where:
• V denotes a finite set of nodes,
• E ⊆ V × V is the set of edges,
• μ : V → LV is the node labeling function, and
• ν : E → LE is the edge labeling function.
The number of nodes of a graph g is denoted by |g|, while D represents the set of all
graphs over the label alphabets LV and LE .
DEFINITION 6 Sub-graph: Let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. Graph g1 is a sub-graph of g2, denoted by g1 ⊆ g2, if
• V1 ⊆ V2,
• E1 ⊆ E2,
• μ1(v) = μ2(v) for all v ∈ V1, and
• ν1(e) = ν2(e) for all e ∈ E1.
An example is shown in Fig. 3-3, where the label of each node is color coded (the same
color means the same label). Graph (b) is a sub-graph of (a).
Figure 3-3: A conceptual view of graph vs. sub-graph. (b) is a sub-graph of (a).
3.2.2 Frequent Sub-graph Mining
DEFINITION 7 Support: Given a graph database D, the support of a sub-graph g, de-
noted by Supg, is the fraction of the graphs in D of which g is a sub-graph, formally:
Supg = |{G′ ∈ D | g ⊆ G′}| / |D|.

Given a user-specified minimum support threshold minsup and a graph database D, a frequent sub-graph is a sub-graph whose support is at least minsup (i.e. Supg ≥ minsup), and the frequent sub-graph mining problem is to find all frequent sub-graphs in D.
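The following minimal Python sketch restates Definition 7 and the FSM objective; is_subgraph is again a hypothetical sub-graph isomorphism test, and the sketch deliberately omits candidate enumeration (AGM, gSpan, etc.).

# Fraction of graphs in database D that contain g as a sub-graph (Definition 7).
def support(g, D, is_subgraph):
    return sum(1 for G in D if is_subgraph(g, G)) / len(D)

# The FSM objective: keep every candidate whose support reaches minsup.
def frequent_subgraphs(candidates, D, minsup, is_subgraph):
    return [g for g in candidates if support(g, D, is_subgraph) >= minsup]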
DEFINITION 8 Graph Isomorphism: Given two graphs g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2), a graph isomorphism is a bijective function f : V1 → V2 satisfying
• μ1(v) = μ2(f(v)) for all nodes v ∈ V1,
• for each edge e1 = (v1, v2) ∈ E1, there exists an edge e2 = (f(v1), f(v2)) ∈ E2 such that ν1(e1) = ν2(e2), and
• for each edge e2 = (v1, v2) ∈ E2, there exists an edge e1 = (f−1(v1), f−1(v2)) ∈ E1 such that ν1(e1) = ν2(e2).
Two graphs are called isomorphic if there exists a graph isomorphism between them.
Figure 3-4: An example of graph isomorphism.
DEFINITION 9 Sub-graph Isomorphism: Given two graphs g1 = (V1, E1, μ1, ν1) and
g2 = (V2, E2, μ2, ν2), an injective function f : V1 → V2 from g1 to g2 is a sub-graph
isomorphism if there exists a sub-graph g ⊆ g2 such that f is a graph isomorphism between
g1 and g.
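As a hedged illustration, a sub-graph containment test of this kind can be written with networkx, assuming nodes and edges carry a 'label' attribute; since Definition 6 uses the non-induced notion (E1 ⊆ E2), the matching call is subgraph_is_monomorphic (available in newer networkx releases), whereas subgraph_is_isomorphic would require an induced copy of g1 inside g2.

import networkx as nx
from networkx.algorithms import isomorphism as iso

# Test whether g1 occurs inside G2 with matching node and edge labels.
def contains_subgraph(G2, g1):
    gm = iso.GraphMatcher(
        G2, g1,
        node_match=iso.categorical_node_match("label", None),
        edge_match=iso.categorical_edge_match("label", None),
    )
    return gm.subgraph_is_monomorphic()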
3.2.3 Graph Classification
DEFINITION 10 Graph Embedding: Let D be a set of graphs. A graph embedding is a
function ϕ : D → Rn mapping graphs to n-dimensional vectors, i.e.,
ϕ(g) = (x1, x2, ..., xn)′.
In Table 3.1, we summarize the advantages and disadvantages of graph representation compared with vector representation. Embedding graphs into vector spaces makes existing
algorithmic tools developed for feature based object representations immediately available
for graph classification.
After mapping graph data into vector space, graph classification is almost the same as
learning and classifying generic instances represented in the feature space.
3.3 EXPERIMENTAL STUDY
In this section we first discuss the benchmark data and the experimental settings, and then
report detailed experimental results and analysis.
Table 3.1: Comparison of the advantages and disadvantages of vector representation vs. graph representation
3.3.1 Benchmark Data
We carry out our experimental studies on seven real-world graph datasets, which are col-
lected from three different domains: Chemical compound, Protein structures, and Citation
network.
Chemical Compound: The activity of chemical compound molecules can be predicted by
their special 3-dimensional structures. In our experiments, we use a series of binary-label
graph datasets from the PubChem website1. This website provides real-world datasets on
the biological activities of small molecules, containing the bioassay records for anti-cancer
screen tests with different cancer cell lines, collected from the National Cancer Institute (NCI).
Each dataset belongs to a certain type of cancer screen with active or inactive response (i.e.
class labels) [81]. Because each NCI bioassay dataset contains very few active graphs, we
use under-sampling to down-sample inactive graphs to form a relatively balanced dataset
for performance evaluation. The number of vertices in most of those compounds ranges
from 10 to 200. We use five such graph datasets in the experiments; their details are reported in Table 3.2.
Protein Structure: Proteins are organic compounds made of amino acids sequences joined
together by peptide bonds. A huge amount of proteins have been sequenced over years, and
the structures of thousands of proteins have been resolved so far. The well known role of
proteins in the cell is as enzymes which catalyze chemical reactions [38]. The D&D dataset
we used contains 1178 protein structures that can be divided into two classes: 691 enzymes
1 http://pubchem.ncbi.nlm.nih.gov
Table 3.2: NCI datasets used in experiments
Datasets     Data Size   Avg. # Vertices   Avg. # Edges   # Node Labels   # Edge Labels
NCI33        2934        30.2              32.5           39              3
NCI410       2881        21.5              30.1           42              3
NCI2302      13764       23.9              27.7           38              3
NCI489028    11092       33.5              30.7           42              3
NCI485346    18426       29.5              31.4           44              3
and 487 non-enzymes [20]. Each protein is represented by a graph, in which the nodes
are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms
apart. These proteins, with an average size of 285 vertices and 716 edges, are larger and more strongly connected than the molecules from the NCI screening.
Table 3.3: DBLP dataset used in experiments
Classes   Descriptions                                          # Papers   # Graphs
DBDM      SIGMOD, VLDB, ICDE, EDBT, PODS, DASFAA, SSDBM,        20601      9530
          CIKM, DEXA, KDD, ICDM, SDM, PKDD, PAKDD
CVPR      ICCV, CVPR, ECCV, ICPR, ICIP, ACM Multimedia, ICME    18366      9926
DBLP Citation Network: The DBLP dataset is composed of bibliography data in the field
of computer science2. Each record in DBLP is associated with a number of attributes such
as paper ID, authors, years, title, abstract and reference ID [71]. We build a binary-label
graph dataset by using papers published in a list of conferences (as shown in Table 3.3).
The classification task is to predict whether a paper belongs to the field of DBDM (database
and data mining) or CVPR (computer vision and pattern recognition), by using references
and title of each paper. In our experiments, each paper in DBLP is represented as a graph,
where each node denotes a paper ID or a keyword, and each edge denotes either a citation relationship between papers or the co-occurrence of keywords in a paper's title. More specifically,
2 http://arnetminer.org/citation
we denote that (1) each paper ID is a node; (2) if a paper A cites another paper B, there is an
edge between A and B; (3) each keyword in the title is also a node; (4) each paper ID node
is connected to the keyword nodes of the paper; and (5) for each paper, its keyword nodes
are fully connected with each other. An example of DBLP graph data is shown in Figure
3-5, where the rectangles are paper ID nodes and diamonds are keyword nodes. The paper
ID17890 cites (connects) paper ID17883 and ID18068, and ID17890 has keywords Patch,
Motion, and Invariance in its title. Paper ID18068 has keywords Edge and Detection, and
paper ID17883’s title includes keywords Vision and Edge. For each paper, the keywords in
the title are linked with each other.
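For concreteness, the five construction rules can be sketched as follows (assuming networkx); paper is a hypothetical record with fields id, references, and keywords, and is not part of the DBLP schema described above.

import itertools
import networkx as nx

def build_dblp_graph(paper):
    G = nx.Graph()
    G.add_node(paper.id, kind="paper")        # rule (1): paper ID node
    for ref in paper.references:              # rule (2): citation edges
        G.add_node(ref, kind="paper")
        G.add_edge(paper.id, ref)
    for kw in paper.keywords:                 # rules (3)-(4): keyword nodes
        G.add_node(kw, kind="keyword")
        G.add_edge(paper.id, kw)
    for u, v in itertools.combinations(paper.keywords, 2):
        G.add_edge(u, v)                      # rule (5): keywords fully connected
    return G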
Figure 3-5: Graph representation for a paper (ID17890) in DBLP. The node in red is the main paper, nodes in black ellipses are citations, and nodes in black boxes are keywords.
3.3.2 Sub-graph Features
To understand the relationship between sub-graph features and the classification accuracy,
we carry out sub-graph feature selection by using different sizes of sub-graphs (with respect to the number of edges), ranging from one to nine (sub-graphs with more than nine edges
are much less frequent). In Table 3.4, we report the number of sub-graph features with
respect to different sizes (i.e. number of edges) in the benchmark datasets. Meanwhile, we
also vary the number of selected sub-graph features (i.e. the size of the “selected feature set”)
to include 50, 1000, and 2000 sub-graphs respectively, in all experiments.
To select sub-graph features, we use frequency-based and Information Gain (IG) based feature selection criteria, respectively. For the frequency-based criterion, we simply select the sub-graphs
Table 3.4: Number of sub-graphs with respect to different sizes (i.e. number of edges)
             #Edges
Datasets     1      2      3      4      5      6      7      8      9
NCI33        134    437    1361   >2000  >2000  >2000  >2000  >2000  >2000
NCI410       53     174    1512   >2000  >2000  >2000  >2000  >2000  >2000
NCI2302      140    449    1706   >2000  >2000  >2000  >2000  >2000  >2000
NCI489028    48     417    1298   >2000  >2000  >2000  >2000  >2000  >2000
NCI485346    39     140    562    1172   >2000  >2000  >2000  >2000  >2000
D&D          190    1356   >2000  >2000  >2000  >2000  >2000  >2000  >2000
DBLP         >2000  >2000  >2000  >2000  >2000  >2000  >2000  >2000  >2000
with the highest frequency. For the IG-based approach, we calculate the Information Gain (IG) between each sub-graph and the class label, and select the ones with the highest IG scores as sub-graph features. It is worth noting that IG is commonly used for sub-graph feature selection, and the most significant patterns likely fall into the high quantile of frequency [81].
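A minimal sketch of the IG criterion, assuming a binary feature matrix X and class labels y encoded as non-negative integers, is given below; it is an illustrative restatement of IG = H(Y) − H(Y|X), not the exact implementation used in the experiments.

import numpy as np

def entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Information gain between a 0/1 feature column x and integer labels y.
def information_gain(x, y):
    ig = entropy(np.bincount(y))                # H(Y)
    for v in (0, 1):
        mask = x == v
        if mask.any():
            ig -= mask.mean() * entropy(np.bincount(y[mask]))  # - p(x=v) H(Y|x=v)
    return ig

# Indices of the k sub-graph features with the highest IG scores.
def top_k_by_ig(X, y, k):
    scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]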
In addition to frequency and Information Gain based sub-graph features, we also em-
ploy a random feature selection approach. In our experiments, we first collect all one-edge
sub-graphs from the training set. After that, multiple-edge sub-graphs are generated by
randomly combining one-edge sub-graphs. If a multiple-edge sub-graph appears more than once in the training set, this sub-graph will be considered as a possible random sub-graph feature. This process is equivalent to collecting random sub-graphs from the whole structure space. Because our experiments focus on sub-graph features with different numbers of edges, we generate a sub-graph with X edges by using X one-edge sub-graphs selected in a random manner (X = 1, 2, · · · , 9). This process is far more efficient than generating all possible features from the training set, which is computationally infeasible.
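The generator can be sketched as follows; combine and count_occurrences are hypothetical helpers (the former joins X one-edge sub-graphs into a candidate, the latter counts how often the candidate appears in the training set), and are not part of the original text.

import random

def random_subgraph_features(one_edge_subgraphs, train_set, X, n_features,
                             combine, count_occurrences):
    features = []
    while len(features) < n_features:
        candidate = combine(random.sample(one_edge_subgraphs, X))
        # Keep a candidate only if it occurs more than once in the training set.
        if count_occurrences(candidate, train_set) > 1:
            features.append(candidate)
    return features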
Among all the benchmark datasets, DBLP is a special case. The node space in DBLP is very sparse, because there are thousands of nodes with different labels (as shown in Table 3.3); the nodes in DBLP represent paper IDs and paper keywords. As a result, the most discriminative features in DBLP are one-node sub-graphs. So in our experiments, we also include sub-graphs containing only one node (0-edge features denote one-node sub-graphs).
3.3.3 Experimental Settings
In our experiments, we use classification accuracy to measure the performance of the algorithms. Suppose the true label set of the testing data is Lt, and the label set returned by our algorithm is Lr; the accuracy is defined as |Lt ∩ Lr|/|Lr|. We use 10-fold cross-validation to evaluate and compare algorithm performance. Each graph dataset is evenly partitioned into 10 parts: one part is used as the test set, and the other nine parts are used for sub-graph mining, sub-graph feature selection, and classifier generation. To reduce variability, the reported results are averaged over 10 runs of cross-validation. To train classifiers from graph data, we use Support Vector Machines (the Lib-SVM based MATLAB package3) and the Nearest Neighbor algorithm (NN) [27]. All experiments are conducted on machines with 4GB RAM and Intel Core™ i5 CPUs of 3.10 GHz.
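For reference, the evaluation protocol can be sketched in Python with scikit-learn; the experiments themselves used a Lib-SVM based MATLAB package, so this is only an illustrative equivalent over a precomputed feature matrix X with labels y, with sub-graph mining and feature selection assumed to have been performed on the training folds.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def cross_validate(X, y, clf, n_splits=10):
    accs = []
    for tr, te in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        model = clf.fit(X[tr], y[tr])
        accs.append((model.predict(X[te]) == y[te]).mean())
    return float(np.mean(accs))

# acc_svm = cross_validate(X, y, SVC())
# acc_nn  = cross_validate(X, y, KNeighborsClassifier(n_neighbors=1))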
3.3.4 Results and Analysis
In this section we report the graph classification accuracy with respect to four major fac-
tors: (1) the sub-graph feature sizes; (2) the size of the sub-graph feature set; (3) different
learning algorithms; and (4) different benchmark datasets. Figures 3-6, 3-7, and 3-8 report the algorithm performance with respect to different sizes of sub-graph features, using Support Vector Machines (Figures 3-6 and 3-7) and Nearest Neighbour (Figure 3-8) as the learning algorithms. In each figure, each row corresponds to one benchmark dataset, and the left, middle, and right panels correspond to the results of using 50, 1000, and 2000 sub-graph features, respectively. The x-axis of each panel denotes the size (i.e. the number of edges) of the selected sub-graph features, and the y-axis denotes the classification accuracy. Each curve corresponds to the classification accuracies using sub-graph features selected by one of the three approaches: random sub-graph features, frequency-based sub-graph features, and Information Gain based sub-graph features.
3 http://www.csie.ntu.edu.tw/~cjlin/libsvm
The results in Figures 3-6, 3-7, and 3-8 suggest the following major findings.
The discriminative power of sub-graphs varies with their size: The results show that the size of sub-graph features has a significant impact on their discriminative power. For the NCI data, features with 4 to 7 edges yield good classification performance. For the D&D protein data, features with 1 to 3 edges are better than others. For the DBLP data, one-node features are the most discriminative. The variation in the discriminative power of sub-graphs is mainly attributed to the domain characteristics of the graph data:
• NCI graphs have very similar structures to one another; benzene rings, for example, can be observed in many graphs. Each NCI dataset focuses directly on a special type of compound (or structure), so most NCI graphs are structurally very similar.
• Although D&D graphs have fewer than 44 node labels, these graphs have very complex structures, and the distribution of the data is more scattered than the NCI data, because there are up to thousands of enzymes in the biosphere. The large structural differences result in fewer graphs sharing the same sub-graphs of relatively large size. As a result, the most discriminative features of D&D graphs have 1 to 3 edges on average.
Figure 3-6: Classification accuracy on five NCI chemical compound datasets with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
Figure 3-7: Classification accuracy on the D&D protein dataset and DBLP citation dataset with respect to different sizes of sub-graph features (using Support Vector Machines: Lib-SVM).
• DBLP is the sparsest graph dataset. There are thousands of unique paper IDs and keywords in DBLP (each paper ID and keyword represents one node). Because there is a very large number of node labels, it is hard to find shared sub-graphs with more than 4 edges between any two DBLP graphs. The most frequent and most discriminative features we found are, in fact, one-node features, and sub-graph features with more than 3 edges appear only very few times. As a result, when using features with 3 or more edges to build classifiers, the classification accuracy is about 50%, which is equivalent to random prediction (as shown in Figure 3-7 (d), (e), and (f)).
Overall, the experiments suggest that the actual discriminative power of sub-graphs varies, depending on their size and the domain of the graph data. Sub-graphs with many edges are not necessarily good candidates for graph classification.
Figure 3-8: Classification accuracy on one NCI chemical compound dataset, D&D protein
dataset, and DBLP citation dataset with respect to different sizes of sub-graph features
(using Nearest Neighbours: NN).
Random sub-graphs have a reasonably good performance: Compared to sub-graphs discovered by expensive sub-graph mining and selection criteria, such as frequency-based and information gain based sub-graphs, the classification accuracy of random sub-graphs is reasonably good. Our experiments show that none of the feature selection approaches significantly outperforms all other methods (including random sub-graphs) on all seven benchmark graph datasets. Although the accuracy of the information gain based method is often higher than the others, in most cases the difference in accuracy between random feature selection and frequency- or IG-based feature selection is less than 5%.
Random sub-graphs perform well on graph classification tasks because current graph classification and clustering methods are typically based on a binary sub-graph representation combined with traditional vector-based methods. As a result, this so-called coarse-grained representation model cannot capture the real structure of graphs well and causes information loss; different sub-graph selection methods can hardly compensate for the loss caused by the binary representation mechanism. We will give a novel fine-grained representation model to address this issue in the next chapter.
Overall, our experiments suggest that random sub-graphs are useful for graph classification. The classifiers built from random sub-graphs are not significantly inferior to their peers.
The number of sub-graphs is important to ensure good performance: When comparing the three panels in each row (the left, middle, and right panels represent results for 50, 1000, and 2000 sub-graph features, respectively), it is clear that increasing the number of sub-graph features often results in improved accuracy. For example, when increasing the number of sub-graphs from 50 to 1000 and to 2000, the average classification accuracy of the classifiers trained using the three types of sub-graph features (Random, Frequent, and IG) for 9-edge features on the NCI410 graph dataset increases from 70% to 73% and 75%. Similar trends can be observed on the other benchmark datasets. For each specific selected sub-graph feature subset (e.g. a feature set with 1000 sub-graphs), the accuracies vary significantly depending on the size of the sub-graphs in the set. A larger improvement can be observed for sub-graph feature subsets containing more features. For example, when using 2000 sub-graphs, the difference between the maximum and minimum classification accuracies is significantly larger than when using 50 sub-graphs (as shown in Figure 3-6 (d) vs. (f)).
Overall, our experiments suggest that the number of sub-graphs is important for achieving good classification accuracy. High accuracy can be expected if a good number of sub-graphs is combined with good sub-graph sizes.
Increasing the number of sub-graphs reduces the difference between classifiers built from different sub-graphs: The results on all benchmark datasets show that as the number of sub-graphs increases, the overall prediction accuracy also increases (for example, Figure 3-6 (a), (b), and (c)). For a small number of sub-graphs, e.g. a feature set of size 50, the accuracies of the classifiers trained from different sub-graphs differ substantially, and random sub-graphs are inferior to the other approaches (as shown in Figure 3-6 (a), (d), (g), (j), and (m), where the differences in accuracy among the three methods range from 0% to 17.5%). When the size of the feature set increases to 2000, the accuracies of all three classifiers are very close to each other; in some cases, the three curves are exactly the same. This suggests that the performance of random sub-graphs steadily improves as an increasing number of sub-graphs is used to train the classifier. Indeed, when a large number of sub-graphs is used, the difference between feature sets is reduced. In an extreme case, if all sub-graphs are used to train classifiers, there is no difference between feature sets, which results in the same classifier performance.
Overall, our experiments suggest that the difference between classifiers trained using different sub-graphs is most noticeable when the number of sub-graphs is small. The difference shrinks as the number of sub-graphs increases, and may vanish when a relatively large number of sub-graphs, say more than 1000, is used. A direct conclusion from this observation is that, when using more than 1000 sub-graph features to train classifiers, there is not much difference between using random sub-graphs and using frequent sub-graphs mined and selected by other expensive methods.
Performance with respect to different classifiers shows a similar trend: In Figure 3-8, we report the results of using the Nearest Neighbour (NN) classifier on the benchmark data. Due to space limitations, we only select one NCI graph dataset. Compared with Support Vector Machines, the overall accuracies of NN are slightly lower than those of SVM. However, the overall trends, such as the increase or decrease of classification accuracy with respect to the size (or number) of sub-graphs, are very similar to the results from SVM. The observations we made for SVM are mostly valid for NN as well.
Overall, our experiments suggest that there is only a minor difference between classifiers trained by different learning methods, and the relationship between the classifier and the sub-graphs is largely the same across learning algorithms.
In this chapter, we found that current graph classification and clustering methods are typically based on a binary sub-graph representation combined with traditional vector-based methods. As a result, the number of sub-graphs and the sub-graph selection method have a large influence on subsequent graph classification or clustering performance. What is more, this so-called coarse-grained representation model cannot capture the real structure of graphs well and causes information loss. In the next chapter, we will therefore propose a new fine-grained representation model that enjoys a rigorous theoretical foundation ensuring minimal information loss for graph representation, and that also avoids the expensive sub-graph isomorphism validation process, which is proven to be NP-hard.
Chapter 4
Graph Hashing and Factorization for
Fast Graph Stream Classification
4.1 INTRODUCTION
As discussed in the last chapter, the number of sub-graphs and the sub-graph selection method have a large influence on subsequent graph classification or clustering performance. What is more, unlike traditional instance-feature representations, graph-structured data do not have explicit features to represent the graph, so an essential step in building a graph classification model
is to explore graph features to represent graph data in an instance-feature space for effective
learning [57]. The majority of existing graph classification models employ an occurrence-
based feature representation model, in which a set of sub-graph features is selected to
represent the graph data by using the occurrence of the sub-graph in the graph (either 0/1
occurrence or actual number of occurrences) as the feature value. In this chapter, we refer
to such an occurrence based representation model as coarse-grained representation model.
An example is shown in Fig. 4-1, where the substructures g1, g2, and g3 are used to represent
Graph Gi. While the coarse-grained feature representation model has been popularly used
for graph classification, it has a number of disadvantages, including:
Figure 4-1: Coarse-grained vs. fine-grained representation.

• Computation burden for isomorphism validation: Because sub-graphs are selected as features to represent the graph, the expensive sub-graph mining and isomorphism validation process cannot be avoided (sub-graph isomorphism is proven to be NP-hard) [19]. After the sub-graph features have been selected, the isomorphism validation process has to be applied again to map the graphs into the vector space.
• Severe and unbounded information loss: Because the coarse-grained representation model does not characterize the degree of closeness between a sub-graph pattern and the graph, it suffers severe information loss in representing the graph. More importantly, the information loss incurred in the representation is unbounded, so it is hard to characterize how closely the sub-graph features represent the underlying graph and what degree of information loss is incurred in the representation.
For a graph stream with continuously growing volumes (i.e. number of graphs) and
changing structures (e.g. new nodes may appear in the coming graphs), the disadvantage of
the existing coarse-grained graph feature model is even greater. This is because (1) stream
volumes continuously increase, which makes it computationally expensive to explore sub-
graph features, and (2) the changes in the graph stream (such as new structures) make
the sub-graph features incapable of representing graphs. For example, if new nodes or
structures appear in the graph stream, the previously discovered sub-graph patterns will not
be able to represent the graphs, because sub-graph patterns do not contain the information
of new nodes and new structures. Although it is always possible to rediscover new patterns
from the most recent data, the mining procedures are normally time-consuming, and any
features discovered may soon be outdated and incapable of representing future graphs.
Motivated by the above observations, we propose a fine-grained graph factorization
framework for efficient graph stream classification in this chapter. Being fine-grained, our
mapping framework relies on a set of discriminative frequent cliques, instead of important
sub-graph patterns, to represent the graph data. In addition, such a fine-grained representa-
tion ensures that the final instance-feature representation is sufficiently close to the original
graph, with theoretical guarantee.
To solve the problem, our main idea is to bypass the expensive sub-graph mining pro-
cess and use linear combinations of graph cliques to represent graphs. This is similar to the
Fast Fourier Transform (FFT) [55] in which a set of base functions (i.e. cliques) is used to
form input signals (i.e. graphs). To achieve this goal, our algorithm relies on two important
steps to extract fine-grained graph representation: (1) finding a set of frequent graph cliques
as the base; and (2) using graph factorization to calculate a linear combination of the graph
cliques to best represent a graph.
Compared to the traditional coarse-grained representation model, the advantage of fine-
grained representation is clear. As shown in Fig.4-1, the original graphs are represented in
more detail by the use of fine-grained feature mapping: Given sub-graph features g1, g2,
and g3 and graphs G1 and G2, coarse-grained representation represents Gi as instance Xi
with 0/1 feature values, whereas fine-grained representation represents Gi as an instance X′i with fractional values indicating the closeness of each sub-graph gj to the graph Gi. Even
though g3 does not appear in G1, there is still a feature value (0.09). This is because our
framework no longer relies on the occurrences of the sub-graphs but on the linear combina-
tions of a number of base patterns to represent graphs. A clear advantage of our method is
that fine-grained graph representation enjoys a rigorous theoretical foundation that ensures
minimal information loss for graph representation, and also avoids the expensive sub-graph
isomorphism validation process, which is proven to be NP-hard. After applying our fine-
grained feature mapping model to represent graphs, we can use any learning algorithm,
such as Nearest Neighbor or Support Vector Machines, for graph stream classification. Our
method offers a number of advantages including a fast feature mining process, more precise
graph representation, and better performance for graph stream classification. Experiments on two real-world network graph datasets demonstrate that our method outperforms state-of-the-art approaches in both classification accuracy and runtime efficiency.
The remainder of this chapter is organized as follows. Section 4.2 introduces the problem definition. We describe the proposed graph factorization method in Section 4.3. Section 4.4
describes the system framework FGSC for graph stream classification, with experimental
studies reported in Section 4.5.
4.2 PROBLEM DEFINITION
In this chapter, we propose a fast graph stream classification method using graph factor-
ization to address the aforementioned problems and achieve better performance in both
classification accuracy and runtime efficiency.
DEFINITION 11 (Connected Graph): A graph is represented as G = (V , E ,L), where
V = {v1, v2, · · · , vnv} is the set of vertices, E ⊆ V × V denotes a set of edges, and
L = {l1, l2, · · · , lnl} is the set of symbol labels for vertices and edges. A connected graph
is a graph such that there is a path between any pair of vertices.
DEFINITION 12 (Adjacency Matrix): An adjacency matrix of G (which contains Q
unique nodes) is denoted by A(G) ∈ RQ×Q, where each entry aij denotes the weight of
the edge from vertex vi to vertex vj; the diagonal entry aii denotes the label of node vi.
We use the adjacency matrix as the graph representation and assume for simplicity that each edge has a default weight of 1. A graph Gi is a labeled graph if a class label yi ∈ Y is assigned to Gi. For the binary classification problem, we have yi ∈ Y = {−1,+1}. A graph Gi is either labeled (denoted by G^L_i) or unlabeled (denoted by G^U_i).
DEFINITION 13 (Clique): A clique c = (V ′, E ′,L′) in a graph G = (V , E ,L) is a sub-
graph of G, where V ′ ⊆ V , E ′ ⊆ E and L′ ⊆ L such that every two vertices in c are
connected by an edge.
DEFINITION 14 (Graph Stream): A graph stream S contains an increasing number of graphs arriving in a streaming fashion, S = {G1, G2, · · · , Gi, · · · }. At any particular time point, we can collect a batch of graphs (B) for analysis. Formally, S = Σ_{j=1}^{∞} Bj, where Bj = {G_{j1}, G_{j2}, · · · , G_{jk_j}}.
The aim of graph stream classification is to learn a classification model from S, at any
particular time point, by scanning the graph stream only once, and predicting the class
labels of future arrival graphs in the graph stream with maximal accuracy.
4.3 GRAPH FACTORIZATION
In this section, we present the model used to describe the relationship between cliques
and graphs, and propose a graph factorization algorithm to generate fine-grained graph
representation.
4.3.1 Factorization Model
Given a graph encoded by a binary symmetric adjacency matrix, our goal is to use dis-
criminative feature-patterns to represent the graph precisely. In this chapter, we propose to
use cliques as feature-patterns to represent graphs. There are several advantages of using
cliques as feature-patterns: (1) Finding cliques does not require a complicated sub-graph
mining process (which is computationally expensive). Although finding maximal cliques is
an NP-complete problem (similar to the sub-graph mining problem), the special structure
allows us to develop a fast algorithm to find cliques (details are addressed in Section 4.4). (2) Because there is an edge between any two nodes vi and vj in a clique, there is no
need to describe the edges in the clique, so we can simply use all the nodes appearing in a
clique to form a vector to represent the clique.
We assume that the size of the feature set (which is a clique set), denoted by N , and the
size of graph node space, denoted by Q, are given. Our model is parameterized by wz and
mjz, where wz denotes the expected weight of clique cz, and mjz represents the occurrence
of vertex j in clique cz. The expected weight of the edge that lies between vertices i and
j in clique cz can then be re-written by mizwzmjz. If we sum up the cliques in a given
clique-pattern set C, where |C| = N , the expected weight of the edge that lies between
Figure 4-2: An example of graph factorization.

vertices i and j can be represented as

aij = Σ_{cz ∈ C} miz wz mjz        (4.1)
From Eq. (4.1), we have an expected adjacency matrix Â as follows, where M is a Q×N binary matrix and W is an N×N diagonal matrix:

Â(G) = MWM^T        (4.2)
In Eq. (4.2), each row of M corresponds to a node index, matching the node index in A(G). Each column of M records the occurrences of the nodes in a selected clique-pattern c ∈ C (Fig. 4-2 shows an example of M and A(G) for G1 in Fig. 4-1): Given graph G1 and three cliques c1, c2, and c3, the clique set matrix M records whether a node (row) appears in each of the three cliques (columns). Graph factorization uses M as a base to find a diagonal matrix W1 which best approximates the adjacency matrix with A(G1) ≈ Â(G1) = MW1M^T. The diagonal values in matrix W1 are used as feature
values to represent graph G1. More explicitly,

⎛ a11 · · · a1Q ⎞       ⎛ w1 · · · 0  ⎞
⎜  ⋮   ⋱   ⋮  ⎟  =  M ⎜  ⋮   ⋱   ⋮ ⎟ M^T        (4.3)
⎝ aQ1 · · · aQQ ⎠       ⎝ 0  · · · wN ⎠
where aij is an element of Â(G).
The decomposition of the adjacency matrix A(G) can be interpreted in terms of overlapping weight theory. More specifically, denote the ith column of M by M·i and let Θi = M·i M·i^T. Eq. (4.3) can then be rewritten as:

Â(G) = Σ_{i=1}^{N} wi Θi        (4.4)
We refer to Θi as a basic element. Because the ith column of M encodes a clique, Θi is a binary matrix in which each diagonal element indicates whether a vertex of the graph appears in this clique, and the off-diagonal entries represent the edge information of the clique. We use W ∈ R^{N×N} to denote the clique interaction matrix, which can also be considered the clique degree matrix. As a result, Eq. (4.4) means that a graph can be represented as a set of cliques with different weight values wi. For a given M, we can decompose a graph into a diagonal matrix W with diag(W) = [w1, w2, · · · , wN]^T, which can be used to represent the graph. This is equivalent to mapping graphs into a vector space. Unfortunately, in reality, M is a non-square matrix, and a Wk satisfying A(Gk) = MWkM^T exactly may not exist for a given graph Gk.
Thus, we want to find an optimal Wk that satisfies

A(Gk) ≈ MWkM^T        (4.5)
By using the squared loss to measure the relaxation error, we have

RE(A(Gk), W, M) = ||MWM^T − A(Gk)||² = ||Â(Gk) − A(Gk)||²        (4.6)

where RE(·) computes the relaxation error. The optimization problem of Eq. (4.5) can then be formulated as

Wk = argmin_W RE(A(Gk), W, M)        (4.7)
DEFINITION 15 (Fine-grained graph representation): Given a set of cliques c1, c2, . . . , cn and a set of graphs G1, G2, . . . , Gm, for each graph Gk, if there is a Wk satisfying Wk = argmin_W RE(A(Gk), W, M), we call Wk the fine-grained representation of Gk under c1, c2, . . . , cn.
4.3.2 Learning Algorithm
To solve Eq. (4.7), by combining Eqs. (4.2), (4.3), and (4.4), we have

Wk = argmin_W ||Σ_{i=1}^{N} wi · M·i M·i^T − A(Gk)||²        (4.8)

Wk = argmin_W ||Σ_{i=1}^{N} wi · Θi − A(Gk)||²        (4.9)

We define vec(·) as the matrix vectorization operator. Then

Wk = argmin_W ||Σ_{i=1}^{N} wi · vec(Θi) − vec(A(Gk))||²        (4.10)

Let P = [vec(Θ1), vec(Θ2), · · · , vec(ΘN)]. Then

Wk = argmin_W ||P · diag(W) − vec(A(Gk))||²        (4.11)

where diag(·) denotes the main diagonal of a matrix.
To solve Eq.(4.11), we employ the General Linear System Theorem proposed in [62].
THEOREM 1 (General Linear System Theorem): Let there exist a matrix B such that
By is a minimum 2-norm least-squares solution of a linear system Ax = y. Then it is
necessary and sufficient that B = A†, the Moore-Penrose generalized inverse of matrix A.
Remark. According to Theorem 1, we have the following properties for our proposed
graph mapping algorithm:
• The special solution x0 = A†y is one of the least-squares solutions of a general linear system: ||Ax0 − y||² = ||AA†y − y||² = min_x ||Ax − y||².
• The special solution x0 = A†y has the smallest 2-norm among all least-squares solutions of Ax = y: ||x0|| = ||A†y|| ≤ ||x||, ∀x ∈ {x : ||Ax − y||² ≤ ||Az − y||², ∀z ∈ R^n}.
• The minimum 2-norm least-squares solution of Ax = y is unique: x = A†y.
Based on Theorem 1, the smallest 2-norm least-squares solution of the above linear system (Eq. (4.11)) is

diag(Wk) = P† · vec(A(Gk))        (4.12)
where P† is the Moore-Penrose generalized inverse of matrix P.
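For illustration, Eq. (4.12) can be computed directly with NumPy; the following is a minimal sketch (not the thesis implementation) that builds P by vectorizing the basic elements Θi = M·i M·i^T and solves for diag(Wk) via the pseudo-inverse.

import numpy as np

def fine_grained_representation(A, M):
    # A: Q x Q adjacency matrix of one graph; M: Q x N clique set matrix.
    Q, N = M.shape
    # Column i of P is vec(Theta_i), the vectorized outer product of M's column i.
    P = np.column_stack([np.outer(M[:, i], M[:, i]).ravel() for i in range(N)])
    # Minimum 2-norm least-squares solution of P * diag(W) = vec(A(Gk)).
    return np.linalg.pinv(P) @ A.ravel()

In practice, P† can be computed once per batch and reused for every graph mapped against the same clique base M.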
By using the above factorization results, we can use diag(Wk) to map each graph Gk into a vector space; this is called the fine-grained representation of graph Gk. Because our factorization process ensures that the product of the Clique Set Matrix M and the diagonal matrix Wk best approximates the adjacency matrix of graph Gk, with A(Gk) ≈ Â(Gk) = MWkM^T, our method has the following three key advantages:
• Error Bound for Graph Representation: The difference between the original and
the approximated adjacency matrices, A(Gk) vs. Â(Gk), quantitatively measures
the information loss of graph representation. We can further adjust the Clique Set
Matrix M to ensure an information loss upper bound. This provides a clear theoret-
ical base for graph representation, which the existing coarse-grained representation
cannot achieve.
• Avoid Sub-graph Pattern Mining: The inputs of the graph factorization process
include adjacency matrix A(Gk) and the Clique Set Matrix M. In the next section
we will show that finding the clique set matrix is much more efficient than finding
frequent sub-graph patterns. In addition, our method does not need sub-graph match-
ing for new graphs to generate vectors as traditional methods do. Because the Clique
Set Matrix M has already been generated from the training data, we only need to
use matrix operations to map the test graphs into a vector space, which is more effi-
cient than sub-graph matching. As a result, our approach avoids expensive sub-graph
pattern mining for efficient graph stream classification.
• Better Representation for Graph Streams: In graph stream scenarios, the graph
data and structures may change continuously. As a result, it is very difficult to find
representative substructures to represent graph streams (even if the computational
cost is not an issue). The fine-grained representation uses linear combinations of
base cliques to approximate each graph. Because cliques are basic graph units, which
remain relatively stable in graph streams, our approach provides a better presentation
framework for continuously changing graph streams.
4.4 FAST GRAPH STREAM CLASSIFICATION
In this section, we first introduce the overall framework for graph stream classification and
then address details of the method in respective subsections.
4.4.1 Overall Framework
The proposed graph stream classification framework (FGSC), is illustrated in Fig. 4-3,
which contains three key modules:
Figure 4-3: The framework of FGSC for graph stream classification.
• Clique Mining: To tackle continuously increasing graph stream volumes and ex-
panding graph nodes and structures, we collect graph data into batches with each
batch containing a number of graphs. For each single graph Gk in a batch, we use a
node hashing strategy to map the unlimited node space into a fixed-size node hashing
set, so our method can effectively handle graph streams with expanding new nodes
and structures. We then decompose the compressed graph into a number of cliques
by using the maximal clique representation method.
• Clique Set Matrix and Graph Factorization: The above clique mining process
may find an increasing number of clique-patterns, whereas we can only select a fixed
number of cliques as a base for factorization. Therefore, we propose a clique hashing
strategy to find discriminative frequent clique-patterns to form the Clique Set Matrix
(M). It is worth noting that the matrix M, in the graph stream scenarios, is discov-
ered from current and previous batches so we can find a good set of base cliques for
graph factorization, as shown in the dotted box in Fig. 4-3. As a result, all graphs in
the current batch Bi can be cast into a vector space through matrix factorization.
• Graph Stream Classification: After mapping the graphs into the vector space, a
classifier is trained from the current batch Bi. The graphs in the next chunk Bi+1
are collected as a test set (which is denoted by Gtest in Fig. 4-3). To ensure that the
vector space-based classifier properly classifies the test graphs, each graph in Bi+1 is
processed using the factorization module and converted into the vector space by using the generated W for classification.
4.4.2 Graph Clique Mining
As shown in Fig. 4-3, instead of relying on expensive frequent sub-graph mining to capture
graph features, we propose to use clique-based graph factorization to represent the graph
data. Our first step is to select a set of cliques from each graph in the stream. In stream sce-
narios, the node set of the graph stream can be extremely large and unbounded. In addition, new nodes and structures may emerge in the graphs, so it is necessary to compress each incoming graph into a fixed node space. We use a random hash function to map the original unbounded node set onto a significantly compressed node set Ω of size Q. In other words, nodes in the compressed graph of Gk will be re-indexed by {1, · · · , Q}.
Since multiple nodes in Gk may be hashed onto the same index, the weight of one edge
in the compressed graph is set to the number of edges between the original nodes that are
hashed onto the compressed nodes. After hashing all nodes into a fixed-size space, each
graph Gk ∈ S is transferred into a compressed graph G∗k for clique mining. We define
this node hashing strategy using G∗k := NodeHash(Gk, Q). An example of graph clique
mining is illustrated in Fig. 4-4, where step (B) shows the compressed graph.
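To make the node hashing step concrete, the following Python sketch compresses a graph into the fixed node space Ω of size Q. It is a minimal illustration under stated assumptions, not the thesis implementation: the concrete hash function (an MD5-based one here), the networkx data structures, and the choice to drop self-loops caused by collisions are all assumptions.

```python
import hashlib
import networkx as nx

def node_hash(g: nx.Graph, Q: int) -> nx.Graph:
    """Map an unbounded node set onto {0, ..., Q-1}; the weight of a
    compressed edge counts the original edges collapsed onto it."""
    def h(node):
        # a stable random hash; the thesis does not fix a concrete function
        return int(hashlib.md5(str(node).encode()).hexdigest(), 16) % Q
    compressed = nx.Graph()
    compressed.add_nodes_from(range(Q))
    for u, v in g.edges():
        hu, hv = h(u), h(v)
        if hu == hv:
            continue  # assumption: self-loops created by collisions are dropped
        w = compressed.get_edge_data(hu, hv, default={"weight": 0})["weight"]
        compressed.add_edge(hu, hv, weight=w + 1)
    return compressed
```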
To discover cliques from compressed graphs, we propose the following fast approach.
According to the graph compression results, the weight of one edge in the compressed
graph denotes the number of edges that exist in the original graph. Our clique mining
process can take this information into consideration. More specifically, our maximal clique
set mining process includes:
• Finding Maximal Cliques: We first discover the maximal cliques CG∗k from the compressed graph G∗k (where no maximal cliques in CG∗k share any overlapping edge). Because all nodes in compressed graph G∗k are unique, finding maximal
[Figure: panels (A)-(E) show a graph G, its node-hashed compressed graph with edge weights, and the successive extraction of cliques c1-c4.]
Figure 4-4: An example of clique mining in a compressed graph.
cliques from G∗k is very efficient (we use the Bron-Kerbosch maximal clique find-
ing algorithm in our experiments). An example of the maximal clique c1 is shown in
Fig. 4-4 (C) upper panel;
• Clique Extraction: We set the weight of each edge in a discovered maximal clique c ∈ CG∗k equal to the smallest weight value of the edges in c. After that, we decrease the weights of the edges in the compressed graph G∗k by the weights of the corresponding edges in CG∗k and generate a new weighted graph G′k. An example of the updated graph G′k, after extracting clique c1, is shown in Fig. 4-4 (C) lower panel;
• Loop: If the weight of any edge in the updated graph G′k is 0, the edge has effectively been removed and no longer needs to be considered in the subsequent clique mining process. Looping between the above two steps discovers all cliques, until all edge weights in G′k are equal to 0. An example is shown in Fig. 4-4 (E) lower panel.
It is worth noting that because all nodes in the compressed graph G∗k are unique, graph G∗k can be completely restored (i.e. rebuilt by using all discovered cliques). This ensures that there is no information loss during graph decomposition. Algorithm 1 describes
the detailed process of clique mining, where MaximalCliques(G′k) finds the maximal
cliques from G′k. MinimalWeight(c) sets input clique c’s edge weight to the smallest
weight value of all edges in c. UpgradeWeights(G′k) updates the weight values of graph
G′k.
Algorithm 1 Clique Mining
Input: graph Gk
1: G∗k ← NodeHash(Gk, Q);
2: G′k ← G∗k;
3: Ckout ← φ;
4: while any edge in G′k has weight unequal to 0 do
5:   CG′k ← MaximalCliques(G′k);
6:   for all c ∈ CG′k do
7:     c ← MinimalWeight(c);
8:   end for
9:   Ckout ← Ckout ∪ CG′k;
10:  G′k ← UpgradeWeights(G′k);
11: end while
Output: Ckout;
An example of the graph clique mining is illustrated in Fig. 4-4, where nodes of the
same color are compressed into one node, with the weight values in the compressed graph
being changed accordingly. After the clique mining process, graph G is decomposed into
a clique set Cout = {c1, c2, c3, c4} for graph factorization.
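The clique mining loop of Algorithm 1 can be sketched as follows. This is a simplified variant under stated assumptions: it uses networkx's find_cliques (a Bron-Kerbosch implementation, matching the algorithm named in the experiments) and extracts one maximal clique per pass before re-enumerating, rather than processing a whole edge-disjoint clique set per iteration.

```python
import networkx as nx

def clique_mining(gk: nx.Graph):
    """Decompose a compressed weighted graph into weighted cliques; returns
    a list of (sorted node tuple, weight) pairs."""
    g, cliques_out = gk.copy(), []
    while g.number_of_edges() > 0:
        for clique in nx.find_cliques(g):        # Bron-Kerbosch enumeration
            if len(clique) < 2:
                continue                          # skip isolated nodes
            edges = [(u, v) for i, u in enumerate(clique) for v in clique[i + 1:]]
            w_min = min(g[u][v]["weight"] for u, v in edges)
            cliques_out.append((tuple(sorted(clique)), w_min))
            for u, v in edges:                    # subtract the clique's weight
                g[u][v]["weight"] -= w_min
                if g[u][v]["weight"] == 0:        # fully consumed edges vanish
                    g.remove_edge(u, v)
            break                                 # re-enumerate on the residual graph
    return cliques_out
```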
4.4.3 Clique Set Matrix and Graph Factorization
In this subsection, we introduce the detailed process for finding a set of discriminative
frequent cliques to form the Clique Set Matrix (M), and then use graph factorization for
feature mapping.
Discriminative Frequent Cliques
To find a good set of cliques to form the base M for graph factorization, we use correlations between each clique and the graph labels (i.e. similar to Information Gain [44] and the G-test score [66]) to find cliques that are highly correlated to the classification task. For fast discriminative clique finding, we use an “in-memory” Clique-class table Γ with (Q + |Y| + 2) columns to count the frequency of each clique with respect to each class label.
An example of an “in-memory” Clique-class table is shown in Fig. 4-5. Q is the size of the compressed node set Ω and |Y| is the number of graph class labels (the table in Fig. 4-5 contains four nodes and two classes, i.e. Q = 4 and |Y| = 2). A clique c and its statistical information are recorded in one row, where the first Q columns represent the occurrence of the nodes in c (in Fig. 4-5, clique c1 contains nodes 1, 3, and 4). The frequencies of the clique with respect to each class (Yi) are recorded in the next |Y| columns; in Fig. 4-5, c1 appears 85 times in one class and 20 times in the other. The next column records the output of the random hash function H, which helps to speed up the update of the statistical information of each clique-pattern; in Fig. 4-5, the 7th column records the hashing output of c1, which is 1531. The last column records the discrimination-test score (objective score) of each clique. For example, ck is a new clique whose hashing result is 1531; we only compare ck with c1 and c3, find that it is equal to c1, and then update the statistical information of c1.
For each graph Gk in the stream, we first use Algorithm 1 to collect its clique set Ckout .
After that, for each clique in Ckout , say ck,j , we then apply a random hash function h(·) to
the string of ordered edges in ck,j to generate an index Hk,j ∈ {1, 2, · · · , τ}, where τ is a
control parameter. Then we check the second-last column of Γ to validate whether there
is a value equal to Hk,j . In other words, we only compare the nodes of ck,j with the nodes
on Γpo, where Hk,j = Γpo, o = Q + |Y| + 1. If they are the same clique, we update the
information by adding 1 to the corresponding class Label column to which Gk belongs.
If ck,j does not appear in the current Γ, we add a new row at the end of Γ to record this
clique. The clique hashing strategy used here is to accelerate clique matching. Instead of
comparing all cliques, we only compare cliques with the same hashing results.
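A minimal sketch of the clique hashing and table update just described, assuming a Python dict keyed by the hash index stands in for the second-last column of Γ; the helper names and row layout are hypothetical.

```python
import hashlib

def clique_hash(clique, tau):
    """Hash the string of ordered edges of a clique into {0, ..., tau-1}."""
    nodes = sorted(clique)
    edges = ";".join(f"{u}-{v}" for i, u in enumerate(nodes) for v in nodes[i + 1:])
    return int(hashlib.md5(edges.encode()).hexdigest(), 16) % tau

def update_table(table, clique, label, tau):
    """table: dict mapping hash index -> rows; each row keeps the clique's
    node set and per-class frequency counts (the Clique-class table Gamma)."""
    h = clique_hash(clique, tau)
    key = frozenset(clique)
    for row in table.setdefault(h, []):
        if row["nodes"] == key:                 # only same-hash rows are compared
            row["counts"][label] = row["counts"].get(label, 0) + 1
            return
    table[h].append({"nodes": key, "counts": {label: 1}})
```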
Given the “in-memory” Clique-class table Γ, we can find discriminative frequent clique-
patterns by using two parameters: frequency threshold α and clique feature number m. We
sort all cliques whose frequencies are equal to or greater than α by using their objective
scores (e.g. IG) in descending order (the objective scores are listed in the last column of Γ).
We then use the top m cliques with the highest scores to generate Clique Set Matrix M.
Figure 4-5: An example of “in-memory” Clique-class table Γ.
Algorithm 2 Finding Discriminative Frequent Cliques
Input: Training graph set G, frequency threshold α and feature number m
1: for all Gk ∈ G do
2:   Ckout ← CliqueMining(Gk);
3:   Γ ← Γ ⊎ Ckout;
4: end for
5: Γ ← Select(Γ, α);
6: Sort(Γ);
7: M ← Top(Γ, m);
Output: M;
Algorithm 2 shows the procedure of discriminative frequent clique-pattern finding, in which CliqueMining(·) is Algorithm 1 and the operator “⊎” updates Γ by comparing each newly mined clique with the existing rows of Γ. Select(Γ, α) is a function to find cliques whose frequencies are equal to or greater than α. Sort(Γ) sorts cliques in descending order according to their objective scores, and Top(Γ, m) returns the top m rows of Γ.
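The objective score in the last column of Γ can be instantiated in several ways; the text mentions Information Gain [44] and the G-test score [66]. The sketch below computes a plain binary information gain from the per-class counts kept in Γ; the binary-class restriction and function names are illustrative assumptions. Cliques with frequency at least α would then be sorted by this score and the top m retained to form M.

```python
from math import log2

def entropy(p, n):
    """Binary entropy of a (positive, negative) count pair."""
    t = p + n
    if t == 0 or p == 0 or n == 0:
        return 0.0
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def information_gain(c_pos, c_neg, total_pos, total_neg):
    """IG of splitting the training graphs on a clique's presence,
    given the clique's per-class counts and the overall class counts."""
    total = total_pos + total_neg
    present = c_pos + c_neg
    absent_pos, absent_neg = total_pos - c_pos, total_neg - c_neg
    remainder = (present / total) * entropy(c_pos, c_neg) \
        + ((total - present) / total) * entropy(absent_pos, absent_neg)
    return entropy(total_pos, total_neg) - remainder
```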
Feature Mapping
Once the Clique Set Matrix M has been generated, we can convert each compressed graph G∗k into the same vector space by using the proposed factorization method. The factorization generates a diagonal matrix Wk under the objective
$$W_k = \arg\min_{W} \|M W M^{\top} - A(G^{*}_k)\|^2,$$
and diag(Wk) can then be used as the fine-grained representation of G∗k.
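Because M diag(w) M^T is linear in the entries of w, the factorization objective reduces to a linear least-squares problem. The sketch below solves it directly, assuming M stacks the m selected cliques as 0/1 node-indicator columns over the Q hashed nodes and that no constraints are imposed on w; the actual optimizer used for the thesis experiments may differ.

```python
import numpy as np

def feature_mapping(M: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Solve w = argmin_w ||M diag(w) M^T - A||_F^2 by least squares.
    M: Q x m clique indicator matrix; A: Q x Q compressed adjacency matrix."""
    Q, m = M.shape
    # each basis matrix m_j m_j^T contributes linearly in w_j
    B = np.stack([np.outer(M[:, j], M[:, j]).ravel() for j in range(m)], axis=1)
    w, *_ = np.linalg.lstsq(B, A.ravel(), rcond=None)
    return w
```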
4.4.4 Graph Stream Classification
Given the vector representation of each graph in a batch of graphs (Bi), we can use any
generic learning algorithm, such as Nearest Neighbors (NN), Naive Bayes or support vec-
tor machines (SVM), to train a classifier. The classifier is then used to predict class labels
for new graphs that have yet to arrive. As soon as the new graphs arrive and a new graph
batch is collected and labeled, we can easily update the classification model by training
a new classifier from the new batch. This procedure is detailed in Algorithm 3, where FeatureMapping(·) is used to obtain the fine-grained representation by using graph factor-
ization.
Algorithm 3 Classification
Input: Classifier ζ, GUtest and M
1: GU∗test ← NodeHash(GUtest, Q);
2: Vtest ← FeatureMapping(M, GU∗test);
3: ytest ← ζ(Vtest);
Output: ytest;
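The batch-wise train/test protocol can be sketched as below, assuming each chunk has already been mapped to vectors by FeatureMapping; the 1-NN learner is one of the classifiers used in the experiments, but the exact training code and data layout are assumptions of this sketch.

```python
from sklearn.neighbors import KNeighborsClassifier

def stream_classify(vector_batches):
    """vector_batches: iterable of (X, y) chunks of fine-grained vectors."""
    clf, accuracies = None, []
    for X, y in vector_batches:
        if clf is not None:
            accuracies.append(clf.score(X, y))   # B_{i+1} serves as the test set
        clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # retrain on B_i
    return accuracies
```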
4.5 EXPERIMENTS
In this section we report the performance of the proposed factorization-based fast graph
stream classification method (FGSC) on real-world graph data. Our experiments mainly
focus on the study of efficiency and effectiveness.
4.5.1 Benchmark Data
We validate the performance of the algorithm on the following two real-world graph streams.
• DBLP Stream 1: DBLP is a computer science bibliography. Each record in DBLP
denotes one publication, and the record is associated with a number of attributes
1http://arnetminer.org/citation
such as the paper ID, authors, year, title, abstract, and reference ID [72]. In our ex-
periments, we build a graph stream with binary classification tasks by using papers
published in a list of selected conferences (as shown in Table 4.1). The classifica-
tion task is to predict whether a paper belongs to the field of DBDM (database and
data mining) or CVPR (computer vision and pattern recognition), by using structural
information, such as the title and references, of each paper. In our experiments, we
represent each paper in DBLP as a graph, with each node denoting a paper ID or
a keyword and each edge representing the citation relationship between papers or
keywords appearing in the paper’s title. We make the following designations: (1)
each paper ID is a node; (2) if a paper A cites another paper B, there is an edge
between A and B (we use undirected edge); (3) each keyword in the title is also a
node; (4) each paper ID node is connected to the keyword nodes of the paper; and
(5) the keyword nodes of each paper are fully connected with one another. An exam-
ple of DBLP graph data is shown in Fig.4-6: The rectangles are paper ID nodes and
diamonds are keyword nodes. The paper ID17890 cites (connects) paper ID17883
and ID18068, and ID17890 has the keywords Patch, Motion, and Invariance in its
title. Paper ID18068 has the keywords Edge and Detection, and paper ID17883’s title
includes the keywords Vision and Active. The title keywords in each paper are linked
with one another. In the experiments, all papers are arranged in chronological order
to form a graph stream. It is worth noting that the paper title keyword space is very
large (more than one thousand), and new title keywords may appear at any time, so
the node space is constantly changing and expanding, which represents the inherent
challenge of the graph stream.
Table 4.1: DBLP dataset used in experiments.
Figure 4-6: Graph representation for a paper (ID17890) in DBLP.
• IBM Sensor Stream 2: This stream contains information about local traffic on a
sensor network which issues a set of intrusion attack types. Each graph consti-
tutes a local pattern of traffic in the sensor network where nodes denote the IP-
addresses and edges correspond to local traffic patterns (local traffic flows between
IP-addresses) [1]. The selected data contains a stream of intrusion graphs from June
1, 2007 to June 3, 2007. Each graph is associated with a particular intrusion type and
there are over 300 different local patterns in the dataset. We only choose 20 typical
intrusion types (10 types of “BWEST” pattern and 10 types of “SNMP” pattern) that
are considered as class labels in our experiments. Our goal is to classify a traffic
flow pattern as one of the 20 intrusion types. More details about this data can be
found in [1]. Compared to the DBLP stream, each graph in the IBM stream has far
fewer nodes and the node labels have fewer overlaps between different graphs, which
makes the problem particularly difficult for accurate classification. We intentionally
choose a graph structure that is very different from the DBLP stream, therefore the
results on IBM and DBLP streams can help to evaluate the algorithm’s performance
on different types of graphs.
2http://www.charuaggarwal.net/sens1/gstream.txt
4.5.2 Experimental Settings
Baseline Methods: To evaluate the efficiency and effectiveness of our graph stream classi-
fication framework, we compare the proposed FGSC (FG+Stream) with two other meth-
ods.
• Coarse-grained representation-based method (CG+Stream): Compared with our fine-grained graph representation, this method uses a traditional coarse-grained feature representation model, which uses the occurrence of discriminative frequent cliques to represent graph data (the selected cliques and the classification methods are the same as in our proposed method, except that CG+Stream uses 0/1 occurrence as the feature value). This method is similar to the clique hashing-based graph stream classification method in [15].
• 2-D edge hash-compressed stream classifier (EH+Stream): This method employs
a 2-D random edge hashing scheme to construct an “in-memory” summary for the
sequentially presented graphs [1]. The first random-hash scheme is used to reduce
the size of the edge set. The second min-hash scheme is used to dynamically update a
number of hash-codes, which summarize the frequent patterns of co-occurring
edges observed so far in the graph stream. A simple heuristic is used to select a set of
most discriminative frequent patterns to build a rule-based classifier for classification.
In the experiments, we use 10-fold cross-validation to evaluate the performance of the
graph stream classification. For fair comparison, all three methods (FG+Stream, CG+Stream,
and EH+Stream) use different chunk sizes and feature numbers to predict the class labels
of graphs in the stream. For EH+Stream, the Nearest Neighbor (NN) classifier is used as the classifier in the original method, so we cannot apply other learning algorithms to this method.
For the other two methods (FG+Stream and CG+Stream), we use three different classifiers,
including Nearest Neighbor (NN), Support Vector Machines (SMO), and Naive Bayes, to
form an ensemble and predict the class labels of graphs in a future chunk. The default
parameter settings are as follows: Batch size |D| = {600, 800, 1000} (for DBLP) and
{300, 400, 500} (for IBM); feature size |m| = {62, 142, 307} with frequency threshold α = {2%, 1%, 0.5%} (for DBLP), and |m| = {43, 75, 148} with α = {0.3%, 0.1%, 0.06%} (for IBM), respectively. All experimental results are collected from a Linux cluster computing node with an Intel(R) Xeon(R) @3.33GHz CPU and 4GB fixed memory size.
4.5.3 Experimental Results
In this section, we report the algorithm performance with respect to the classification accu-
racy and classification efficiency of the graph stream.
Graph Stream Classification Accuracy
Results on different chunk sizes |D|: In Figs. 4-7 and 4-10, we report the algorithm
performance by using different numbers of graphs in each chunk |D| (varying from 1000,
800, to 600 for the DBLP stream and from 500, 400, to 300 for the IBM stream).
Overall, the results in Fig. 4-7 show that the accuracies of FG+Stream on DBLP stream
are better than that of CG+Stream and EH+Stream. The EH+Stream method has the worst
performance, and its accuracy is slightly over 50%, which is nearly equivalent to random
predictions. The accuracy of FG+Stream, on all three experiments, is consistently better
than that of CG+Stream and EH+Stream, even though the accuracy fluctuates to a large
extent on some chunks (such as Batch 14 in Fig. 4-7 (b)). Recall that the only difference
between FG+Stream and CG+Stream is that the former uses fine-grained feature representa-
tions whereas the latter uses 0/1 occurrence (both methods have the same set of discrimina-
tive cliques). As a result, the better performance of FG+Stream, compared to CG+Stream,
can be attributed to the fact that fine-grained representation provides more accurate in-
formation to describe each single graph and ensures minimal information loss for graph
representation. As we have explained in Section 4 (Graph Factorization), FG+Stream uses
a linear combination of cliques to represent each graph, so even if some features do not
appear in a specific graph, they may still have a value to indicate each clique’s correlation
to the graph. By contrast, CG+Stream can only assign zero value to indicate that the fea-
ture does not appear in the graph. As a result, the feature information given by fine-grained
representation ensures good accuracy for graph classification.
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-7: Accuracy w.r.t different chunk sizes on DBLP Stream. The number of features in each chunk is 142. The batch sizes vary as: (a) 1000; (b) 800; (c) 600.
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-8: Accuracy w.r.t different number of features on DBLP Stream with each chunk containing 1000 graphs. The number of features selected in each chunk is: (a) 307; (b) 142; (c) 62.
[Figure: three line plots of Accuracy (%) vs. Batch ID; panel (a) compares FG+Stream, CG+Stream, and EH+Stream, while panels (b) and (c) compare FG+Stream and CG+Stream.]
Figure 4-9: Accuracy w.r.t different classification methods on DBLP Stream with each chunk containing 1000 graphs, and the number of features in each chunk is 142. The classification methods selected here are: (a) NN; (b) SMO; (c) NaiveBayes.
The results in Fig. 4-10 show that, for the IBM stream, the accuracies of FG+Stream
and CG+Stream are almost identical across the whole stream. This is mainly because each graph in the IBM stream is composed of only a few nodes (most graphs contain fewer than four nodes). As a result, the features mined from the training set contain only a few
nodes (e.g. one or two nodes) and the overlap between any two features (i.e. the same nodes
and edges shared by two features) is less frequent. The feature weight values calculated
by our graph factorization method thus degrade to 0/1 occurrence of the features. As a
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-10: Accuracy w.r.t different chunk sizes on IBM Stream. The number of features in each chunk is 75. The batch sizes vary from (a) 500; (b) 400; to (c) 300.
[Figure: three line plots of Accuracy (%) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-11: Accuracy w.r.t different number of features on IBM Stream with each chunk containing 400 graphs. The number of features selected in each chunk is: (a) 148; (b) 75; (c) 43.
[Figure: three line plots of Accuracy (%) vs. Batch ID; panel (a) compares FG+Stream, CG+Stream, and EH+Stream, while panels (b) and (c) compare FG+Stream and CG+Stream.]
Figure 4-12: Accuracy w.r.t different classification methods on IBM Stream with each chunk containing 400 graphs, and the number of features in each chunk is 75. The classification methods selected include: (a) NN; (b) SMO; (c) NaiveBayes.
result, the vector representation obtained by using FG+Stream is almost the same as that of
CG+Stream, which results in the same classification accuracy. The accuracy of EH+Stream
is significantly worse than the other two methods. As shown in Fig. 4-10 (a), no graph is
correctly classified for chunks 14-15, 17-19, 21, and 25-27. This demonstrates that for
graphs with very few nodes and edges, using edge hashing, as EH+Stream does, cannot
obtain good classification accuracy.
Results on different numbers of features |m|: In Figs. 4-8 and 4-11, we report the algo-
rithm’s performance with respect to the different number of features in each chunk.
As expected, FG+Stream has the best performance of the three algorithms for the DBLP
stream. In addition, the results in Fig. 4-8 show that increasing the number of features in the
DBLP stream actually increases the accuracy of the CG+Stream method, while the accu-
racy of the FG+Stream method remains relatively stable. This is because CG+Stream directly uses the selected clique features to represent a graph, whereas FG+Stream uses combinations of
clique features for representation. As a result, FG+Stream has less dependence on the
number of features selected for graph representation.
Interestingly, even though the accuracy of all three methods fluctuates to a large extent across the whole graph stream, the results in Fig. 4-11 show that the accuracy of each method is almost unaffected by changes in the feature size. This may suggest that for the IBM
stream, a small number of features (such as special IP-addresses or IP-address connections)
may already have enough discriminative power for classification. Increasing the number of
features may only result in redundant features in the learning process, which is not only
time-consuming for learning but also not effective in improving classification accuracy.
Results on different classifiers: For the EH+Stream method, the authors use a nearest
neighbor classifier that scans the training data and re-organizes edges into graphs, and then
finds the closest graph for the test instance [1]. Essentially, this is a Nearest Neighbor classifier, so we cannot apply other classifiers to EH+Stream to validate its performance. We validate the performance of FG+Stream and CG+Stream by using three ma-
jor classifiers, including Nearest Neighbor (NN), Support Vector Machines (SVM), and
Naive Bayes. The results in Fig. 4-9 show that, for DBLP data, Naive Bayes has the worst
performance compared to NN and SVM, and when using Naive Bayes, the accuracies of
FG+Stream and CG+Stream fluctuate significantly across the whole stream. Meanwhile,
NN and SVM have comparable performance in most cases. For the IBM stream, the clas-
sification accuracies of all three classifiers are very close to one another.
[Figure: two line plots of accumulated runtime (10^5 ms) vs. Batch ID for FG+Stream, CG+Stream, and EH+Stream.]
Figure 4-13: System accumulated runtime by using the NN classifier, where |D| = 1000, |m| = 142 (for DBLP) and |D| = 400, |m| = 75 (for IBM), respectively. (a) Results on DBLP stream; (b) Results on IBM stream.
Graph Stream Classification Efficiency
In Fig. 4-13, we report the system runtime performance (efficiency) for graph stream pro-
cessing. The results show that FG+Stream always requires less time than CG+Stream. This
is mainly because FG+Stream avoids the expensive sub-graph isomorphism validation pro-
cess for vector generation, which is very time-consuming. EH+Stream performs better
than the other two methods on the IBM stream and has a very fast runtime for the first few
chunks. This is because EH+Stream always keeps a feature table of fixed size. The features
in the table will be replaced once a better feature is detected. The size of graphs in the IBM stream is very small (only several nodes) compared to the DBLP stream, so stable features are found quickly, which leads to less feature evaluation and replacement during updating.
Overall, the results show that FG+Stream scales linearly with the number of chunks. The
runtime of FG+Stream is acceptable compared to its high classification accuracy. As a
result, FG+Stream is a practical solution for handling real-world high speed graph streams.
Chapter 5
Super-graph based Classification
5.1 INTRODUCTION
Many applications, such as social networks and citation networks, commonly use graph
structure to represent data entries (i.e. nodes) and their structural relationships (i.e. edges).
When using graphs to represent objects, all existing frameworks rely on two approaches to describe node content: (1) node as a single attribute: each node has only one attribute
(single-attribute node). A clear drawback of this representation is that a single attribute
cannot precisely describe the node content [40]. This representation is commonly referred
to as a single-attribute graph (Fig.5-1 (A)). (2) node as a set of attributes: use a set of
independent attributes to describe the node content (Fig.5-1 (B)). This representation is
commonly referred to as an attributed graph [8, 14, 80].
Both single-attribute and attributed graphs emphasize the dependency structures between objects (i.e. nodes) but inherently overlook the internal structures inside each node, where the attributes/properties used to describe the node content may be subject to dependency structures as well. For example, in a ci-
tation network each node represents one paper and edges denote citation relationships. It
is insufficient to use one or multiple independent attributes to describe detailed informa-
tion of a paper. Instead, we can represent the content of each paper as a graph with nodes
denoting keywords and edges representing contextual correlations between keywords (e.g.
Figure 5-1: (A): a single-attribute graph; (B): an attributed graph; and (C): a super-graph.
co-occurrence of keywords in different sentences or paragraphs). As a result, each paper
and all the references it cites can form a super-graph, with each edge between papers denoting a citation relationship. Similarly, in a protein interaction network each node
represents one protein and edges denote interaction relationships. A protein may undergo
reversible structural changes in performing its biological function and these changes are
likely not shown in a sequence representation [85]. So using the sequence alone is insufficient to describe a protein. As shown in Fig. 5-2, each protein is represented as a super-node by using its molecular structure, and proteins form a super-graph with edges between proteins denoting their interaction relationships. Two interacting proteins may share functional or structural similarity. In this chapter, we refer to this type of graph, where the content of the
node can be represented as a graph, as a “super-graph”. Likewise, we refer to the node
whose content is represented as a graph, as a “super-node”.
Given a set of super-graphs, the classification problem is to identify which category a super-graph belongs to, based on the known information, such as labeling information, super-node structure, and node content. To build learning models for super-graph based classification, the main challenge is to properly calculate the distance between two super-graphs:
• Similarity between two super-nodes: Because each super-node is a graph, the over-
lapped/intersected graph structure between two super-nodes reveals the similarity
between two super-nodes, as well as the relationship between two super-graphs. The traditional hard node-matching mechanism is unsuitable for super-graphs, which require soft node-matching.
Figure 5-2: A conceptual view of a protein interaction network using super-graph repre-
sentation.
• Similarity between two super-graphs: The complex structure of super-graphs re-
quires that the similarity measure considers not only the structure similarity, but also
the super-node similarity between two super-graphs. This cannot be achieved with-
out combining node matching and graph matching as a whole to assess similarity
between super-graphs.
The above challenges motivate the proposed Weighted Random Walk Kernel (WRWK)
for super-graphs. In this chapter, we generate a new product graph from two super-graphs
and then use weighted random walks on the product graph to calculate similarity between
super-graphs. A weighted random walk denotes a walk starting from a random weighted
node and following succeeding weighted nodes and edges in a random manner. The weight
of the node in the product graph denotes the similarity of two super-nodes. Given a set
of labeled super-graphs, we can use a weighted product graph to establish walk-based
relationship between two super-graphs and calculate their similarities. After that, we can
obtain the kernel matrix for super-graph classification. Experiments on two real-world
super-graph classification tasks demonstrate that the proposed method significantly outper-
forms baseline approaches.
The remainder of the chapter is structured as follows. Section 5.2 formulates the prob-
lem and defines important notations. The weighted random walk kernel is proposed in
Section 5.4. The super-graph classification algorithm is given in Section 5.5, followed by
theoretical analysis and theorems in Section 5.6. Experiments are reported in Section 5.7.
5.2 PROBLEM DEFINITION
DEFINITION 16 (Single-attribute Graph) A single-attribute graph is represented as g = (V, E, Att, f), where V = {v1, v2, · · · , vn} is a finite set of vertices, E ⊆ V × V denotes a finite set of edges, and f : V → Att is an injective function from the vertex set V to the attribute set Att = {a1, a2, · · · , am}. An attribute ai is a single symbol, e.g. a keyword, denoting node content. $\overrightarrow{Att}(g) = [a_1(g), \cdots, a_m(g)]$ denotes the attribute vector, where $a_i(g) = \sum_{j=1}^{n} h(f(u_j), a_i)$, with h(f(uj), ai) = 1 if f(uj) = ai, and h(f(uj), ai) = 0 otherwise.
DEFINITION 17 (Super-graph and Super-node) A super-graph is represented as G = (V, E, G, F), where V = {V1, V2, · · · , VN} is a finite set of graph-structured nodes, E ⊆ V × V denotes a finite set of edges, and F : V → G is an injective function from V to G, where G = {g1, g2, · · · , gM} is the set of single-attribute graphs. A node in the super-graph, which is represented by a single-attribute graph, is called a Super-node.
Formally, a single-attribute graph g = (V, E, Att, f) can be uniquely described by its attribute and adjacency matrices. The attribute matrix ϕ is defined by ϕri = 1 ⇔ ai ∈ f(vr), otherwise ϕri = 0. The adjacency matrix θ is defined by θij = 1 ⇔ (vi, vj) ∈ E, otherwise θij = 0. Similarly, the adjacency matrix Θ of a super-graph G = (V, E, G, F) is defined by Θij = 1 ⇔ (Vi, Vj) ∈ E, otherwise Θij = 0. However, because of the complicated structure of a super-graph, we cannot use a single matrix to describe its super-node information. To simplify later calculations, a super attribute matrix Φ ∈ R^{N×S} is defined for super-graph G, where N = |V| and S = |Att_{g1} ∪ Att_{g2} ∪ · · · ∪ Att_{gM}|, with Φri = 1 ⇔ ai ∈ Att_{gr}, otherwise Φri = 0.
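For illustration, the attribute matrix ϕ (and, analogously, the super attribute matrix Φ built from the attribute sets Att_{gr}) can be constructed as below; the helper name and data layout are assumptions of this sketch.

```python
import numpy as np

def attribute_matrix(node_attrs, vocab):
    """Build phi with phi[r, i] = 1 iff attribute vocab[i] labels node r;
    node_attrs: list of attribute sets, one per node."""
    index = {a: i for i, a in enumerate(vocab)}
    phi = np.zeros((len(node_attrs), len(vocab)))
    for r, attrs in enumerate(node_attrs):
        for a in attrs:
            phi[r, index[a]] = 1.0
    return phi

# e.g. attribute_matrix([{"patch"}, {"motion", "patch"}], ["patch", "motion"])
```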
A super-graph Gi is a labeled graph if a class label L(Gi) ∈ Y = {y1, y2, · · · , yx} is
assigned to Gi. A super-graph Gi is either labeled ($G^L_i$) or unlabeled ($G^U_i$).
DEFINITION 18 (Super-graph Classification) Given a set of labeled super-graphs $D^L = \{G^L_1, G^L_2, \cdots\}$, the goal of super-graph classification is to learn a discriminative model from $D^L$ to predict previously unseen super-graphs $D^U = \{G^U_1, G^U_2, \cdots\}$ with maximum accuracy.
5.3 OVERALL FRAMEWORK
In Fig. 5-3, we list the major steps of our super-graph based classification framework:
WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1, g2, g3), where
g1⊗2 and g1⊗3 are the single-attribute product graphs for (g1, g2) and (g1, g3), respectively,
and G⊗ is the super product graph for (G,G′). (θ1, ϕ1) and (Θ,Φ) are the adjacency and
attribute matrices for g1 and G, respectively. w1⊗2 and w are the weight vectors for g1⊗2 and
G⊗, respectively. Each element of w is equal to the WRWK of two super-nodes (as dotted
arrows show). A random walk on single-attribute product graph, say g1⊗3, is equivalent
to performing simultaneous random walks on graphs g1 and g3, respectively. Assume that
v1(a1) means node v1 with attribute a1 in g1. The walk v1(a1) → v3(a5) → v4(a1) in g1 can
match the walk v1′′(a1) → v3′′(a5) → v2′′(a1) in g3. The corresponding walk on the single-
attribute product graph g1⊗3 is: v11′′(a1) → v33′′(a5) → v42′′(a1). k(g1, g2) and k(g1, g3)
are the kernels. k(g1, g2) < k(g1, g3) means g3 is more similar to g1 than g2. Likewise,
K(G,G′) is the WRWK of G and G′. The essential challenge is twofold: (1) Super-node
similarity measurement; and (2) Super-graph similarity measurement.
5.4 WEIGHTED RANDOM WALK KERNEL
Defining a kernel on a space X paves the way for using kernel methods [60] for classifica-
tion, regression, and clustering. To define a kernel function for super-graphs, we employ a
random walk kernel principle [25].
Our Weighted Random Walk Kernel (WRWK) is based on a simple idea: Given a pair
of super-graphs (G1 and G2), we can use them to build a product graph, where each node
of the product graph contains attributes shared by super-nodes in G1 and G2. For each node
Figure 5-3: WRWK on the super-graphs (G, G′) and the single-attribute graphs (g1, g2,g3).
in the product graph, we can assign a weight value (which is based on the similarity of the
two super-nodes generating the current node). Because a weighted product node reflects content that appears in both G1 and G2, we can perform random walks
through these nodes to measure the similarity by counting the number of matching walks
(the walks through nodes containing intersected attribute sets) and combining weights of
the nodes. The larger the number, the more similar the two super-graphs are. Adding weight values to the random walk is meaningful: it provides a way to take the graph matching of super-nodes into consideration when calculating the similarity between super-graphs. We
consider the similarity between any two super-nodes as the weight value instead of just us-
ing 1/0 hard-matching. The node similarities between different super-nodes represent the
relationship between two graphs in a more precise way, which, in turn, helps improve the
classification accuracy. Accordingly, we divide our kernel design into two parts based on
the similarity of super-nodes and super-graphs.
Graph kernels provide a natural tool to answer the above questions. Among others, one of the most powerful graph kernels is based on random walks, and it has been successfully applied to many real-world applications. There are several advantages to using a random walk kernel. Firstly, the matching paths between two graphs consider the node similarities, as only paths with the same sequence of attributed nodes can be matched. Secondly, the random walk kernel also considers the structure similarities, as it traverses all matching paths between two graphs with no tottering. These paths contain all structure information, and previous work has shown that they achieve better accuracy on classification and clustering benchmarks.
5.4.1 Kernel on Single-attribute Graphs
We first introduce the weighted random walk kernel on single-attribute graphs.
DEFINITION 19 (Single-attribute Product Graph) Given two single-attribute graphs
g1 = (V1, E1, Att1, f1) and g2 = (V2, E2, Att2, f2), their single-attribute product graph
is denoted by g1⊗2 = (V ∗, E∗, Att∗, f ∗) (g⊗ for short), where
• V ∗ = {v|v =< v1, v2 >, v1 ∈ V1, v2 ∈ V2};
• E∗ = {e | e = (u′, v′), u′ ∈ V∗, v′ ∈ V∗, f∗(u′) ≠ φ, f∗(v′) ≠ φ, u′ = <u1, u2>, v′ = <v1, v2>, (u1, v1) ∈ E1, (u2, v2) ∈ E2};
• Att∗ = Att1 ∪ Att2;
• f ∗ = {f ∗(v)|f ∗(v) = f(v1) ∩ f(v2), v =< v1, v2 >, v1 ∈ V1, v2 ∈ V2}.
In other words, g⊗ is a single-attribute graph where a vertex v is the intersection be-
tween a pair of nodes in g1 and g2. There is an edge between a pair of vertices in g⊗, if and
only if an edge exists in corresponding vertices in g1 and g2, respectively. An example is
shown in Fig. 5-3 (g1⊗2). In the following, we show that an inherent property of the product graph is that performing a weighted random walk on the product graph is equivalent to performing simultaneous random walks on g1 and g2. The single-attribute product graph thus provides an effective way to count matching walks between two graphs, combining the weight values on nodes, without expensive graph matching.
To generate g⊗’s adjacency matrix θ⊗ from g1 and g2 by using matrix operations, we
define the Attributed Product as follows:
DEFINITION 20 (Attributed Product) Given matrices $B \in \mathbb{R}^{n\times n}$, $C \in \mathbb{R}^{m\times m}$ and $H \in \mathbb{R}^{n'\times m'}$, the attributed product $B \odot C \in \mathbb{R}^{nm\times nm}$ and the column-stacking operator $vec(H) \in \mathbb{R}^{n'm'}$ are defined as
$$B \odot C = [vec(B_{*1}C_{1*})\ vec(B_{*1}C_{2*})\ \cdots\ vec(B_{*n}C_{m*})],$$
$$vec(H) = [H_{*1}^{\top}\ H_{*2}^{\top}\ \cdots\ H_{*m'}^{\top}]^{\top},$$
where $H_{*i}$ and $H_{j*}$ denote the $i$th column and $j$th row of $H$, respectively.
Based on Def. 20, the adjacency matrix θ⊗ of the single-attribute product graph g⊗ can be directly derived from g1(θ1, ϕ1) and g2(θ2, ϕ2) as follows:
$$\theta_{\otimes} = (\theta_1 \odot \theta_2) \odot vec(\phi_1\phi_2^{\top}) \qquad (5.1)$$
where
$$B \odot \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} B_{11} \wedge x_1 & \cdots & B_{1n} \wedge x_1 \\ \vdots & \ddots & \vdots \\ B_{n1} \wedge x_n & \cdots & B_{nn} \wedge x_n \end{bmatrix},$$
and “∧” is a conjunction operation (a ∧ b = 1 iff a = 1 and b = 1).
As a result of the above process, we can use matrix operations to count random walks. More specifically, for the adjacency matrix θx of a graph gx, each element $[\theta_x^z]_{ij}$ of its z-th power provides the number of walks of length z from vi to vj in gx.
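For example, for a triangle graph the third power of the adjacency matrix counts the length-3 walks:

```python
import numpy as np

theta = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])               # adjacency matrix of a triangle
walks = np.linalg.matrix_power(theta, 3)   # walks[i, j] = number of length-3
                                           # walks from v_i to v_j: 2 on the
print(walks)                               # diagonal, 3 off the diagonal
```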
According to Eq. (5.1), performing a random walk on the single-attribute product graph g⊗ is equivalent to performing simultaneous random walks on the graphs g1 and g2. After g⊗ is generated, the Attributed Random Walk Kernel (ARWK), which computes the similarity between g1 and g2, can be defined with a sequence of weights δ = δ0, δ1, · · · (δi ∈ R and δi ≥ 0 for all i ∈ N):
$$K(g_1, g_2) = \rho \sum_{i,j=1}^{n_1 n_2}\left[\sum_{z=1}^{\infty} \delta_z\, \theta_{\otimes}^{z}\right]_{ij} \qquad (5.2)$$
where n1 and n2 are the node sizes of g1 and g2, respectively, and ρ is a control parameter used to make the kernel function converge. As a result, kernel values are upper bounded (the proof is given later).
To compute the ARWK for single-attribute graphs, as defined in Eq. (5.2), a diagonalization decomposition method [25] can be used. Because θ⊗ is a symmetric matrix, the diagonalization decomposition of θ⊗ exists: $\theta_{\otimes} = T H T^{-1}$, where the columns of T are its eigenvectors and H is a diagonal matrix of the corresponding eigenvalues. The kernel defined in Eq. (5.2) can then be rewritten as:
$$K(g_1, g_2) = \rho \sum_{i,j=1}^{n_1 n_2}\left[\sum_{z=1}^{\infty} \delta_z\, (T H^z T^{-1})\right]_{ij} = \rho \sum_{i,j=1}^{n_1 n_2}\left[T \left(\sum_{z=1}^{\infty} \delta_z H^z\right) T^{-1}\right]_{ij} \qquad (5.3)$$
By setting δz = λ^z/z! in Eq. (5.3) and using $e^x = \sum_{z=0}^{\infty} x^z/z!$, we have
$$K(g_1, g_2) = \rho \sum_{i,j=1}^{n_1 n_2}\left[T (e^{\lambda H} - I)\, T^{-1}\right]_{ij} \qquad (5.4)$$
where I is an identity matrix of the same size as θ⊗. The diagonalization decomposition can greatly expedite the ARWK kernel computation.
Here we set
$$\rho = \frac{1}{c\,(e^{\lambda(c-1)} - 1)}, \quad \text{where } c = \sum_{i}^{n_1}\sum_{j}^{n_2} [\phi_1\phi_2^{\top}]_{ij} \qquad (5.5)$$
Then we have the following theorem.
THEOREM 2 Given any two single-attribute graphs g1 and g2, the attributed random
walk kernel of these two graphs is bounded by 0 ≤ K(g1, g2) ≤ 1.
The proposed ARWK is different from the traditional random walk kernel, which simply
calculates the total number of shared random walks between two graphs, but the number of
shared walks can increase as the number of nodes/edges increase. As a result, the similarity
based on traditional random walk kernel is not bounded. In comparison, ARWK first finds
all shared nodes between two graphs (where a shared node means a node with the same
attribute in both graphs), and then calculates the ratio between the number of random walks
among shared nodes of two graphs and the number of random walks on the complete graph
formed by using all shared nodes (the number of random walks on the complete graph
is given by ρ^{-1} in Eq. (5.5), as proved in Theorem 4). We call this ratio the
Structural Integrity Ratio, which is proved to be bounded. This not only provides a bounded
measure for similarity assessment, but also provides an effective way to combine attribute
and structure information to assess similarity between two graphs under the shared node
information.
We also refer to the proposed ARWK as the Weighted Random Walk Kernel (WRWK) for single-attribute graphs. An example of the WRWK kernel is shown in Fig. 5-3.
5.4.2 Kernel on Super-Graphs
The WRWK of single-attribute graphs helps calculate the similarity between two graphs,
so it can be used to calculate the similarity between two super-nodes. Given a pair of
super-graphs G1 and G2, assume we can generate a new product graph G⊗ whose nodes
are generated by super-nodes in G1 and G2, and weight value of each node is the similarity
between the super-nodes which generate this node, then the same process shown in Sec-
tion 5.4.1 can be used to calculate the WRWK for super-graphs G1 and G2 to denote their
similarity.
DEFINITION 21 (Super Product Graph) Given two super-graphs G1 = (V1, E1, G1, F1) and G2 = (V2, E2, G2, F2), their Super Product Graph is denoted by G1⊗2 = (V∗, E∗, G∗, F∗) (G⊗ for short), where
• V∗ = {V | V = <V1, V2>, V1 ∈ V1, V2 ∈ V2};
• E∗ = {e | e = (V, V′), V ∈ V∗, V′ ∈ V∗, F∗(V) ≠ φ, F∗(V′) ≠ φ, V = <V1, V2>, V′ = <V′1, V′2>, (V1, V′1) ∈ E1, (V2, V′2) ∈ E2};
• G∗ = G1 ∪ G2;
• F∗ = {F∗(V) | F∗(V) = F(V1) ∩ F(V2), V = <V1, V2>, V1 ∈ V1, V2 ∈ V2}.
An example of a super product graph is shown in Fig. 5-3 (G⊗). Similar to Eq. (5.1), the adjacency matrix Θ⊗ of the super product graph G⊗ can be directly derived from G1(Θ1, Φ1) and G2(Θ2, Φ2) as follows:
$$\Theta_{\otimes} = (\Theta_1 \odot \Theta_2) \odot vec(\Phi_1\Phi_2^{\top}) \qquad (5.6)$$
Because we use the similarity between two super-nodes as the weight value of the node in the super product graph, the kernel value may increase infinitely with the increasing size of the super-graph. So for the super-graph kernel, we add a control variable η to limit the range of the super-graph kernel value. The Weighted Random Walk Kernel, which computes the similarity between G1 and G2, can then be defined with a sequence of weights σ = σ0, σ1, · · · (σi ∈ R and σi ≥ 0 for all i ∈ N):
$$K(G_1, G_2) = \eta \sum_{i,j=1}^{N_1 N_2}\left[\sum_{z=0}^{\infty} \sigma_z\, (w w^{\top} \odot \Theta_{\otimes})^{z}\right]_{ij} \qquad (5.7)$$
where N1 and N2 are the node sizes of G1 and G2, respectively, and
$$w = [k(g_1, g_1)\ k(g_1, g_2)\ \cdots\ k(g_{N_1}, g_{N_2})]^{\top}$$
Similar to the WRWK of single-attribute graphs in Eq. (5.4), the WRWK on super-graphs can be calculated by setting σz = γ^z/z!, which gives
$$K(G_1, G_2) = \eta \sum_{i,j=1}^{N_1 N_2}\left[\tilde{T}\, e^{\gamma \tilde{H}}\, \tilde{T}^{-1}\right]_{ij} \qquad (5.8)$$
where $w w^{\top} \odot \Theta_{\otimes} = \tilde{T}\tilde{H}\tilde{T}^{-1}$. To ensure kernel function convergence, we set
$$\eta = \frac{1}{N_1 N_2}\, e^{\frac{1 - N_1^2 N_2^2}{N_1 N_2}\,\gamma e^{2\lambda}} \qquad (5.9)$$
where γ is the parameter of the WRWK on super-graphs, and λ is the parameter of the WRWK on single-attribute graphs given in the previous subsection. The WRWK of super-graphs is also upper bounded, with the proof shown in Section 5.6.
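A companion sketch for the super-graph kernel of Eqs. (5.6)-(5.9), under the same Kronecker-product realization as the previous sketch. It takes the N1 × N2 matrix of pairwise super-node kernels (e.g. computed with arwk() above) as input, and uses the reconstruction of η in Eq. (5.9); both are assumptions of this sketch.

```python
import numpy as np

def wrwk_super(Theta1, Phi1, Theta2, Phi2, node_kernels, gamma=0.1, lam=0.1):
    """Theta*: super-graph adjacency matrices; Phi*: super attribute matrices;
    node_kernels: N1 x N2 matrix of k(g_i, g_j) over all super-node pairs."""
    N1, N2 = Theta1.shape[0], Theta2.shape[0]
    w = node_kernels.reshape(-1, order="F")          # column-stacked weights
    mask = (Phi1 @ Phi2.T > 0).reshape(-1, order="F").astype(float)
    Theta_prod = np.kron(Theta2, Theta1) * np.outer(mask, mask)
    A = np.outer(w, w) * Theta_prod                  # w w^T (.) Theta_prod
    vals, vecs = np.linalg.eigh(A)                   # A = T~ H~ T~^T
    eta = np.exp((1 - (N1 * N2) ** 2) / (N1 * N2) * gamma * np.exp(2 * lam)) \
        / (N1 * N2)                                  # Eq. (5.9) as reconstructed
    series = vecs @ np.diag(np.exp(gamma * vals)) @ vecs.T  # z starts at 0
    return eta * series.sum()
```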
5.5 SUPER-GRAPH CLASSIFICATION
The WRWK provides an effective way to measure the similarity between super-graphs.
Given a number of labeled super-graphs, we can use their pair-wise similarities to form a
kernel matrix. Then generic classifiers, such as Support Vector Machine (SVM), Decision
Tree (DT), Naive Bayes (NB) and Nearest Neighbour (NN), can be applied to the kernel
matrix for super-graph classification.
Algorithm 4 shows the framework of using WRWK to train a classifier.
Algorithm 4 WRWK Classifier Generation
Input: Labeled super-graph set DL, kernel parameters λ and γ
Output: Classifier ζ
Initialize: Kernel matrix M ← [ ]
1: for any two super-graphs Ga and Gb in DL do
2:   w⃗ ← [ ]
3:   for any gx in Ga and gy in Gb do
4:     q ← (x − 1)Nb + y  // q is the element index of w⃗
5:     if Attx ∩ Atty = φ then
6:       w⃗q ← 0
7:     else
8:       ρ ← 1/(nxny), w ← [1/(nxny), · · · , 1/(nxny)], θ⊗ ← (θx ⊙ θy) ⊙ vec(ϕxϕy⊤)
9:       w⃗q ← k(gx, gy)  // k(gx, gy) is calculated by Eq. (5.4)
10:    end if
11:  end for
12:  η ← (1/(NaNb)) e^(((1−Na²Nb²)/(NaNb)) γe^(2λ)), Θ⊗ ← (Θa ⊙ Θb) ⊙ vec(ΦaΦb⊤)
13:  Mab ← K(Ga, Gb)  // K(Ga, Gb) is calculated by Eq. (5.8)
14: end for
15: ζ ← TrainClassifier(C, M, L)  /* C is the learning algorithm; L is the class label vector of DL */
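In practice, the kernel matrix from Algorithm 4 can be fed to any kernel-based learner. A small usage sketch with scikit-learn's precomputed-kernel SVM (one of the classifiers listed above) follows; kernel_fn stands for any WRWK implementation and is a placeholder.

```python
import numpy as np
from sklearn.svm import SVC

def train_wrwk_svm(kernel_fn, supergraphs, labels):
    """Build the pairwise WRWK kernel matrix and train an SVM on it."""
    n = len(supergraphs)
    K = np.zeros((n, n))
    for a in range(n):
        for b in range(a, n):                        # the kernel is symmetric
            K[a, b] = K[b, a] = kernel_fn(supergraphs[a], supergraphs[b])
    clf = SVC(kernel="precomputed").fit(K, labels)   # SVM on the kernel matrix
    return clf, K
```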
5.6 THEORETICAL STUDY
In this section, we derive the theoretical basis to show the rationality of the WRWK.
THEOREM 3 The Weighted Random Walk Kernel function is positive definite.
Proof 1 As the random walk-based kernel is closed under products [25] and the WRWK
can be written as the limit of a polynomial series with positive coefficients (as Eqs. (5.4)
and (5.8)), the WRWK function is positive definite.
THEOREM 4 Given any two single-attribute graphs g1 and g2, the weighted random walk
kernel of these two graphs is bounded by 0 < k(g1, g2) < eλ, where λ is the parameter of
the weighted random walk kernel.
Proof 2 Because k(g1, g2) is a positive definite kernel by Theorem 3, we have k(g1, g2) > 0. We therefore only need to show that the upper bound of k(g1, g2) is e^λ.
Based on the definition of the WRWK, assume the node sizes of two single-attribute graphs g1 and g2 are n1 and n2, respectively. The number of random walks on g1 ⊗ g2 must be no greater than that on the complete graph gc with n1 × n2 nodes. We assume that gc = g′1 ⊗ g′2, where g′1 and g′2 have n1 and n2 nodes, respectively. So we have
k(g1, g2) < k(g′1, g′2). More specifically,
$$k(g_1', g_2') = \rho \sum_{i,j=1}^{n_1 n_2}\left[\sum_{z=0}^{\infty} \delta_z\, (w' w'^{\top} \odot \theta_{\otimes}')^{z}\right]_{ij} = \rho \sum_{z=0}^{\infty}\left\{\delta_z \sum_{i,j=1}^{n_1 n_2}\left[(w' w'^{\top} \odot \theta_{\otimes}')^{z}\right]_{ij}\right\},$$
because ρ = 1/(n1n2), w′ = [1/(n1n2), 1/(n1n2), · · · , 1/(n1n2)]^⊤, and gc is a complete graph, whose adjacency matrix has diagonal elements equal to 0 and all other elements equal to 1. So
$$\sum_{i,j=1}^{n_1 n_2}\left[(w' w'^{\top} \odot \theta_{\otimes}')^{z}\right]_{ij} = \left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z-1}\left(1 - \frac{1}{n_1 n_2}\right).$$
Then
$$k(g_1', g_2') = \frac{1}{n_1 n_2}\sum_{z=0}^{\infty}\left[\delta_z \left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z-1}\left(1 - \frac{1}{n_1 n_2}\right)\right] = \frac{1}{n_1 n_2}\left(1 - \frac{1}{n_1 n_2}\right)\sum_{z=0}^{\infty}\left[\delta_z \left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z-1}\right].$$
Because δz = λ^z/z!, we have
$$k(g_1', g_2') = \frac{1}{n_1 n_2}\left(1 - \frac{1}{n_1 n_2}\right)\left(\frac{n_1^2 n_2^2}{n_1 n_2 - 1}\right)\sum_{z=0}^{\infty}\left[\frac{\lambda^z}{z!}\left(\frac{n_1 n_2 - 1}{n_1^2 n_2^2}\right)^{z}\right].$$
Because $e^x = \sum_{z=0}^{\infty} x^z/z!$, we have
$$k(g_1, g_2) < k(g_1', g_2') = e^{\frac{\lambda(n_1 n_2 - 1)}{n_1^2 n_2^2}} < e^{\lambda}.$$
THEOREM 5 The weighted random walk kernel between super-graphs G1 and G2 is bounded by $0 < K(G_1, G_2) < e^{\gamma e^{2\lambda} - 2\lambda}$, where λ and γ are the weighted random walk kernel parameters for single-attribute graphs and super-graphs, respectively.
Theorem 5 can be derived in a way similar to Theorem 4.
5.7 EXPERIMENTS AND ANALYSIS
In this section we first discuss the benchmark data and the experimental settings, and then
report detailed experimental results and analysis.
5.7.1 Benchmark Data
We carry out experimental studies on real-world datasets from two different domains:
DBLP scientific publication dataset and Beer Review dataset.
• DBLP Dataset: DBLP dataset consists of bibliography data in computer science1.
Each record in DBLP is a scientific publication with a number of attributes such as
abstract, authors, year, venue, title, and references. To build super-graphs, we se-
lect papers published in Artificial Intelligence (AI: IJCAI, AAAI, NIPS, UAI, COLT,
ACL, KR, ICML, ECML and IJCNN) and Computer Vision (CV: ICCV, CVPR,
ECCV, ICPR, ICIP, ACM Multimedia and ICME) fields to form a classification task.
The goal is to predict which field (AI or CV) a paper (i.e. a super-graph) belongs
to by using the abstract of each paper (i.e. a super-node) and abstracts of references
(i.e. other super-nodes), as shown in Fig. 5-4, where the left figure shows that each
paper cites a number of references. For each paper (or each reference), its abstract
1http://arnetminer.org/citation/
Figure 5-4: An example of using super-graph representation for scientific publications.
can be converted into a graph. So each paper denotes a super-node, and the citation relationships between papers form a super-graph. The sub-figure of Fig. 5-4 to the right shows a graph representation of a paper abstract, with each node denoting a keyword: the weight values between nodes indicate correlations between keywords, and by using a proper threshold we can convert each abstract into an undirected unweighted graph. An
graph. An edge (undirected) between two super-nodes indicates a citation relation-
ship between two papers. For each paper, we use fuzzy cognitive map (E-FCM) [50]
to convert paper abstract into a graph (which represents relations between keywords
with weights over a threshold (i.e. edge-cutting threshold)). This graph representa-
tion has shown better performance than simple bag-of-words representation [4]. We
select 1000 papers (500 in each class), each of which contains 1 to 10 references, to
form 1000 super-graphs.
• Beer Review Dataset: The online beer review dataset consists of review data for
beers2. Each review in the dataset is associated with some attributes such as appear-
ance score, aroma score, palate score, and taste score (rating of the product varies
from 1 to 5), and detailed review texts. Our goal is to classify each beer to the right
style (Ale vs. Not Ale) by using customer reviews. The graph representation for re-
views is similar to the sub-graphs in DBLP dataset. Each review is represented as
a super-node. The edges between super-nodes are built using the following method: Be-
cause a review has four rating scores in appearance, aroma, palate, and taste, we use
2http://beeradvocate.com/
Figure 5-5: Super-graph and comparison graph representations.
these four scores as a feature vector for each review. If two reviews’ distance in the
feature space (Euclidean distance) is less than 2, an edge is used to link two reviews
(i.e. super-nodes). We choose 1000 beer products, half from Ale and the rest from Lager and Hybrid styles, to form 1000 super-graphs for classification.
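The edge-construction rule for the Beer Review super-graphs can be sketched as below, assuming the four rating scores of each review are stacked into a NumPy array; the function name is illustrative.

```python
import numpy as np

def review_edges(scores, thresh=2.0):
    """scores: (n, 4) array of appearance/aroma/palate/taste ratings.
    Links two reviews (super-nodes) whose Euclidean distance is below thresh."""
    n = len(scores)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(scores[i] - scores[j]) < thresh]
```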
5.7.2 Experimental Settings
Baseline Methods: Because no existing method can handle super-graph classification,
for comparison purposes, we use two approaches to generate traditional graph representa-
tions from each super-graph: (1) randomly selecting one attribute (i.e. one word) in each super-node (and ignoring all other words) to form a single-attribute graph, as shown in Fig. 5-5 (C). We repeat the random selection five times, each time generating a set of single-attribute graphs from the super-graphs. Each single-attribute graph set is used to train a classifier, and their majority-voted accuracy on the test super-graphs is reported in the experiments. (2) We use the attribute set of the single-attribute graph in each super-node as the multi-attributes of the super-node to generate an attributed graph (which is equivalent to removing all the edges inside each super-node, as shown in Fig. 5-5 (B)). For both graph representations, we use the traditional random walk kernel (RWK) to measure the similarity between any two graphs [25].
We use 10 times 10-fold cross-validation classification accuracy to measure and com-
pare the algorithm performance. To train classifiers from graph data, we use Naive Bayes
(NB), Decision Tree (DT), Support Vector Machines (SVM) and Nearest Neighbor algo-
rithm (NN). Majority examples are based on the kernel parameter settings: λ = 100 and
100
γ = 100.
All experiments are conducted on a machine with 4GB RAM and an Intel Core i5 3.10 GHz CPU.
5.7.3 Results and Analysis
Performance on standard benchmarks: In Fig. 5-6, we report the classification accu-
racy on the benchmark datasets. For DBLP dataset, the number of super-nodes in each
super-graph varies within 1-10 and the edge-cutting threshold for single-attribute graph is
0.001. For Beer Review dataset, the number of super-nodes in each super-graph varies
within 30-60 and the edge-cutting threshold is 0.001. WRWK uses the super-graph representation (Fig. 5-5 (A)); RWK1 and RWK2 use the traditional random walk kernel method with the attributed graph representation (Fig. 5-5 (B)) and the single-attribute graph representation (Fig. 5-5 (C)), respectively. The experimental results show that WRWK consistently outperforms the traditional RWK method, regardless of the type of graph representation used
by RWK. This is mainly because in traditional graph representations each node only has
one attribute or a set of independent attributes, whereas single-attribute nodes (or multiple
independent attributes) cannot precisely describe the node content. For super-graphs, the
graph associated to each super-node provides an effective way to describe the node content.
In WRWK method, we consider the similarity between any two super-nodes as the weight
value instead of just using 1/0 hard matching to represent whether there is an intersection
of attribute sets between two nodes. The soft matching node similarities between super-
nodes capture the relationship between two graphs in a more precise way. This, in turn,
helps improve the classification accuracy. Moreover, from Fig. 5-6 we can see that the accuracy increases with increasing information content in the nodes.
Performance under different super-graph structures: To demonstrate the performance
of our WRWK method on super-graphs with different characteristics, we construct super-
graphs on Beer Review dataset by using different super-node sizes and different structures
of single-attribute graph (the structure of the single-attribute node is controlled by the edge-
cutting threshold) as shown in Fig. 5-7 (a)-(d). Figures correspond to three types of super-
Figure 5-6: Classification accuracy on DBLP and Beer Review datasets w.r.t. different classification methods (NB, DT, SVM, and NN).
graph structure on Beer Review. Data 1: the number of super-nodes in each super-graph
is 2-30 and the edge-cutting threshold is 0.00001. Data 2: the number of super-nodes in
each super-graph is 30-60 and the edge-cutting threshold is 0.001. Data 3: the number of
super-nodes in each super-graph is > 60 and the threshold is 0.1. The result shows that our
WRWK method is stable on super-graphs with different structures.
Performance w.r.t. changes in super-nodes and walks: The proposed weighted random
walk kernel relies on the similarity between super-nodes and the common walks between
two super-graphs to calculate the graph similarity. This raises a concern on whether super-
node similarity or walk similarity (or both) plays a more important role in assessing the
super-graph similarity.
In order to resolve this concern, we design the following experiments. In the first set
of experiments (Fig. 5-8 (a) and (b)), we fix super-graph edges, and change edges inside
super-nodes, which will impact the super-node similarities. If this results in significant
changes, it means that super-node similarity plays a more important role. In the second
set of experiments (Fig. 5-8 (c)), we fix super-nodes but vary the edges in super-graphs
(by randomly removing edges), which will impact the common walks between super-
graphs. If this results in significant changes, it means that walk plays a more important role
than super-nodes.
Fig. 5-8 (a) and (b) report the algorithm performance with respect to the edge-cutting
Figure 5-7: Classification accuracy on Beer Review dataset w.r.t. different datasets and classification methods (NB, DT, SVM, and NN).
threshold. The accuracy decreases dramatically with the increase of the edge-cutting thresh-
old for both datasets. When the edge-cutting threshold is set to 0.0001 on DBLP and
0.00001 on Beer Review, the single-attribute graph in each super-node is almost a com-
plete graph and four methods achieve the highest classification accuracy. As the threshold
is set to 0.01 on DBLP and 0.1 on Beer Review, the single-attribute graph in each super-
node is very small and contains very few edges. As a result, the accuracies are just around
65%. This demonstrates that WRWK heavily relies on the structure information of each
super-node to assess the super-graph similarities.
Figure 5-8: The performance w.r.t. different edge-cutting thresholds on DBLP and Beer Review datasets by using the WRWK method.

Fig. 5-8 (c) reports the algorithm performance by fixing the super-nodes but gradually removing edges in the super-graphs of the Beer Review dataset (as in Fig. 5-6 (b)). In (a) and (b), the number of super-nodes in each super-graph varies from 0-10 on the DBLP dataset and 30-60 on the Beer Review dataset. The x-axis shows the value of the edge-cutting threshold, and the average node degree in the corresponding single-attribute graph is reported in parentheses. In
(c), the average number of edges cut from each super-graph varies from 0 to 100. Simi-
lar to the result in Fig. 5-8 (a) and (b), the classification accuracy decreases when edges
are continuously removed from the super-graph (even if the super-nodes are fixed). From Fig. 5-8, we find that the structure of the super-graph is as important as that of the super-nodes.
This is mainly attributed to the fact that our weighted random walk kernel relies on both
super-node similarity and walks in the super-graph to calculate graph similarities.
Chapter 6
Streaming Network Node Classification
6.1 INTRODUCTION
Recent years have witnessed an increasing number of applications involving networked
data, where instances are not only characterized by the content but are also subject to de-
pendency relationships. For example, each node in a social network can denote one person
and links between nodes can represent their social interactions. For each node in the net-
work, its content can be described using features, such as the bag-of-word representation of
the user’s posts. The mixed node content and structure information raise many unique data
mining tasks, such as network node classification [3] where the goal is to combine node
content and network topology structures to classify unlabeled nodes in the network with a
maximal accuracy. Applications of network node classification include social spammer de-
tection [70], automatic email categorization [52], inferring personality from social network
structures [68], and image classification using social networks [51].
When classifying nodes in networks, existing methods can be roughly categorized into
three groups: (1) combining content and structure features into new feature vector repre-
sentation, such as iterative collective classification [61], and link-based classification [49];
(2) using network paths, such as random walks [18], to determine node labels; and (3)
using content information to build additional structure nodes and generate a new topology
network for classification [2]. The theme of all these methods is to leverage node content
and topology structures to infer correct labels for unlabeled nodes.
Figure 6-1: An example of streaming networks, where each color bar denotes a feature.
Existing node classification methods are carried out in a static network setting, with-
out taking evolving network structures and node concepts into consideration. In reality,
changes are essential components in real-world networks, mainly because user participa-
tion, interactions, and responses to external factors continuously introduce new nodes and
edges to the network. In addition, a user may add/delete/modify online posts, which naturally results in changes to the node content. As a result, the networks are inherently dynamic. In this thesis, we refer to this type of network, where the network structures and node content are continuously changing, as Streaming Networks. An example of streaming networks
is shown in Fig. 6-1, where the topology structures and the node feature distributions are
constantly changing with time. Specifically, at time point t2, new nodes (e.g., 4 and 5) and
relevant edges join the network; At t3, Node 5 and the edge between Nodes 1 and 2 are
removed. Over the whole period, node content may continuously change (e.g., the content
change in Node 3). Accurate node classification in a streaming network setting is therefore
much more challenging than in static networks: existing node classification methods for static networks cannot capture the changing information that may alter classification results, while rerunning such methods at every time point is prohibitively time-consuming. In summary, node classification in streaming networks
has at least three major challenges:
• Streaming network structures: Network topology structures encode rich informa-
tion about node interactions inside the network, which should be considered for node
classification. In streaming networks, topology structures are constantly changing, so
node classification needs to rapidly capture and adapt to such changes for maximal
accuracy gain.
• Streaming node features: For each node in a streaming network, its content may constantly evolve (w.r.t. user posts or profile updates). As a result, the feature space used to denote the node content is dynamically changing, resulting in streaming features [78] with an infinite feature space. To capture changes, a feature selection method should select the most effective features in a timely manner to ensure that node classification can quickly adapt to the new network.
• Continuously Changing Network Volumes: Because the node volumes of streaming networks are continuously changing, node classification needs to scale to the dynamic growth of the network volume by utilizing models discovered from historical data to boost the learning on new data.
For streaming networks, changes are introduced through two major channels: (1) node
content; and (2) topology structures. To achieve a high node classification accuracy, a
fundamental issue is how to properly characterize such changes. In this thesis, we propose
to address this issue using a feature driven framework, which uses node content features
to model and capture network changes for classification. Fig. 6-2 demonstrates how node
features can be used to characterize changes in the network. More specifically, nodes and
edges with solid lines denote network observed at time point t, while dashed circles and
edges mean networked data arriving at t+1. Nodes and edges with curved lines are removed
at t + 1, and the underlined features (keywords) are also removed at t + 1. Nodes are
colored based on their class labels, and white nodes mean unlabeled nodes. At time point
t, the keywords "System" and "Network" are selected as features to represent the node content (assuming the number of features is limited to 2) and classify nodes into two classes (red vs. cyan). At t + 1, the network changes, incurring new structures and node content. By updating the selected features and using "Spectrum" to replace "System", the new feature space {"Network", "Spectrum"} can effectively classify unlabeled nodes into the right classes.
The above observations motivate the proposed research that uses features to capture
changes in streaming networks for node classification. When a network is experiencing
Figure 6-2: An example of using feature selection to capture changes in a streaming net-
work (keywords inside each node denote node content).
changes in the topology structures and node content, we can try to identify a set of important
features that can best reveal such changed network structures and node content. Because
in a networked world, nodes close to each other in the network topology structure space
tend to share common content information [26], we can use selected features to design a
“similarity gauging” procedure to assess the consistency of the network node content and
structures in order to determine the labels of unlabeled nodes. A smaller gauging value
indicates that the node content and structures have a better alignment with the node label. So
the gauging based classification is carried out such that for an unlabeled node, its label is
the class which results in the minimal gauging value with respect to the identified features.
By updating the selected features, the node classification can automatically adapt to the
changes in the streaming network for maximal accuracy gain.
The main contribution of the chapter, compared to existing works, is twofold:
• Streaming Network Node Classification: We propose a new streaming network
node classification (SNOC) method that takes node content and structure similarity
into consideration to find important features to model changes in the network for
node classification. This method is not only more accurate than existing node clas-
sification approaches, but is also effective to capture changes in networks for node
classification.
• Streaming Network Feature Selection: We introduce a novel streaming network
feature selection framework, SNF, for streaming networks. To ensure that feature evaluation can adapt to the changes in the network in a timely manner, SNF can incrementally update
the evaluation score of an existing feature by accumulating changes in the network.
This allows our method to effectively handle streaming networks with changing fea-
ture space and feature distributions for better runtime and performance gain.
The remainder of the chapter is structured as follows. The problem definition and the
overall framework are introduced in Sect. 6.2. Sect. 6.3 introduces the proposed node
classification method, followed by experiments in Sect. 6.4.
6.2 PROBLEM DEFINITION AND FRAMEWORK
A streaming network contains a dynamic number of nodes and edges, and the node content
may also change in terms of new features or new feature values. At a given time point $t$, the network nodes are denoted by $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{n_t}$, where $x_i \in \mathbb{R}^{d_t}$ is the original feature vector of node $i$, and $y_i \in \mathcal{Y} = \{0, 1, 2, \ldots, c\}$ is the label of node $i$; $n_t$ and $d_t$ denote the number of nodes and the dimensionality of the node feature space at time point $t$, which may vary with time. Specifically, $y_i = 0$ means that node $i$ is unlabeled. $A \in \mathbb{R}^{n_t \times n_t}$ is the adjacency matrix of the networked data, where $A_{ij} = 1$ if there is an edge (link) between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. A path $P_{ij}$ between nodes $i$ and $j$ is a sequence of edges, starting at $i$ and ending at $j$; the length of a path is the number of edges on it. For each adjacency matrix $A$, the element $[A^k]_{ij}$ of the $k$-th power matrix denotes the number of length-$k$ paths from $i$ to $j$ in the network [25].
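This walk-counting property of adjacency-matrix powers is easy to verify numerically; the following minimal numpy sketch uses a hypothetical 4-node graph ($[A^k]_{ij}$ counts walks, which may revisit nodes).

import numpy as np

# Adjacency matrix of a small undirected 4-node network
# with edges 0-1, 0-2, 1-2, and 2-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

A2 = np.linalg.matrix_power(A, 2)   # [A^2]_ij: length-2 paths from i to j
A3 = np.linalg.matrix_power(A, 3)   # [A^3]_ij: length-3 paths from i to j
print(A2[0, 3], A3[0, 3])           # 1 1, e.g. the length-2 path 0-2-3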
To represent network node content, we use $\mathcal{F} = \{f^1, \ldots, f^r, \ldots, f^{d_t}\}$ to denote the node feature space at time point $t$, where the feature dimension $d_t$ changes dramatically with time $t$. We use $X = [x_1, x_2, \ldots, x_{n_t}] = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_{d_t}]^\top \in \mathbb{R}^{d_t \times n_t}$ to represent the data matrix, and $C \in \mathbb{R}^{n_t \times n_t}$ represents the label relationship matrix of the networked data, where $C_{ij} = 1$ means nodes $i$ and $j$ are in the same class, and $C_{ij} = 0$ otherwise. We use $f^r$ to denote a feature, and use the bold-faced $\mathbf{f}_r$ to represent the indicator vector of feature $f^r$, where $[\mathbf{f}_r]_j$ records the actual value of feature $f^r$ in node $j$. In a binary feature representation (such as the bag-of-word representation for text), we have $[\mathbf{f}_r]_j = 1$ if feature $f^r$ appears in node $j$, and $[\mathbf{f}_r]_j = 0$ otherwise. Obviously, $\mathbf{f}_r$ helps capture the distribution of feature $f^r$ in the network.
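As a toy illustration (the keywords are hypothetical and not from any dataset used in this thesis), the indicator vectors are simply the rows of the binary data matrix:

import numpy as np

# Toy node contents as keyword sets; vocab plays the role of F.
docs = [{"system", "network"}, {"network"}, {"spectrum", "network"}]
vocab = sorted({w for d in docs for w in d})

# Binary data matrix X (features x nodes): [X]_rj = 1 iff feature f^r
# appears in node j, so row r is the indicator vector f_r.
X = np.array([[1 if w in d else 0 for d in docs] for w in vocab])
print(vocab)
print(X[vocab.index("network")])    # distribution of "network" over nodes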
Streaming network node classification aims to classify unlabeled nodes in the network,
at any time point t, with maximal accuracy. To capture dynamic changes of the network,
we propose to use feature selection to discover, in a timely manner, a feature subset $\mathcal{S}$ of size $m$ from $\mathcal{F}$. When discovering the feature set $\mathcal{S}$, both the node content and the network structures are
combined to find the most informative features at each single time point t. As a result, the
node classification can adapt to the changes in the network to achieve maximal accuracy.
Fig. 6-3 shows the framework of our streaming network node classification method
(SNOC). More specifically, Panel A: at time point t, the network is denoted by nodes and
edges with solid lines. Dashed nodes and edges denote new nodes and edges arriving at
time point t + 1. Colour bars in nodes denote different features, and curved bars mean that a feature appeared at t but is removed at t + 1, such as the purple bar in Node 3. Nodes and edges with curved lines exist at t but are removed at t + 1, like Node 4. Panel B: at time point t, candidate features and selected features are identified based on the F-Score $q_t(f^r)$. At time point t + 1, streaming network feature selection (SNF) updates the scores of old features (the candidate features at t) and also calculates feature scores for new features. Panel C: at time point t + 1, SNOC uses the selected features as a gauge to test whether to classify the unlabeled Node 5 as positive (+) or negative (-); the label with the smallest gauging value is used to label Node 5. To capture changes in streaming networks, an incremental streaming network feature selection method, SNF, is proposed to discover, in a timely manner, a set of the most informative features in the network. To classify unlabeled nodes, SNOC takes both label similarity and structure similarity into consideration and uses a quality criterion to find the most suitable label for an unlabeled node.
6.3 THE PROPOSED METHOD
To classify unlabeled nodes in a streaming network, our theme is to let (1) nodes sharing
the same class and having a high structure similarity be close to each other, and (2) nodes
belonging to different classes and having a weak structure relationship be far away from
Figure 6-3: The framework of the proposed streaming network node classification (SNOC)
method.
each other. This is motivated by the commonly observed phenomenon [26] that nodes close
to each other in network topology structures tend to share common content information. For
example, friends in the same cohort group are more likely to share similar experiences or
interests, and a paper and its references often contain relevant research subjects/topics. Our
proposed theme is also consistent with the relational collective inference modeling [34] that
uses relationships between class labels and attributes of neighboring objects in the network
for classification.
Following the above theme, we can regard streaming network node classification as
an optimization problem, which tries to find the optimal assignment of class labels to the unlabeled node set $\mathcal{X}^u$, such that the assigned class labels $\mathcal{Y}^u \subseteq \mathcal{Y}$ make the whole network maximally comply with the proposed theme, as defined in Eq. (6.1).
\[
\mathcal{Y}^{u*} = \arg\min_{\mathcal{Y}^u \subseteq \mathcal{Y}} E(\mathcal{Y}^u) \qquad (6.1)
\]
where $\mathcal{Y}^u$ is an assignment of labels to the unlabeled nodes in the network, and $\mathcal{Y}^{u*}$ is the
Following the node classification objective function in Eq.(6.1), the key question is
how to properly define utility function E(·). Clearly, the node content provides valuable
information to determine the label of each node, so we should define E(·) based on the
node content Feature Space. In streaming networks, the feature space used to denote
the node content is continuously changing with new features or updated feature values.
Using all features to represent the network is clearly suboptimal. If a set of good features
can be found to capture changes in the network, the node classification in Eq.(6.1) will
automatically adapt to the changes in the network for maximal accuracy. So Eq.(6.1) is
re-written as
\[
\mathcal{Y}^{u*} = \arg\min_{\mathcal{Y}^u \subseteq \mathcal{Y}} E(\mathcal{Y}^u, \mathcal{S}) \qquad (6.2)
\]
where $\mathcal{S}$ is the selected feature set used to capture changes in a streaming network.
Because the utility function E(Yu,S) is constrained by the selected features S, finding
the optimal S becomes the next challenge. Obviously, a good S should properly capture
network node relationships in terms of the node content, node labels, and node topology
structures. That means the node relationships in Feature Space should follow our proposed
theme with the node similarity being impacted by (1) the label-based similarity in the Label
Space; and (2) the structure-based similarity in the Structure Space.
Accordingly, node classification in Eq. (6.2) can be divided into two major steps: (1) finding an optimal feature set $\mathcal{S}$; and (2) finding an optimal assignment of labels to unlabeled nodes such that the utility score $E(\mathcal{Y}^u, \mathcal{S})$ calculated based on the selected features $\mathcal{S}$ has the minimal value. Therefore, we derive an updated evaluation criterion $E(\mathcal{Y}^u, \mathcal{S})$ as follows:
\[
\begin{aligned}
E(\mathcal{Y}^u, \mathcal{S}) &= \frac{1}{2} \sum_{i \in \mathcal{X}^u} \sum_{j \in \mathcal{X}} h(i, j, y_i)\, (D_{\mathcal{S}} x_i - D_{\mathcal{S}} x_j)^2 \\
\text{s.t.}\;\; &\min\Big( \frac{1}{2} \sum_{i, j \in \mathcal{X}} h(i, j)\, (D_{\mathcal{S}} x_i - D_{\mathcal{S}} x_j)^2 \Big), \quad \mathcal{S} \subseteq \mathcal{F},\; |\mathcal{S}| = m
\end{aligned} \qquad (6.3)
\]
where $h(i, j)$ is the similarity between nodes $i$ and $j$ in the network structure space, which will be formally defined in Eq. (6.8), and $h(i, j, y_i)$ is the similarity between nodes $i$ and $j$ conditioned on setting the label of unlabeled node $i$ to $y_i$. In Eq. (6.3), $(D_{\mathcal{S}} x_i - D_{\mathcal{S}} x_j)^2$ measures the feature-based distance between nodes $i$ and $j$ w.r.t. the currently selected features $\mathcal{S}$. $D_{\mathcal{S}}$ is a diagonal matrix indicating the features (from $\mathcal{F}$) that are selected into the feature set $\mathcal{S}$, where
\[
[D_{\mathcal{S}}]_{ij} =
\begin{cases}
1, & \text{if } i = j \text{ and } f^i \in \mathcal{S}; \\
0, & \text{otherwise.}
\end{cases}
\]
In Eq.(6.3), we use network structure similarity h(i, j, yi) as the weight value of the
node feature distance (DSxi − DSxj)2. If nodes i and j have a high structure similarity,
their feature distance will have a large weight value and therefore plays a more important
role in the objective function. In an extreme case, if nodes i and j have a zero structure
similarity, their feature distance will not have any impact on the objective function at all. By
doing so, we can effectively combine node structure similarity and node content distance
to assess the consistency of the whole network.
For streaming networks, the selected feature set S should be dynamically updated to
capture changes in the network. By using dynamic feature set S to guide the node classi-
fication, Eq.(6.3) provides an efficient way to classify nodes in dynamic networks. This is
mainly because that any significant changes in the network will be captured by S , and by
using S as gauge for node classification, our method can automatically adapt to the changes
in the network structures and node content.
Solving the objective function in Eq. (6.3) requires optimizing both variables ($D_{\mathcal{S}}$ and $\mathcal{Y}^u$). To solve Eq. (6.3), we divide the process into two parts: (1) we propose a novel streaming network feature selection framework, SNF, which takes both network structures and node labels into consideration to find the optimal feature set $\mathcal{S}$; and (2) we propose a Laplacian-based quality criterion to grade an unlabeled node with respect to different labels by using $\mathcal{S}$ as the gauge. Finally, node classification is achieved by finding the best labels, which result in the minimal gauging values.
6.3.1 Streaming Network Feature Selection
Given a streaming network, the network observed at a single time point t can be consid-
ered as a static network. In this subsection, we first introduce feature selection on a static
network, and then extend to streaming networks.
Feature Selection on a Static Network
We first define feature selection as an optimization problem. Our target is to find an optimal
set of features, which can best represent the network node content and structures.
The network edges and node labels both play important, yet different, roles for node
classification. We assume that the optimal feature set should have the following properties:
• Label Similarity: a) labeled nodes in the same class should be close to each other,
and labeled nodes in different classes should be far away from each other; b) unla-
beled nodes should be separated from each other.
• Structure Similarity: The structure similarity between nodes i and j is closely tied
to the number of paths and the path length between them. The more the number of
length-l paths between i and j, the higher their structure similarity is. The shorter the
path length between two nodes, the higher their structure similarity is.
Note that Item b) in the first bullet incorporates the distributions of unlabeled nodes,
and tends to select features that can separate nodes far from each other. It is similar to
the assumption of Principal Component Analysis, which is expressed as the average
squared distance between unlabeled samples [88]. Item b) intends to disfavor features that
are too rare or too frequent in the data set, because unlabeled nodes cannot be separated
from each other using these features [42].
The above two properties can be formalized as follows:
(1) Minimizing the Label Similarity Objective Function:
\[
J_L(f^r) = \frac{1}{2} \sum_{C_{ij}=1} (V_r^\top x_i - V_r^\top x_j)^2 - \frac{1}{2c} \sum_{C_{ij}=0} (V_r^\top x_i - V_r^\top x_j)^2 \qquad (6.4)
\]
where $c$ is the total number of classes, and $V_r$ is an indicator vector showing that feature $f^r$ is selected, defined as $[V_r]_i = 1$ if $i = r$, and $[V_r]_i = 0$ otherwise.

(2) Minimizing the Structure Similarity Objective Function:
\[
J_S(f^r) = \frac{1}{2} \sum_{i,j=1}^{n_t} \Theta_{ij}\, (V_r^\top x_i - V_r^\top x_j)^2 \qquad (6.5)
\]
where $\Theta_{ij}$ in Eq. (6.5) is the $l$-maximal-length path weight parameter between nodes $i$ and $j$, defined as follows:
\[
\Theta = \sum_{i=1}^{l} \frac{1}{2^{i-1}} A^i \qquad (6.6)
\]
Figure 6-4: An example of using feature selection to capture structure similarity.
The number of paths between two nodes has proven to be a good indicator of node structure similarity. The shorter the path between two nodes, the closer the two nodes are in structure, so the weight in Eq. (6.6) decreases as the path length increases. An example is shown in Fig. 6-4. More specifically, the left panel shows the network in the original feature space and the right panel shows the network in the selected feature space (which contains m = 6 features). In the right panel, Node 1 shares more paths with Node 3 than with Node 7, and the paths between Node 1 and Node 3 are shorter than the ones between Node 1 and Node 7. So Nodes 1 and 3 are closer to each other than Nodes 1 and 7 from the structure similarity perspective. The structure similarity is tied to the representation of the nodes in the selected feature space. If two nodes share an edge, they will be close to each other in the selected feature space (e.g. Node 1 and Node 5 share one edge, so they have three common features in the selected feature space). As the path length between two nodes increases, the distance between the nodes in the feature space also increases (e.g. Node 1 and Node 6 have no edge, so they share no common feature in the selected feature space).
In Eq. (6.5), Θ is used as a penalty factor for two nodes that have high structure simi-
larity but are far away from each other in feature space. Intuitively, nodes close in topology
structure have a high probability of sharing similar node content [26]. So if any two nodes
i and j are close to each other in topology structure but have a large distance in the original
feature space, their $\Theta_{ij}$ value will increase the objective value and thus encourage the feature selection module to find similar features for $i$ and $j$. This provides a unique way to impose
network topology structures into the node feature selection process.
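A minimal numpy sketch of the structure weights in Eq. (6.6), assuming dense matrices (the function name is hypothetical):

import numpy as np

def structure_weights(A, l=3):
    # Theta = sum_{i=1..l} (1 / 2^{i-1}) A^i  (Eq. 6.6): path counts up
    # to length l, with longer paths contributing smaller weights.
    theta = np.zeros_like(A, dtype=float)
    power = np.eye(A.shape[0])
    for i in range(1, l + 1):
        power = power @ A                 # A^i counts length-i paths
        theta += power / 2.0 ** (i - 1)
    return theta

A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(structure_weights(A, l=3))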
By combining the label similarity objective function in Eq. (6.4) and the structure similarity objective function in Eq. (6.5), we can form a combined evaluation criterion for each feature $f^r$ as follows:
\[
\mathcal{J}(f^r) = \xi \cdot J_L(f^r) + (1 - \xi) \cdot J_S(f^r) \qquad (6.7)
\]
where ξ (0 ≤ ξ ≤ 1) is the weight parameter used to balance the contributions of
network structures and node labels. The ξ value allows users to fine-tune structure and label
similarity in the feature selection for networks from different domains. In Section 6.4, we
will report the algorithm performance w.r.t. different ξ values on benchmark networks. An
example of using feature selection to capture structure similarity is also shown in Figure 6-
4.
By defining a weight matrix $W = [W_{ij}]_{n_t \times n_t}$ as
\[
W_{ij} = [\xi,\; \xi/c] \cdot [C_{ij},\; C_{ij} - 1]^\top + (1 - \xi) \cdot \Theta_{ij} \qquad (6.8)
\]
we can rewrite Eq. (6.7) as follows:
\[
\begin{aligned}
\mathcal{J}(f^r) &= \frac{1}{2} \sum_{i,j=1}^{n_t} (V_r^\top x_i - V_r^\top x_j)^2\, W_{ij} = \frac{1}{2} \sum_{i,j=1}^{n_t} ([\mathbf{f}_r]_i - [\mathbf{f}_r]_j)^2\, W_{ij} \\
&= \mathbf{f}_r^\top D\, \mathbf{f}_r - \mathbf{f}_r^\top W\, \mathbf{f}_r = \mathbf{f}_r^\top L\, \mathbf{f}_r
\end{aligned} \qquad (6.9)
\]
where $D$ is a diagonal matrix whose entries are the column sums of $W$, i.e., $D_{ii} = \sum_j W_{ij}$, and $L = D - W$ is a Laplacian matrix.

In Eq. (6.8), $W_{ij}$ is equal to the structure similarity matrix $h(i, j)$ in Eq. (6.3), so the constraint part in Eq. (6.3) is equivalent to minimizing $\sum_{f^r \in \mathcal{S}} \mathcal{J}(f^r)$.
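The weight construction in Eq. (6.8) and the identity in Eq. (6.9) can be checked numerically; below is a small sketch with toy matrices (all values illustrative):

import numpy as np

def combined_weights(C, theta, xi=0.7, c=2):
    # Eq. (6.8): W_ij = xi*C_ij + (xi/c)*(C_ij - 1) + (1 - xi)*Theta_ij,
    # i.e. +xi for same-class pairs, -xi/c for different-class pairs,
    # plus the structure term.
    return xi * C + (xi / c) * (C - 1) + (1 - xi) * theta

C = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
theta = np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 1.0], [0.5, 1.0, 0.0]])
W = combined_weights(C, theta)
L = np.diag(W.sum(axis=1)) - W              # Laplacian L = D - W
f = np.array([1.0, 1.0, 0.0])               # indicator vector of a feature

# f^T L f equals (1/2) * sum_ij (f_i - f_j)^2 W_ij  (Eq. 6.9).
lhs = f @ L @ f
rhs = 0.5 * sum((f[i] - f[j]) ** 2 * W[i, j]
                for i in range(3) for j in range(3))
print(lhs, rhs)   # the two values coincide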
As a result, the problem of feature selection in a static network is equivalent to finding a subset $\mathcal{S}$ containing $m$ features that satisfies:
\[
\min \sum_{f^r \in \mathcal{S}} \mathcal{J}(f^r), \quad \text{s.t. } \mathcal{S} \subseteq \mathcal{F},\; |\mathcal{S}| = m \qquad (6.10)
\]

DEFINITION 22 (F-Score) Let $X = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_{d_t}]^\top$ represent the networked data, and let $W$ be the matrix defined in Eq. (6.8). $L$ is the Laplacian matrix defined as $L = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. We define a quality criterion $q$, called F-Score, for a feature $f^r$ as
\[
q(f^r) = \mathbf{f}_r^\top L\, \mathbf{f}_r \qquad (6.11)
\]

The solution to Eq. (6.10) can be found by using the F-Score to assess features in the original feature space $\mathcal{F}$. Suppose the F-Scores of all features are sorted in ascending order, $q(f^1) \le q(f^2) \le \cdots \le q(f^{d_t})$; then the solution of finding the $m$ most informative features is
\[
\mathcal{S} = \{f^r \mid r \le m\} \qquad (6.12)
\]
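Putting Eqs. (6.10)-(6.12) together, static-network feature selection reduces to scoring every indicator vector against the Laplacian and keeping the $m$ smallest scores; a minimal sketch with hypothetical toy inputs:

import numpy as np

def select_features(X, W, m):
    # Score every feature with q(f^r) = f_r^T L f_r (Eq. 6.11) and
    # return the indices of the m smallest scores (Eq. 6.12).
    # X is the (d x n) data matrix whose rows are indicator vectors.
    L = np.diag(W.sum(axis=1)) - W
    scores = np.einsum('rn,nm,rm->r', X, L, X)   # all q(f^r) at once
    return np.argsort(scores)[:m], scores

X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
W = np.ones((4, 4)) - np.eye(4)               # toy weight matrix
selected, scores = select_features(X, W, m=2)
print(selected, scores)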
Feature Selection on Streaming Networks
When the network continuously evolves at different time points T = {t1, t2, . . . }, network
structure, including edges and nodes, and node features may change accordingly. So we
need to adjust the selected feature set S in order to characterize changed network. Com-
pletely rerunning the feature selection at each single time point from the scratch is time
consuming, especially for large size networks. In this section, we introduce an incremen-
tal feature selection method, which calculates the score of an old feature based on new
networked data and then combines it with the old feature scores to update the feature’s
final score. Such an incremental feature selection process enables our method to tackle the “Continuously Changing Network Volumes” challenge.
To incrementally update scores for old features, we separate networked data into two
parts: a) nodes and edges that already exist at time point t; and b) newly emerged (or disappeared) nodes and their relevant topology structures at t + 1. After that, we use Part
a) to get the changing parts of feature distributions in the old networks and use Part b) to
calculate local incremental scores and update the scores of existing features, respectively.
If the changed score of an old feature at t + 1 can be obtained by using Part a) and Part b)
efficiently, we can compute a feature score by combining its old score at t and the changed
score at t+ 1.
For ease of presentation, we define the following notations:

• A subscript $t$ or $t+1$ on a matrix (or a vector) denotes the time point $t$ or $t+1$ of the matrix (or vector).

• $\mathbf{f}_t^r$ and $\mathbf{f}_{t+1}^r$ denote the indicator vectors of feature $f^r$ in the network at time points $t$ and $t+1$, respectively, where $\mathbf{f}_t^r \in \mathbb{R}^{n_t \times 1}$ and $\mathbf{f}_{t+1}^r \in \mathbb{R}^{n_{t+1} \times 1}$. We further define $\mathbf{f}^{r\prime}_{t+1} \in \mathbb{R}^{n_t \times 1}$ as $[\mathbf{f}^{r\prime}_{t+1}]_i = [\mathbf{f}_{t+1}^r]_i$, where $1 \le i \le n_t$.

• $\Delta n$ denotes the number of newly arrived nodes (from time point $t$ to $t+1$).

• $W^o$ denotes the weight matrix, defined in Eq. (6.8), between the new nodes arriving at time point $t+1$ and the old nodes that already existed at time point $t$.

• $W^c$ denotes the changed weight matrix, from time point $t$ to $t+1$, between the old nodes that already existed at time point $t$.

• $W^{\Delta n}$ denotes the weight matrix between the new nodes that arrived at time point $t+1$.

So the weight matrix of the networked data at time point $t+1$ is $W_{t+1}$, and the updated part between $t$ and $t+1$ is $W^\Delta_{t+1}$, i.e.
\[
W_{t+1} =
\begin{bmatrix}
W_t + W^c & W^o \\
(W^o)^\top & W^{\Delta n}
\end{bmatrix}, \qquad
W^\Delta_{t+1} =
\begin{bmatrix}
W^c & W^o \\
(W^o)^\top & W^{\Delta n}
\end{bmatrix}.
\]
Then the F-Score of an old feature $f^r$ at time point $t+1$ is
\[
\begin{aligned}
q_{t+1}(f^r) &= \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} \big([\mathbf{f}_{t+1}^r]_i - [\mathbf{f}_{t+1}^r]_j\big)^2\, [W_{t+1}]_{ij} \\
&= \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} \big([\mathbf{f}_{t+1}^r]_i - [\mathbf{f}_{t+1}^r]_j\big)^2 \Bigg( \begin{bmatrix} W_t & 0 \\ 0 & 0 \end{bmatrix}_{ij} + [W^\Delta_{t+1}]_{ij} \Bigg) \\
&= (\mathbf{f}^{r\prime}_{t+1})^\top L_t\, \mathbf{f}^{r\prime}_{t+1} + \frac{1}{2} \sum_{i,j=1}^{n_{t+1}} \big([\mathbf{f}_{t+1}^r]_i - [\mathbf{f}_{t+1}^r]_j\big)^2\, [W^\Delta_{t+1}]_{ij} \\
&= (\mathbf{f}_t^r)^\top L_t\, \mathbf{f}_t^r + (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r)^\top L_t\, (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r) + (\mathbf{f}_{t+1}^r)^\top L^\Delta_{t+1}\, \mathbf{f}_{t+1}^r
\end{aligned} \qquad (6.13)
\]
In Eq. (6.13), $q_{t+1}(f^r)$ contains three parts. The first term is $q_t(f^r)$, and the last two terms are the changed scores at $t+1$, which correspond to Part a) and Part b), respectively. Formally,
\[
q^\Delta_{t+1}(f^r) = (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r)^\top L_t\, (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r) + (\mathbf{f}_{t+1}^r)^\top L^\Delta_{t+1}\, \mathbf{f}_{t+1}^r \qquad (6.14)
\]
where $L^\Delta_{t+1}$ is the Laplacian matrix of $W^\Delta_{t+1}$. We calculate $W^\Delta_{t+1}$ by using the changed part of the network (including nodes and edges) as follows:
\[
W^\Delta_{t+1} = \xi\, W^{\Delta(L)}_{t+1} + (1 - \xi)\, W^{\Delta(S)}_{t+1} \qquad (6.15)
\]
$W^{\Delta(L)}_{t+1}$ and $W^{\Delta(S)}_{t+1}$ are used to calculate the changed parts of the label relationships and the structure relationships, respectively, from time point $t$ to $t+1$:
\[
[W^{\Delta(L)}_{t+1}]_{ij} =
\begin{cases}
[1,\; 1/c] \cdot [C_{ij},\; C_{ij} - 1]^\top, & \text{if } i \text{ or } j \in \Delta n; \\
0, & \text{otherwise,}
\end{cases}
\]
and
\[
W^{\Delta(S)}_{t+1} = \Theta_{t+1} - \Theta_t,
\]
where $[W^{\Delta(L)}_{t+1}]_{ij}$ denotes the incremental label similarity weight between nodes $i$ and $j$, and $[W^{\Delta(S)}_{t+1}]_{ij}$ is the incremental $l$-length path weight between nodes $i$ and $j$. Both of them are “incrementally” calculated by only using the changed parts of the streaming networks.
As a result, we can obtain the new score $q_{t+1}(f^r)$ by adding $q^\Delta_{t+1}(f^r)$ to $q_t(f^r)$, with the new scores of old features being used in the final feature selection process. When the streaming network changes with time, an old feature $f^r$'s changed score, $q^\Delta_{t+1}(f^r)$, can be incrementally calculated by using only the part of the network that changed at $t+1$ compared to time point $t$, which allows SNF to efficiently update feature scores for large-scale dynamic networks.
For streaming features with an infinite feature space, it is infeasible to keep all feature scores for future comparison. So SNF maintains a small feature set, called the candidate feature set, for future comparisons, i.e. $\mathcal{T} = \{f^1, f^2, \ldots, f^m, f^{m+1}, \ldots, f^k\}$, where $q(f^1) \le q(f^2) \le \cdots \le q(f^k)$ and $q(f^k) \le 2\,q(f^m)$. This setting ensures that the discarded features are very unlikely to be selected at the next time point. So SNF always keeps a candidate feature set with a dynamic size $k$ and discards less informative features. For all new features appearing in the new nodes, SNF calculates their feature scores in order to ensure that important new features can be discovered immediately after they emerge in the network.
Algorithm 5 SNF: Streaming network feature selection
Input: (1) the network at time points $t$ and $t+1$: $X_t$ and $X_{t+1}$; (2) the candidate feature set: $\mathcal{T}_t$; (3) the F-Score list of $\mathcal{T}_t$: $H_t$; (4) the size of the selected feature set: $m$; and (5) the new feature set $\mathcal{V}_{t+1}$.
Output: the selected feature set $\mathcal{S}_{t+1}$ and the candidate feature set $\mathcal{T}_{t+1}$.
1: Initialize the score list $H_{t+1}$ and generate the updated Laplacian matrix $L^\Delta_{t+1}$;
2: // calculate the F-Score for new features
3: for $f^r \in \mathcal{V}_{t+1}$ do
4:    $q(f^r) \leftarrow \mathbf{f}_r^\top L_{t+1}\, \mathbf{f}_r$
5:    $H_{t+1} \leftarrow q(f^r) \cup H_{t+1}$
6: end for
7: // update the F-Score for old features
8: for $f^r \in \mathcal{T}_t$ do
9:    $q^\Delta_{t+1}(f^r) \leftarrow (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r)^\top L_t\, (\mathbf{f}^{r\prime}_{t+1} - \mathbf{f}_t^r) + (\mathbf{f}_{t+1}^r)^\top L^\Delta_{t+1}\, \mathbf{f}_{t+1}^r$
10:   $q_t(f^r) \leftarrow H_t(f^r)$
11:   $q_{t+1}(f^r) \leftarrow q_t(f^r) + q^\Delta_{t+1}(f^r)$
12:   $H_{t+1} \leftarrow q_{t+1}(f^r) \cup H_{t+1}$
13: end for
14: Sort $H_{t+1}$ in ascending order
15: $\mathcal{S}_{t+1} \leftarrow$ top-$m$ of $H_{t+1}$
16: $\mathcal{T}_{t+1} \leftarrow$ top-$k$ of $H_{t+1}$, where $q(f^k) \le 2\,q(f^m)$
Algorithm 5 lists the detailed SNF algorithm, which incrementally compares scores of
new features and old features in T and selects top-m features to form the final feature set.
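A compact Python sketch of this update loop, assuming dense Laplacians, numpy indicator vectors, and dictionary-based score bookkeeping (all names are hypothetical and the data structures are simplified relative to the thesis):

def delta_score(f_t, f_t1, L_t, L_delta):
    # Changed score of an old feature, Eq. (6.14): the first term covers
    # value changes on old nodes (Part a), the second the changed
    # weights (Part b). All arguments are numpy arrays.
    diff = f_t1[: f_t.shape[0]] - f_t          # f^{r'}_{t+1} - f^r_t
    return float(diff @ L_t @ diff + f_t1 @ L_delta @ f_t1)

def snf_step(H_t, old_t, old_t1, new_t1, L_t, L_t1, L_delta, m):
    # One SNF step (a sketch of Algorithm 5): score new features from
    # scratch, update old candidate features incrementally, then keep
    # the selected set S (top-m) and the candidate set T with
    # q(f^k) <= 2 q(f^m).
    H = {f: float(v @ L_t1 @ v) for f, v in new_t1.items()}
    for f, v_t in old_t.items():
        H[f] = H_t[f] + delta_score(v_t, old_t1[f], L_t, L_delta)
    ranked = sorted(H, key=H.get)              # ascending F-Score
    S = ranked[:m]
    T = [f for f in ranked if H[f] <= 2 * H[S[-1]]]
    return S, T, H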
It is worth noting that the proposed SNF method can efficiently handle three types of
changes in streaming networks: (1) Feature distribution changes: for each feature $f^r$, if its distribution changes from $\mathbf{f}_t^r$ to $\mathbf{f}_{t+1}^r$, the first part of Eq. (6.14) is used to calculate its
changed score; (2) Node addition and structure changes: For new nodes and their associ-
ated edge connections, the second part of Eq. (6.14) will capture the topological structure
changes and the node addition information; and (3) Node deletion: For nodes that are re-
moved at t+1, we can set their feature indicators to 0 to indicate that the nodes have empty
node content, and then use (1) to update feature scores.
Time Complexity Analysis: the time complexity of computing $W_t$ is $O(n_t^3\, l)$, so the total time complexity of computing $q_t(f^r)$ in Algorithm 5 is $O(n_t^3\, l)$. In contrast, if the size of $W^\Delta_{t+1}$ is $n^\Delta_{t+1} \times n^\Delta_{t+1}$, the complexity of computing $q^\Delta_{t+1}(f^r)$ in Algorithm 5 is $O([n^\Delta_{t+1}]^3\, l)$. In most cases, $n^\Delta_{t+1} \ll n_t$, so our update method is very efficient in the streaming network setting.
6.3.2 Node Classification on Streaming Networks
Once the most informative $m$ features are identified at time point $t$ (denoted by $\mathcal{S}_t = \{f^1, f^2, \ldots, f^m\}$), the network nodes can be represented using the selected features as
\[
X_t = [x_1, x_2, \ldots, x_{n_t}] \;\Rightarrow\; X_{\mathcal{S}_t} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_m]^\top \in \mathbb{R}^{m \times n_t}.
\]
Node classification aims to provide accurate labels for the unlabeled nodes in the network at time point $t$. If an unlabeled node $u$ is correctly labeled, $u$ should be correctly positioned relative to the other nodes w.r.t. the label and structure similarity defined in Eq. (6.3). So the quality criterion of Eq. (6.3) for each unlabeled node $u$ at a given time point $t$ can be rewritten as follows:
\[
E(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (D_{\mathcal{S}_t} x_i - D_{\mathcal{S}_t} x_j)^2\, [W^{y_u}]_{ij} \qquad (6.16)
\]
where $y_u \in \mathcal{Y}$, and $W^{y_u}$ is the weight matrix generated from Eq. (6.8) by setting the label of $u$ to $y_u$. $D_{\mathcal{S}_t}$ is a diagonal matrix indicating the features that are selected from $\mathcal{F}$ into the feature set $\mathcal{S}_t$.
Because the quality criterion is only affected by the changed part of the weight matrix, and the changes in node labels only affect the label similarity part, we can define the changed weight matrix as:
\[
[W^{\Delta y_u}]_{ij} =
\begin{cases}
1, & \text{if } j = u, y_i = y_u \text{ or } i = u, y_j = y_u; \\
-\frac{1}{c}, & \text{if } j = u, y_i \ne y_u \text{ or } i = u, y_j \ne y_u; \\
0, & \text{if } j \ne u \text{ and } i \ne u.
\end{cases} \qquad (6.17)
\]
So the quality criterion in Eq. (6.16) can be replaced by
\[
E'(y_u, \mathcal{S}_t) = \frac{1}{2} \sum_{i,j=1}^{n_t} (D_{\mathcal{S}_t} x_i - D_{\mathcal{S}_t} x_j)^2\, [W^{\Delta y_u}]_{ij} \qquad (6.18)
\]
Algorithm 6 SNOC: Streaming Network Node Classification
Input: (1) the network: $X_t$ and $X_{t-1}$; (2) the label list: $\mathcal{Y}_t$; (3) the candidate feature set at time point $t-1$: $\mathcal{T}_{t-1}$; (4) the F-Score list of $\mathcal{T}_{t-1}$: $H_{t-1}$; (5) the size of the selected feature set: $m$; and (6) the new feature set $\mathcal{V}_t$.
Output: the label list for unlabeled data: $\mathcal{Y}^u_t$.
1: $(\mathcal{S}_t, \mathcal{T}_t) \leftarrow \mathrm{SNF}(X_t, X_{t-1}, \mathcal{T}_{t-1}, H_{t-1}, \mathcal{V}_t, m)$
2: Map $X_t$ into $X_{\mathcal{S}_t}$ by using $\mathcal{S}_t$;
3: for each unlabeled node $u$ do
4:    $y_u^* = \arg\min_{y_u \in \mathcal{Y}} \big(\mathrm{tr}([X_{\mathcal{S}_t}]^\top L^{\Delta y_u} X_{\mathcal{S}_t})\big)$
5: end for
Then we can calculate $E'(y_u, \mathcal{S}_t)$ as
\[
\begin{aligned}
E'(y_u, \mathcal{S}_t) &= \frac{1}{2} \sum_{i,j=1}^{n_t} (D_{\mathcal{S}_t} x_i - D_{\mathcal{S}_t} x_j)^2\, [W^{\Delta y_u}]_{ij} \\
&= \mathrm{tr}\big(D_{\mathcal{S}_t}^\top X_t (D^{\Delta y_u} - W^{\Delta y_u}) X_t^\top D_{\mathcal{S}_t}\big) \\
&= \mathrm{tr}\big(D_{\mathcal{S}_t}^\top X_t L^{\Delta y_u} X_t^\top D_{\mathcal{S}_t}\big) \\
&= \mathrm{tr}\big([X_{\mathcal{S}_t}]^\top L^{\Delta y_u} X_{\mathcal{S}_t}\big)
\end{aligned} \qquad (6.19)
\]
where $\mathrm{tr}(\cdot)$ is the trace of a matrix, and $L^{\Delta y_u}$ is the Laplacian matrix of $W^{\Delta y_u}$.

So our target is to select a label for an unlabeled node $u$ that ensures:
\[
\min_{y_u \in \mathcal{Y}} E'(y_u, \mathcal{S}_t) \qquad (6.20)
\]
DEFINITION 23 (SNC) Let $X_{\mathcal{S}_t} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_m]^\top$ represent the mapped network nodes in the selected feature space. Suppose $W^{\Delta y_u}$ is the matrix defined in Eq. (6.17), and $L^{\Delta y_u}$ is the Laplacian matrix defined as $L^{\Delta y_u} = D^{\Delta y_u} - W^{\Delta y_u}$, where $D^{\Delta y_u}$ is a diagonal matrix with $[D^{\Delta y_u}]_{ii} = \sum_j [W^{\Delta y_u}]_{ij}$. We define a labeling criterion, called the streaming network criterion (SNC), for each unlabeled node $u$ as follows:
\[
y_u^* = \arg\min_{y_u \in \mathcal{Y}} \big(\mathrm{tr}([X_{\mathcal{S}_t}]^\top L^{\Delta y_u} X_{\mathcal{S}_t})\big) \qquad (6.21)
\]
Through the SNC criterion, Eq. (6.1) can be achieved by calculating $y_u^*$ for each single unlabeled node. Algorithm 6 lists the detailed process of the proposed streaming network
node classification (SNOC) method, which uses SNF to select a feature space to capture
network changes and then assigns different labels to each unlabeled node by using selected
features as the gauge. The class label of an unlabeled node is the one that results in the
minimal gauging value with respect to the selected features.
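A minimal sketch of this labeling step (Eqs. (6.17) and (6.21)), assuming the convention $y = 0$ for unlabeled nodes, dense numpy matrices, and $X_{\mathcal{S}_t}$ stored as an $m \times n_t$ array; the function names are hypothetical:

import numpy as np

def delta_weight(n, u, labels, y_u, c):
    # W^{Delta y_u} of Eq. (6.17): only entries involving u are nonzero;
    # +1 where a labeled node agrees with the candidate label y_u and
    # -1/c where it disagrees (unlabeled nodes are skipped here, an
    # assumption of this sketch).
    W = np.zeros((n, n))
    for j in range(n):
        if j == u or labels[j] == 0:
            continue
        w = 1.0 if labels[j] == y_u else -1.0 / c
        W[u, j] = W[j, u] = w
    return W

def snc_label(XS, u, labels, classes, c):
    # SNC criterion, Eq. (6.21): choose the label minimizing the trace
    # criterion; with XS stored as (m x n) we evaluate tr(XS L XS^T).
    best, best_val = None, np.inf
    for y in classes:
        W = delta_weight(XS.shape[1], u, labels, y, c)
        L = np.diag(W.sum(axis=1)) - W
        val = np.trace(XS @ L @ XS.T)
        if val < best_val:
            best, best_val = y, val
    return best

XS = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])  # m=2 features, n=3 nodes
print(snc_label(XS, u=2, labels=[1, 2, 0], classes=[1, 2], c=2))  # -> 2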
6.4 EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the efficiency and effective-
ness of SNOC for node classification in static and streaming networks.
6.4.1 Experimental Settings
We validate the performance of SNOC on the following four real-world networks.
Cora1 is a citation network with 2,708 publications (i.e. nodes) classified into one
of seven classes. The citation relationships are captured in 5,429 links. The node
content is described by a 0/1-valued word vector indicating the absence/presence of
the corresponding word from a dictionary of 1,433 unique words.
CiteSeer1 consists of 3,312 scientific publications classified into one of six classes.
The network consists of 4,732 links. Each publication in CiteSeer is described by a
0/1-valued word vector from a dictionary with 3,703 unique words.
PubMed Diabetes1 network consists of 19,717 publications (nodes) from PubMed
database pertaining to three types of diabetes. It has 44,338 links. Each paper is de-
scribed by a TF-IDF weighted word vector from a dictionary containing 500 unique
words.
DBLP2 network contains 2,084,055 papers and 2,244,018 citation relationships. We
separate papers into six classes: DataBases, Artificial Intelligence, Hardware and
Architecture, Applications and Media, System Technology, and others. Each paper
is denoted by a 0/1-valued word vector from a dictionary with 3,000 words.
1 http://linqs.cs.umd.edu/projects//projects/lbc/index.html
2 http://arnetminer.org/citation
To evaluate the performance of SNOC for streaming networks, we first test the algorithm performance on static networks by using three networks (Cora, CiteSeer and PubMed Diabetes). After that, we use the DBLP and PubMed Diabetes networks as our streaming network test bed (the CiteSeer and Cora networks are too small for testing in streaming network settings). DBLP is inherently a streaming network, because publications are continuously updated and the top keywords (node features) are also continuously changing with respect to time. For the DBLP network in the streaming network setting, we
choose 2,000 publications for each year and build a streaming network covering the time
period from 1991 to 2010. In addition, we also use PubMed Diabetes network to simulate
a streaming network with 1,000 random nodes to be included for each time point t (the
experiments include a total of 15 time points).
For most experiments, we randomly label 40% of the nodes in the network and use the remaining nodes as test data (this is a reasonable setting because real-world networks always have more unlabeled nodes than labeled ones). In addition, we also report the algorithm performance with respect to different percentages of training/test nodes (detailed in Fig. 6-6(c)). For the streaming network experiments, the accuracy is tested on the new nodes arriving at each time point. The default size of the selected feature set is m = 100, the default
value of the weight parameter is ξ = 0.7, and the default maximal path length in Eq. (6.6) is set to l = 3.
Baseline Methods: We compare the performance of SNOC with four baselines:
Information Gain+SVM (IG+SVM): This method ignores link structures in the network
and uses Information Gain (IG) to select the top-m features from all nodes (using content
information in the original bag-of-feature representation). LIBSVM [10] is used as the
learning algorithm to train classifiers for node classification.
Link Structure+SVM (LS+SVM): This method ignores label information of labeled nodes
and only uses structure similarity to construct the weight matrix (W) and then calculates
the feature score in a similar way as SNF. LIBSVM is also used as the learning algorithm
to train classifiers for node classification.
Collective Classification (GS+LR): This method refers to the combined classification of
interlinked objects including the correlation between node label and node content.

Figure 6-5: The accuracies on three real-world static networks w.r.t. different numbers of selected features (from 50 to 300).

In our
experiments, we use the collective classification method [61], which uses a simplified version
of Gibbs sampling (GS) as the approximate inference procedures for networked data, with
Logistic Regression (LR) being used as classifiers for node classification.
DYCOS: This is a recently proposed method that combines text content and links for net-
work node classification [2]. It is considered the state-of-the-art classification method in
streaming networks. A random walk approach in conjunction with the content of the net-
work is used for node classification. This results in a new approach to handling variations in content and linkage structures. The Gini index is used to select features in this method.
All experiments are conducted on a cluster machine with 16 GB RAM and an Intel Core i7 3.20 GHz CPU.
6.4.2 Performance on Static Networks
Table 6.1 reports the performance of different methods on three static networks (Cora, Cite-
Seer and PubMed Diabetes). The results show that SNOC outperforms the other four baseline
methods on all three networks with significant performance gain. This is mainly attributed
to SNOC’s integration of network topology structure and node labels to explore features
for node classification. Although DYCOS indeed considers network linkage information
and GS+LR considers the correlation between node labels and node content, they do not
take into account the impact of deep structure information in both the feature selection and classification processes. So their performance is inferior to SNOC. Noticeably, even though
DYCOS takes structure information into account, the actual contributions of label similarity and structure similarity have not been optimized in those methods to achieve the best feature selection results for networked data. This partially explains why the accuracies of DYCOS cannot match IG+SVM on the PubMed data set. In comparison, SNOC considers both labeled and unlabeled nodes, and combines node label similarity and node structure similarity to find effective features. All these designs help SNOC outperform all the other baseline methods.

Figure 6-6: The accuracy on three networks w.r.t. (a) different maximal lengths of path l (from 1 to 5), (b) different values of weight parameter ξ (from 0 to 1), and (c) different percentages of labeled nodes.
Table 6.1: Accuracy Results on Static Networks.

Data sets   Cora              CiteSeer          PubMed
IG+SVM      50.34% ± 1.42%    57.21% ± 1.59%    65.24% ± 1.33%
LS+SVM      27.37% ± 2.85%    39.64% ± 2.66%    43.06% ± 2.75%
DYCOS       53.57% ± 1.24%    64.38% ± 1.29%    64.53% ± 1.86%
GS+LR       55.17% ± 1.09%    65.93% ± 2.37%    72.88% ± 2.05%
SNOC        62.66% ± 1.57%    73.81% ± 1.46%    81.09% ± 2.37%
In Fig. 6-5, we report the algorithm performance with respect to different numbers of
selected features on three networks. Overall, SNOC achieves the highest accuracy gain on
all three networks with different feature sizes. LS+SVM has the lowest accuracies because
network structure alone provides very little useful information (compared to the node con-
tent) for node classification. SNOC and GS+LR have the highest accuracies on Cora and
CiteSeer networks when selecting m = 100 features, and on PubMed Diabetes data set
when m = 50, whereas DYCOS’s accuracies decrease with the increasing of the feature
126
size on all three data sets. The accuracies of all methods become close to each other with
the number of selected features continuously increase. This is because that including more
features may introduce interference and dilute the significance of important node features,
so the benefit of feature selection becomes less significant. Because SNOC balances the label and structure information in the feature space for node classification, it still outperforms the other baseline methods.
In Fig. 6-6(a), we report the accuracies w.r.t. different maximal path lengths used to calculate Eq. (6.6). The results show that the accuracies decrease if the path lengths are too long. This is because, even though the paths between two nodes are relevant to the node structure similarity, if the path length is too long, the similarity may be deteriorated by special paths such as cycles and become inaccurate in capturing the node similarity.
In Fig. 6-6(b), we report the algorithm performance w.r.t. the changing values of weight
parameter ξ. According to the definition in Eq. (6.7), ξ is used to balance the contribution
of network structures and node labels. The results from Fig. 6-6(b) show that node labels
play a more important role than network structures. For the Cora network, the accuracy reaches its peak at ξ = 0.6, while the highest accuracies appear at ξ = 0.7 for both the CiteSeer and PubMed data. This suggests that network structures and node labels have
different contributions to feature selection for networks from different domains. In order to
achieve the best performance, users may need to carefully choose a suitable weight value
for different networks.
In previous experiments, the percentage of labeled nodes is fixed to 40% of the net-
work. In reality, the percentage of labeled nodes in networks may vary significantly, so
in this subsection, we study the performance of all methods on networks with different
percentages of labeled nodes (due to page limitations, we only report the results on Cora
network).
The results in Fig. 6-6(c) show that when the number of labelled nodes in the network
increases, all methods achieve accuracy gains. After the majority of the network nodes
are labeled, all methods except LS+SVM achieve similar accuracies. This is because
labeled nodes provide sufficient content information for classification.

Figure 6-7: The accuracy on streaming networks: (a) accuracy on the DBLP citation network from 1991 to 2010, (b) accuracy on the PubMed Diabetes network for 15 time points, and (c) accuracy on the extended DBLP citation network from 1991 to 2010.

Interestingly, our results show that when the network contains a small percentage of labeled nodes, e.g. 30% or less, SNOC can achieve much more significant accuracy gains compared to other meth-
ods. This observation indicates that SNOC is more suitable for networks with very few
labeled nodes. This is mainly attributed to the fact that SNOC can integrate node labels and
network structures (which also include unlabeled nodes) to find the most effective features to characterize the network node content and topology information. In addition, the similarity gauging process also tries to find optimal node labels for unlabeled nodes to ensure that the distances evaluated in the feature space are consistent with the network structures.
6.4.3 Performance on Streaming Networks
Because only DYCOS and GS+LR are designed for classifying networked data, in the
following, we only compare SNOC with DYCOS and GS+LR on streaming networks.
In Figs. 6-7 (a) and (b), we report accuracies on DBLP and PubMed networks in a
streaming network setting. In addition, Fig. 6-8 further reports the runtime of different
methods. Because GS+LR is designed for static networks, it needs to be rerun at each time
point. Both DYCOS and SNOC can handle streaming networks.
For the DBLP network, the results show that the proposed SNOC outperforms all the other methods in the streaming network setting. An exception occurs in 1997, where the GS+LR method, which is far more time-consuming (as shown in Fig. 6-8), matches SNOC. This shows that a good balance between node content and network structure is very important for node classification. Although GS+LR emphasizes node content information and DYCOS emphasizes
on network structures, both of them, however, fail to capture the changes in streaming net-
works. Meanwhile, the runtime performance in Fig. 6-8 shows that DYCOS is as fast as
SNOC but its accuracy is inferior because DYCOS uses random walks to predict node labels. Because random walks are inherently uncertain and introduce much randomness into the classification process, the node classification results of DYCOS are inferior to
both SNOC and GS+LR. Meanwhile, as the time step t continuously increases, the runtime curve of DYCOS grows much more quickly than SNOC's. This is because SNOC only needs to
consider the changed part of the network for both node classification and feature selection.
Although GS+LR obtains better accuracies compared to DYCOS, it is much more time-
consuming compared to SNOC and DYCOS. This is mainly because GS+LR is an iterative
algorithm designed for static networks, so it needs to be rerun at each time step.
To validate the performance of different methods on streaming networks with all types of changes (including dynamically changing node features and the addition and deletion of nodes and
edges), we allow each node (i.e., paper) to include its reference’s title into the node content.
For example, if a paper pi is cited by a paper pj at a particular time point t, we will include
pj’s title into node pi’s content. By doing so, we can introduce dynamic changing features
to nodes in the network. In addition, we also continuously remove old papers in the network
to maintain papers published within a five-year period. This is will result in node/edge
deletion and feature removal for the whole network. All these settings result in a highly
complicated streaming network setting for node classification. We denote this network
as full streaming DBLP network, and report the results in Fig. 6-7(c). The results show
that SNOC clearly outperforms all other methods in this complicated network setting. More specifically, in year 1996, the accuracies of all methods deteriorate with significant drops. This is mainly because 1996 is the first time that old nodes are removed from the network. Our method achieves the smallest decline slope compared to the other two methods.
Interestingly, when comparing the results in Fig. 6-7(a) and Fig. 6-7(c), we can find
that the average accuracies of SNOC and GS+LR on networks containing all publications
(Fig. 6-7(a)) are higher than the accuracies on networks only containing publications with
a five-year span (Fig. 6-7(c)). Notice that the former contains a much higher node and edge
density in the network, so when the same sets of nodes are given for classification, the rich
topology structures in a dense network will help the algorithm improve the node classification
accuracies. For DYCOS, its average accuracy in Fig. 6-7(a) is 1.5% lower than the average
accuracy in Fig. 6-7(c). Notice that DYCOS uses random walks for node classification. For
dense networks, the random walks will contain many irrelevant paths, which deteriorate
the classification accuracy. As a result, its accuracy on the five-year-span networks is actually
better than the accuracy on the whole networks.
Figure 6-8: The cumulative runtime on the DBLP and PubMed Diabetes networks, corresponding to Fig. 6-7.
6.4.4 Case Study
In Fig. 6-9, we use a case study to demonstrate the performance of the three methods
(SNOC, DYCOS and GS+LR) in handling cases with abrupt network changes. In our experiments, from time points 1 to 3, the network only contains nodes from four classes
(Hardware and Architecture, Applications and Media, System Technology, and others).
From time points 4 to 6, nodes from a new class (DataBases) are included into the network
(including unlabeled nodes). From time points 7 to 9, new nodes from another new class
(Artificial Intelligence) are introduced into the network.
The results in Fig. 6-9 show that, due to the abrupt inclusion of new class nodes, the
accuracies of all methods decrease. When nodes from the new class continuously arrive,
SNOC’s accuracy can quickly recover, because SNF in SNOC can find ideal features to
represent changes in the network and use these features to adjust the node classification.
As a result, SNOC can adapt to the changes in the network for node classification.
Figure 6-9: Case study on DBLP citation network.
In summary, our experiments confirm that, for streaming networks, using features to
capture changes and further classifying nodes by assessing the consistency of node labels
and the network structures can provide effective and efficient solutions for node classifica-
tion.
Chapter 7
Conclusion
7.1 SUMMARY OF THIS THESIS
In this thesis, we have studied mining problems from the view of complex structure data,
where instances (nodes) are not only characterized by the content but are also subject to
dependency relationships. Moreover, with the fast development of information technology, much current real-world data is characterized by dynamic changes. Accordingly, this thesis explores instance correlation in complex structure data and utilizes it to
make mining tasks more accurate and applicable. Our objective is to combine node correla-
tion with node content and utilize them for three different tasks, including (1) graph stream
classification, (2) super-graph classification and clustering, and (3) streaming network node
classification.
More specifically, in Chapter 3, we presented an empirical study to reveal the roles
of sub-graph features for graph classification. Existing research has commonly agreed
that finding discriminative sub-graphs to represent graphs is one of the main challenges for
graph classification. Yet there is no comprehensive study of (1) the genuine relationship between sub-graphs and the classification accuracy, and (2) the actual difference between sub-graphs discovered from an expensive mining process (such as frequent sub-graph mining) and sub-graphs from simple approaches (such as random sub-graphs). In this thesis, we
empirically validated the relationship between sub-graphs and graph classifiers by vary-
ing (1) the sub-graph feature sizes; (2) the size of the sub-graph feature set; (3) different
learning algorithms; and (4) different benchmark datasets. We characterized sub-graphs
discovered from different approaches (including random sub-graphs, frequent sub-graphs,
and frequent sub-graphs selected from Information Gain) by their size (i.e. number of
edges) and by their number (i.e. number of sub-graphs in a set), and validated the perfor-
mance of classifiers trained from these sub-graphs on seven benchmark graph datasets from
three domains. Our study drew a number of important findings, which provide a clear view of the relationship between sub-graphs and graph classification.
In Chapter 4, we proposed to address graph stream classification. We argued that
in graph stream scenarios, the data volumes and the structures of the graphs may con-
stantly change. The existing sub-graph feature-based representation model is not only
inefficient but also ineffective for graph stream classification. This is because the min-
ing of the sub-graph features is time-consuming and the existing occurrence-based graph
representation model will result in significant information loss and will make sub-graph
features ineffective for represent graph data. To solve the problem, we proposed a graph
factorization-based fine-grained representation model, where the main objective is to use
linear combinations of a set of discriminative cliques to represent graphs for learning. The
optimization-oriented factorization approach ensures minimum information loss for graph
representation, and also avoids the expensive sub-graph isomorphism validation process.
Based on this idea, we proposed a novel framework for fast graph stream classification.
Experiments on two real-world graph streams validated the proposed design for effective
graph stream classification.
In Chapter 5, we first formulated a new super-graph classification problem. Due to the
inherent complexity of the structure representation, existing graph classification methods cannot be applied to super-graph classification. In this thesis, we proposed a weighted random
walk kernel which calculates the similarity between two super-graphs by assessing (a) the
similarity between super-nodes of the super-graphs, and (b) the common walks of the super-
graphs. Our key contribution is twofold: (1) a weighted random walk kernel considering
node and structure similarities between graphs; and (2) an effective kernel-based super-
graph classification method with sound theoretical basis.
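For reference, kernels of this family are commonly written over the direct product graph of the two inputs. The following LaTeX form is a generic sketch under that formulation; the exact weighting scheme of the kernel in Chapter 5 may differ:

    K(G_1, G_2) \;=\; \sum_{k=0}^{\infty} \lambda^{k} \, q^{\top} \bigl( S \odot W_{\times} \bigr)^{k} \, p

where W_× is the adjacency matrix of the direct product graph of G_1 and G_2, S applies the pairwise super-node similarities as entry-wise weights (⊙ denotes the Hadamard product), p and q are starting and stopping probability distributions over node pairs, and 0 < λ < 1 is a decay factor that guarantees convergence of the series.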
In Chapter 6, we proposed a novel node classification method for streaming networks. We argued that for networks with continuous changes in structure and node content, features are the most effective tool to capture such changes. Accordingly, we proposed to take both the network topology structure and the node labels into consideration in order to find an optimal subset of features to represent the network. Based on the selected features, a streaming network node classification method, SNOC, is proposed to classify unlabeled nodes by minimizing the combination of the similarity distance in the network and the feature-based distance between nodes. Experiments and comparisons demonstrate that SNOC is able to capture emerging changes and outperforms baseline approaches with significant performance gains. The key innovation of this work compared to existing methods is twofold: (1) a new node classification method for handling streaming networks; and (2) a streaming feature selection method for networked data, in which the selected features capture both node content and network structure dependency.
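A highly simplified analogue of the classification rule can be sketched in Python as follows: an unlabeled node receives the label of the labeled node that minimizes a weighted sum of its network distance and its feature-based distance over the selected feature subset. SNOC itself solves a joint optimization, so the nearest-neighbor form, the trade-off parameter alpha, and the input layout here are illustrative assumptions only.

    import numpy as np

    def classify_node(u, labeled_nodes, labels, net_dist, feats, alpha=0.5):
        # net_dist[u, v]: structural distance in the network (e.g. shortest path);
        # feats[v]: node v restricted to the selected feature subset.
        def combined_distance(v):
            return (alpha * net_dist[u, v]
                    + (1.0 - alpha) * np.linalg.norm(feats[u] - feats[v]))
        nearest = min(labeled_nodes, key=combined_distance)
        return labels[nearest]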
7.2 FUTURE WORK
Even though we have proposed several mining methods for complex structure data, applying mining algorithms to very large-scale problems and dynamic settings still poses challenges: (1) how to find an effective representation for high-dimensional, large-scale complex structure data so that it fits in memory; (2) how to capture dynamic changes in complex structure data so as to adjust the mining results; and (3) how to balance manifold information to achieve effective performance. To the best of our knowledge, no existing work combines large-scale learning with dynamic learning. In our future work, we will focus on designing mining algorithms and approaches that are faster, more data-efficient, and less demanding in computational resources, in order to achieve scalable algorithms for large-scale problems.
More specifically, we will extend the super-graph classification method to the clustering problem on huge super-graphs. To discover clusters from a large network/graph, existing methods follow three common approaches: (1) structure-based clustering, using node connectivity only [54, 79, 64]; (2) attribute-based clustering, using node attribute similarity [73, 86]; and (3) structural and attribute clustering, combining node attribute and node connectivity similarities [14, 80, 89]. All the above methods are, however, inapplicable to super-graph clustering, mainly because they cannot take the internal structures inside super-nodes into account. Indeed, super-nodes may share overlapping/intersecting structures that help assess the similarity between super-nodes, and the inter-connected structures between super-nodes also provide useful structural information for clustering. The structure dependencies within and between super-nodes therefore require a clustering algorithm that considers both the internal and the external structures of nodes for super-graph clustering.
To cluster a super-graph, the main challenge is to properly calculate the similarities between super-nodes by considering both the internal structure inside each super-node and the inter-connectivity between super-nodes. The complex super-node structure, in which each node is itself another graph, makes this a very challenging problem. These challenges motivate our research into combining structural and content similarity between super-nodes for clustering, as sketched below.
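One plausible way to blend the two signals, offered here as an assumption rather than the method we will ultimately develop, is a weighted combination of internal structural overlap and super-node content similarity:

    import numpy as np

    def supernode_similarity(edges1, edges2, attrs1, attrs2, beta=0.5):
        # Structural signal: Jaccard overlap of the two internal edge sets,
        # a simple stand-in for shared/intersected internal sub-structures.
        union = len(edges1 | edges2)
        s_struct = len(edges1 & edges2) / union if union else 0.0
        # Content signal: cosine similarity of super-node attribute vectors.
        a1, a2 = np.asarray(attrs1, float), np.asarray(attrs2, float)
        denom = np.linalg.norm(a1) * np.linalg.norm(a2)
        s_content = float(a1 @ a2) / denom if denom else 0.0
        # beta balances structure against content.
        return beta * s_struct + (1.0 - beta) * s_content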
In the future, we will also focus on finding cluster structures for networked instances and discovering representative features for each cluster. This constitutes a special co-clustering task that is useful for many real-world applications, such as the automatic categorization of scientific publications and the identification of representative keywords for each cluster. To date, although co-clustering has been commonly used to find clusters of both instances and features, all existing methods focus on instance-feature relationships, without realizing that the topology structures between instances provide very valuable information to help boost co-clustering performance. We will try to propose a co-clustering method that ensures the final cluster structures are consistent, with minimum error, across all three information sources: instances, features, and the network topology.
Bibliography
[1] Charu C. Aggarwal. On classification of graph streams. In Proceedings of the Eleventh SIAM International Conference on Data Mining (SDM’11), pages 652–663, 2011.
[2] Charu C. Aggarwal and Nan Li. On node classification in dynamic content-based
networks. In SDM, pages 355–366, 2011.
[3] Charu C. Aggarwal and Haixun Wang. Managing and mining graph data. Springer,
2010.
[4] Ralitsa Angelova and Gerhard Weikum. Graph-based text classification: learn from
your neighbors. In ACM SIGIR, pages 485–492, 2006.
[5] Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan,
Alex J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels.
Bioinformatics, 21(1):47–56, 2005.
[6] Horst Bunke. Recent developments in graph matching. In Proceedings of the 15th International Conference on Pattern Recognition, pages 117–124, 2000.
[7] Horst Bunke and Kaspar Riesen. Graph classification based on dissimilarity space
embedding. In Structural, Syntactic, and Statistical Pattern Recognition, 2008.
[8] Yandong Cai, Nick Cercone, and Jiawei Han. An attribute-oriented approach for
learning classification rules from relational databases. In ICDE, pages 281–288, 1990.
[9] Jérôme Callut, Kevin Françoisse, Marco Saerens, and Pierre Dupont. Classification
in graphs using discriminative random walks. In MLG, 2008.
[10] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.
TIST, 2(3), 2011.
[11] Yixin Chen, Guozhu Dong, Jiawei Han, Benjamin W. Wah, and Jianyong Wang.
Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 323–334, 2002.
[12] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent
pattern analysis for effective classification. In ICDE, 2007.
137
[13] Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu. Discriminative frequent
pattern-based graph classification. In Link Mining: Models, Algorithms, and Applications, 2010.
[14] Hong Cheng, Yang Zhou, and Jeffrey Xu Yu. Clustering large attributed graphs: A
balance between structural and attribute similarities. ACM TKDD, 5(2), 2011.
[15] Lianhua Chi, Bin Li, and Xingquan Zhu. Fast graph stream classification using dis-
criminative clique hashing. In Advances in Knowledge Discovery and Data Mining (PAKDD’13), pages 225–236, 2013.
[16] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning,
1995.
[17] Andrew D. J. Cross, Richard C. Wilson, and Edwin R. Hancock. Inexact graph match-
ing using genetic search. Pattern Recognition, 30(6):953–970, 1997.
[18] Thiago Henrique Cupertino and Liang Zhao. Bias-guided random walk for network-
based data classification. In Advances in Neural Networks, pages 375–384, 2013.
[19] Hans Dietmar Gröger. On the randomized complexity of monotone graph properties.
Acta Cybernetica, 10(3):119–127, 1992.
[20] Paul D. Dobson and Andrew J. Doig. Distinguishing enzyme structures from non-
enzymes without alignments. Journal of molecular biology, 2003.
[21] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’00), pages 71–80, 2000.
[22] Ehud Gudes, Solomon Eyal Shimony, and Natalia Vanetik. Discovering frequent graph patterns using disjoint paths. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[23] Terrence S. Furey, Nello Cristianini, Nigel Duffy, David W. Bednarski, Michèl
Schummer, and David Haussler. Support vector machine classification and vali-
dation of cancer tissue samples using microarray expression data. Bioinformatics,
16(10):906–914, 2000.
[24] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit dis-
tance. Pattern Analysis and applications, 13(1):113–129, 2010.
[25] Thomas Gärtner, Peter A. Flach, and Stefan Wrobel. On graph kernels: Hardness
results and efficient alternatives. In COLT, pages 129–143, 2003.
[26] Ray Reagans and Bill McEvily. Network structure and knowledge transfer: The effects of cohesion and range. Administrative Science Quarterly, 48(2):240–267, 2003.
[27] Gregory Gutin, Anders Yeo, and Alexey Zverovich. Traveling salesman should not be
greedy: Domination analysis of greedy-type heuristics for the TSP. Discrete Applied Mathematics, 2002.
[28] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. Frequent pattern mining: Cur-
rent status and future directions. Data Mining and Knowledge Discovery, 2007.
[29] Johan Himberg, Kalle Korpiaho, Heikki Mannila, and Johanna Tikanmäki. Time
series segmentation for context recognition in mobile devices. In Proceedings of the IEEE International Conference on Data Mining, pages 203–210, 2001.
[30] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. Marginalized kernels between
labeled graphs. In ICML, 2003.
[31] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for pre-
dictive graph mining. In ACM SIGKDD, 2004.
[32] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the
presence of isomorphism. In ICDM, 2003.
[33] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm
for mining frequent substructures from graph data. In PKDD, 2000.
[34] David Jensen, Jennifer Neville, and Brian Gallagher. Why collective inference im-
proves relational classification. In SIGKDD, 2004.
[35] Ning Jin, Calvin Young, and Wei Wang. GAIA: Graph classification using evolution-
ary computation. In ACM SIGKDD, 2010.
[36] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[37] Hisashi Kashima and Akihiro Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, volume 2002, 2002.
[38] Kaspar Riesen and Horst Bunke. Cluster ensembles based on vector space embed-
dings of graphs. Multiple Classifier Systems, 2009.
[39] Kaspar Riesen and Horst Bunke. Graph classification by means of Lipschitz embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2009.
[40] Kaspar Riesen and Horst Bunke. Graph Classification and Clustering Based on Vector Space Embedding. World Scientific Publishing Co., Inc., 2010.
[41] Xiangnan Kong, Wei Fan, and Philip S. Yu. Dual active feature and sample selection
for graph classification. In ACM SIGKDD, 2011.
[42] Xiangnan Kong and Philip S. Yu. Semi-supervised feature selection for graph clas-
sification. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 793–802, 2010.
[43] Taku Kudo, Eisaku Maeda, and Yuji Matsumoto. An application of boosting to graph
classification. Advances in neural information processing systems, 2004.
[44] Solomon Kullback. Information theory and statistics. Courier Dover Publications,
1968.
[45] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In ICDM,
2001.
[46] Bin Li, Xingquan Zhu, Lianhua Chi, and Chengqi Zhang. Nested subtree hash kernels
for large-scale graph classification over streams. In 12th International Conference on Data Mining (ICDM’12), pages 399–408, 2012.
[47] Geng Li, Murat Semerci, Bülent Yener, and Mohammed J. Zaki. Graph classification
via topological and label attributes. In 9th Workshop on Mining and Learning with Graphs (with SIGKDD), 2011.
[48] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic represen-
tation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 2–11, 2003.
[49] Qing Lu and Lise Getoor. Link-based classification. In ICML, pages 496–503, 2003.
[50] Xiangfeng Luo, Zheng Xu, Jie Yu, and Xue Chen. Building association link network
for semantic link on web resources. IEEE Transactions on Automation Science and Engineering, 8(3):482–494, 2011.
[51] J. McAuley and J. Leskovec. Image labelling on a network: using social-network
metadata for image classification. In ECCV, volume 4, pages 828–841, 2012.
[52] Kenrick Mock. An experimental framework for email categorization and manage-
ment. In SIGIR, pages 392–393, 2001.
[53] Elena Nabieva, Kam Jim, Amit Agarwal, Bernard Chazelle, and Mona Singh. Whole-
proteome prediction of protein function via graph-theoretic analysis of interaction
maps. Bioinformatics, 21(Suppl.1):i302–i310, 2005.
[54] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in
networks. Phys. Rev. E, 69(2):026113, 2004.
[55] H. J. Nussbaumer. Fast Fourier transform and convolution algorithms. Springer Series in Information Sciences, 2, 1982.
[56] Sayan Ranu and Ambuj K. Singh. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In Proceedings of the 2009 International Conference on Data Engineering, pages 844–855, Shanghai, China, March 2009.
[57] John W. Raymond, Eleanor J. Gardiner, and Peter Willett. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6):3–35, 2002.
[58] Hiroto Saigo, Nicole Krämer, and Koji Tsuda. Partial least squares regression for
graph mining. In Proceedings of the 14th ACM SIGKDD international conference onKnowledge discovery and data mining, pages 578–586, 2008.
[59] Hiroto Saigo, Sebastian Nowozin, Tadashi Kadowaki, Taku Kudo, and Koji Tsuda.
gBoost: a mathematical programming approach to graph classification and regression.
Machine Learning, 75(1):69–89, 2009.
[60] Bernhard Schölkopf and Alexander J. Smola. Learning with kernels. MIT Press,
2002.
[61] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, and Lise Getoor. Collective classifi-
cation in network data. In Encyclopedia of Machine Learning, 2010.
[62] D. Serre. Matrices: Theory and Applications. Springer, New York, 2002.
[63] Nino Shervashidze and Karsten M. Borgwardt. Fast subtree kernels on graphs. Advances in Neural Information Processing Systems, 22:1660–1668, 2009.
[64] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[65] Aaron Smalter, Jun Huan, Y. Jia, and Gerald Lushington. GPD: a graph pattern dif-
fusion kernel for accurate graph classification with applications in cheminformatics.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(2):197–
207, 2010.
[66] R. R. Sokal and F. J. Rohlf. Biometry: the principles and practice of statistics in biological research. W.H. Freeman & Co Ltd, New York, 1981.
[67] Srinath Srinivasa and Sujit Kumar. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proceedings of the 29th international conference on Very large data bases, volume 29, pages 975–986, 2003.
[68] Jacopo Staiano, Bruno Lepri, Nadav Aharony, Fabio Pianesi, Nicu Sebe, and Alex
Pentland. Friends don't lie: inferring personality traits from social network structure.
In Ubicomp, pages 321–330, 2012.
[69] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’01), pages 377–382,
2001.
[70] Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Detecting spammers
on social networks. In ACSAC, pages 1–9, 2010.
[71] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in
dynamic multi-mode networks. In ACM SIGKDD, 2008.
[72] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in
dynamic multi-mode networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’08), pages 677–
685, 2008.
[73] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for
graph summarisation. In ACM SIGMOD, pages 567–580, 2008.
[74] Roger Ming Hieng Ting and James Bailey. Mining Minimal Contrast Subgraph Pat-terns. University of Melbourne, Department of Computer Science and Software En-
gineering, 2007.
[75] N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from
semistructured data. In Proceedings of the 2002 IEEE International Conference on Data Mining, pages 458–465, 2003.
[76] Joshua T. Vogelstein, William R. Gray, R. Jacob Vogelstein, and Carey E. Priebe.
Graph classification using signal-subgraphs: Applications in statistical connectomics.
2011.
[77] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data
streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’03), pages
226–235, 2003.
[78] Xindong Wu, Kui Yu, Wei Ding, Hao Wang, and Xingquan Zhu. Online feature se-
lection with streaming features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1178–1192, 2013.
[79] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. SCAN: a
structural clustering algorithm for networks. In KDD, pages 824–833, 2007.
[80] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based
approach to attributed graph clustering. In Proc. of ACM SIGMOD, pages 505–516,
2012.
[81] Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu. Mining significant graph
patterns by leap search. In ACM SIGMOD, 2008.
[82] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In
ICDM, 2003.
[83] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based
approach. In ACM SIGMOD, 2004.
[84] Xifeng Yan, Philip S. Yu, and Jiawei Han. Substructure similarity search in graph
databases. In ACM SIGMOD, 2005.
[85] Eng-Hui Yap, Tyler Rosche, Steve Almo, and Andras Fiser. Functional clustering
of immunoglobulin superfamily proteins with protein-protein interaction information
calibrated hidden Markov model sequence profiles. 426:945–961, 2013.
[86] Xiaoxin Yin, Jiawei Han, and Philip S. Yu. Cross-relational clustering with user’s
guidance. In ACM SIGKDD, pages 344–353, 2005.
[87] Yuchen Zhao, Xiangnan Kong, and Philip S. Yu. Positive and unlabeled learning for
graph classification. In IEEE 11th International Conference on Data Mining, pages
962–971, 2011.
[88] Zheng Zhao and Huan Liu. Semi-supervised feature selection via spectral analysis.
In SDM, pages 641–646, 2007.
[89] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on struc-
tural/attribute similarities. In Proceedings of the VLDB Endowment, volume 2, pages 718–729, 2009.
[90] Yunyue Zhu and Dennis Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th international conference on Very Large Data Bases, pages 358–369, 2002.