Fast Identification of Structured P2P Botnets using Community Detection Algorithms
A THESIS
SUBMITTED FOR THE DEGREE OF
Master of Science (Engineering)
IN THE FACULTY OF ENGINEERING
by
Bharath Venkatesh
Supercomputer Education and Research Centre
Indian Institute of Science
BANGALORE – 560 012
July 2013
Acknowledgements
I wish to express my sincerest gratitude to my research supervisor Prof. N. Balakrishnan. His
mastery of diverse subjects, thoughtful guidance and ideas opened up a completely new vista
of knowledge for me and reshaped my way of thinking. His hard work and dedication to science
make him my role model, whom I will always look up to, and I will always cherish the invaluable
time spent with him. I am also indebted for life to him for his utmost support, encouragement and
inspiration throughout the period. It is because of him that I was able to move to the exciting
world of Computer Science.
I thank Prof. R. Govindarajan, the chairman of SERC, and my course advisors who have
helped me immensely during the entire course of my stay in IISc. I always feel fortunate for
the lifetime opportunity to work in this institute alongside all eminent scientists and stay in
this wonderful campus.
I thank Shishir Nagaraja for providing the datasets used in this thesis, as well as implementation
guidelines and source code for the BotGrep algorithm. I also thank him for helpful discussions
related to this work. I also thank Prof. Virgilio Almeida, Ponnurangam K and all others who visited
our lab; they enhanced my general exposure and provided suggestions to improve the quality and
presentation of my work.
I am very grateful to Ms. Nagarathna, Ms. Swarna and Mr. Ravi for all their support through-
out my tenure. I also thank SERC and the Information Systems Lab for providing us the best
computing facilities. I am very thankful
I always feel very fortunate to have been a part of a wonderful family during my stay in the lab,
which has been my home away from home. I would like to especially thank Sudip who has
had a great influence on my approach to research and life. My heartfelt thanks to Naimisha,
who has been the big sister I have always wanted. Without her, this thesis would have never
taken shape. Special thanks to Saradha who completes this amazing gang of four which was
one of the most important parts of my life in IISc. Nikhil has also helped me in several areas
of my work and has been great company. I would also like to thank Pritam, Prashant, Venkat,
Indira Ma’am, Negi, Nivedita and all other members of Information Systems Lab who have
continuously supported me. I am indebted to all of them for providing a healthy atmosphere,
stimulating and fun environment to learn and grow. I will always cherish the moments spent
with them.
My heartfelt appreciation to Aravind, Kamala, Gopal, Hari K, Ashwin and all other friends
from IISc for extending a helping hand and making my stay at IISc memorable for a lifetime.
I would also like to acknowledge the tremendous support received from Satyaki, Abhilash,
Prasanth, Lavanya and many other friends outside IISc.
Lastly, but most importantly, I greatly thank my parents for their unconditional support and
love throughout. They mean the world to me. I would also like to thank Mrs Padmavathy
Sundarrajan, a newfound friend who helped keep my spirits up, especially when it was most
needed.
Abstract
Botnets are a global problem, and effective botnet detection requires cooperation of large In-
ternet Service Providers, allowing near global visibility of traffic that can be exploited to detect
them. The global visibility comes with huge challenges, especially in the amount of data that
has to be analysed. To handle such large volumes of data, a robust and effective detection
method is the need of the hour and it must rely primarily on a reduced or abstracted form of
data such as a graph of hosts, with the presence of an edge between two hosts if there is any
data communication between them. Such an abstraction would be easy to construct and store,
as very little of the packet needs to be looked at.
Structured P2P command and control mechanisms have been shown to be robust against targeted and
random node failures, and are thus ideal mechanisms for botmasters to organize and command their
botnets effectively. This thesis therefore develops a scalable, efficient and robust algorithm for the
detection of structured P2P botnets in large traffic graphs. It draws on advances in the
state of the art in Community Detection algorithms, which aim to partition a graph into dense
communities.
Popular Community Detection Algorithms with low theoretical time complexities, such as
Label Propagation, Infomap and the Louvain Method, have been implemented and compared on large
LFR benchmark graphs to study their efficiency. The Louvain method is found to be capable of
handling graphs of millions of vertices and billions of edges. This thesis analyses the performance
of this method with two objective functions, Modularity and Stability, and finds that neither
of them is robust and general.
In order to overcome the limitations of these objective functions, a third objective function
proposed in the literature is considered. This objective function has previously been used
successfully in the case of Protein Interaction Networks, and is used in this thesis to detect
structured P2P botnets for the first time. Further, the differences in the topological properties
(assortativity and density) of structured P2P botnet communities and benign communities are
discussed. In order to exploit these differences, a novel measure based on mean regular degree
is proposed, which captures both the assortativity and the density of a graph, and its properties
are studied.
This thesis proposes a robust and efficient algorithm that combines greedy community
detection with community filtering using the proposed measure, mean regular degree.
The proposed algorithm is tested extensively on a large number of datasets and is found to be
comparable in performance, in most cases, to an existing botnet detection algorithm called
BotGrep, while being significantly faster.
Contents
Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
    1.1 Botnets and Botnet Detection
    1.2 Complex Networks and Community Detection
    1.3 Motivation and Objective
    1.4 Approach
    1.5 Organization of the Thesis
2 Botnets and Botnet Detection
    2.1 Introduction
    2.2 Operation of a Typical Bot
    2.3 Botnet Command and Control
        2.3.1 Centralized Command and Control
        2.3.2 Decentralized or Peer-to-Peer (P2P) Command and Control
    2.4 Botnet Detection
        2.4.1 Preliminaries
        2.4.2 Methods of Botnet Detection
    2.5 Detection of Structured P2P Botnets in Large Scale Networks
        2.5.1 BotGrep
3 Community Detection Algorithms
    3.1 Introduction
    3.2 Graphs
        3.2.1 Edges, Directionality and Weights
        3.2.2 Adjacency Matrix, Degree and Transition Probability Matrix
        3.2.3 Degree Distributions and Random Graph Models and Assortativity
        3.2.4 Power Laws or Scale Free Networks
        3.2.5 Random Graph Models
        3.2.6 Assortativity
        3.2.7 Paths, Connected Components, and Betweenness Centrality
        3.2.8 Subgraphs, Covers and Partitions
    3.3 Community and Community Structure
        3.3.1 Definitions
        3.3.2 Partition Quality Functions
    3.4 Community Detection Algorithms
    3.5 Efficiency Comparison of Community Detection Algorithms
        3.5.1 Dataset - LFR Benchmarks
        3.5.2 Discussion and Algorithm Selection
        3.5.3 Conclusion
4 Identifying Structured P2P Botnets using the Louvain Method
    4.1 Introduction
    4.2 The Louvain Method
        4.2.1 Greedy Modularity Optimization
        4.2.2 Community Aggregation
    4.3 Dataset Generation
        4.3.1 Network Traces
        4.3.2 Background Graph Construction and Properties
        4.3.3 Structured P2P Graph Generation
        4.3.4 Embedding the Botnet
    4.4 Application of the Louvain method to identify structured P2P botnets
        4.4.1 Datasets and Evaluation
        4.4.2 Results and Discussion
    4.5 Community Detection at different resolutions and Multiresolution Modularity
        4.5.1 Resolution Limit and Multiresolution Modularity
        4.5.2 Stability and Stability Optimization
    4.6 Optimization of Stability using the Louvain Method to identify Structured P2P Botnets
        4.6.1 Datasets and Evaluation
        4.6.2 Results and Discussion
    4.7 Summary
5 A robust algorithm for identification of Structured P2P Botnets
    5.1 Introduction
    5.2 Obtaining Small and Homogeneous Communities
        5.2.1 An alternative Objective Function - Qw−log−v
        5.2.2 Optimizing Qw−log−v - Louvain method vs Single Step Greedy Optimization
    5.3 Differentiating between bot and benign communities
        5.3.1 Properties of Structured P2P Botnets vs Properties of the Background
        5.3.2 Properties of the small and homogeneous communities obtained by the greedy optimization of Qw−log−v
        5.3.3 Mean Regular Degree mreg
    5.4 Robust and efficient method to identify nodes that are part of structured P2P Botnet
        5.4.1 Stage 1
        5.4.2 Stage 2
    5.5 Evaluation
        5.5.1 Metrics
        5.5.2 Performance on Abilene Trace Graphs
    5.6 Robustness of the proposed algorithm
        5.6.1 Robustness under conditions of partial visibility
        5.6.2 Performance on the LEET-Chord Topology
        5.6.3 Efficiency and Scalability
    5.7 Conclusion
6 Summary and Conclusions
    6.1 Summary of Contributions
        6.1.1 Efficiency Comparison of Community Detection Algorithms
        6.1.2 Detection of Structured P2P Botnets using the Louvain Method
        6.1.3 Robust and Efficient method to detect Structured P2P Botnets
    6.2 Directions for Future Work
References
List of Tables
4.1 Properties of Graphs extracted from the Network Traces
4.2 Comparison of Louvain Modularity and BotGrep on Abilene Traces
4.3 Performance Summary of Modularity Optimization using the Louvain Method, with reference to BotGrep
4.4 Optimizing Stability at different values of t on Abilene Traces using the Louvain Method
4.5 Performance Summary of Modularity Optimization vs Stability Optimization (t=0.25) using the Louvain Method, with reference to BotGrep
5.1 Community Structure obtained by optimization of Qw−log−v using the Louvain method on CHORD Botnets embedded in Abilene WASH Router Trace Graphs
5.2 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs - Precision and Recall
5.3 Summary of the proposed method with reference to BotGrep on the Abilene traces (FScore)
5.4 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs under conditions of partial visibility - Precision and Recall
5.5 Performance (FScore) Summary of the proposed method with reference to BotGrep on the Abilene traces under conditions of partial visibility
5.6 Performance comparison of the proposed method and BotGrep on LEET-Chord graphs embedded in Abilene Trace Graphs - Precision and Recall
5.7 Performance of the Proposed Method on CAIDA Datasets - Precision, Recall and FScore
List of Figures
2.1 Botnet Life Cycle
2.2 Botnet Command and Control Topologies
2.3 A CHORD Graph with 16 nodes (Image from [68])
2.4 A DeBruijn Graph of 3 length string of alphabet 01 (Image from [43])
2.5 A network partition for node 6 in a Kademlia network of 8 nodes ([52])
3.1 Communities in a Graph (Image from [31])
3.2 Comparison of CDA on LFR Benchmarks
3.3 Comparison of CDA on LFR Benchmarks - Louvain vs CNM
4.1 The Louvain Method (Image from [3])
4.2 An example of a graph with a superimposed botnet (Image from [44])
4.3 Performance of the Louvain Method on Abilene Trace Graphs
4.4 Performance of Stability Optimization on Abilene Trace Graphs
4.5 Comparison of Stability Optimization (t=0.25) and BotGrep on Abilene Trace Graphs
5.1 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs - FScore
5.2 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs under conditions of partial visibility - FScore
5.3 Performance comparison of the proposed method and BotGrep on LEET-Chord graphs embedded in Abilene Trace Graphs - FScore
5.4 Runtime comparison of proposed method and BotGrep: Abilene 1 Day Traces and 1000 Node CHORD Botnet
Chapter 1
Introduction
1.1 Botnets and Botnet Detection
Computers today are subject to infections from a variety of malicious software or malware
such as viruses, trojans, worms and keyloggers. Such infections are a serious threat to the security
and privacy of the user. However, cyber-criminals have gone a step further and created networks of
these malware-infected or compromised hosts that operate in coordination, ready to do their
bidding and engage in activities that potentially threaten the security of the entire Internet.
These networks of compromised computers (called zombies or bots) are called botnets. The
controller of these hosts – the botmaster or botherder can control the entire botnet remotely,
and thus has an illegal distributed cloud of computers in his possession, which he can exploit
for carrying out malicious activities for economic or political gain. Botnets are typically used
to send spam e-mail and are responsible for about 80% of all spam e-mail [67]. They are also used
to execute Distributed Denial of Service (DDoS) attacks, perform click-fraud, host phishing
sites, and harvest sensitive and private information such as credit card numbers and passwords
[27]. As of 2012, an estimated 3-7% of enterprise hosts and 10% of home computers were
found to be bot-infected, according to Damballa Inc. [12].
Botnets are controlled and coordinated by a command and control channel that may be cen-
tralized or decentralized. Centralized mechanisms have one or more command and control
servers which the bots connect to in order to receive orders. Decentralized or peer-to-peer
(P2P) mechanisms are gaining preference among botmasters owing to their resilience and re-
sistance against targeted attacks, which would dismantle botnets having central servers. The
first step of mitigating the botnet threat is detection. Host-based methods operate similarly to
anti-virus systems and detect activities of the bot in the host system. Network-based methods
rely on features obtained by passive monitoring of network traffic. Network-based approaches
are the most popular owing to the relative ease of deployment. A variety of techniques from
different areas have been applied to network-based detection of botnets. Traffic mining,
clustering, correlation, entropy analysis, stochastic modelling, time series analysis and other
machine learning based techniques have been proposed in the literature and surveyed by Silva et
al. [63].
1.2 Complex Networks and Community Detection
Most real world systems can be modelled as complex networks or graphs where vertices rep-
resent the entities and the presence of an edge represents an interaction of any kind between
them. The field of complex networks is aimed at studying the topological properties of these
networks and understanding their dependence on the function of the real world systems. The
field is interdisciplinary, and has seen contributions from biologists, computer scientists, physi-
cists and statisticians.
In most complex networks, there exists community structure, where certain groups or
communities of nodes are more tightly knit than the rest of the graph. It is of interest
to study these communities, as they can reveal information about the structure and dynamics
of the system, and reveal entities that are similar. Community Detection Algorithms aim to
detect the community structure in a graph by partitioning it into densely connected subgraphs,
and have been an active topic of research in the complex networks community.
With network traffic data modelled as graphs, techniques such as community detection
algorithms can be brought from the field of complex networks to tackle problems in computer
network security such as botnet detection.
1.3 Motivation and Objective
Peer-to-Peer topologies can be structured or unstructured. Structured P2P topologies were
shown to be ideal topologies for botnets by Davis et al. [13] on the basis of their resilience to
dismantling and destabilization. An important observation is that botnets are a global threat.
Effective mitigation of this threat is in the interest of all nations and corporations, and
international cooperation is needed. Botnet detection and mitigation is in the interest of the Internet
Service Provider (ISP) as well, owing to the bandwidth wasted on malicious applications and spam
e-mails typically associated with botnets.
Assuming the cooperation of the most important (Tier-1) ISPs, passive monitors can be deployed
to collect traffic at the backbone routers of these large ISPs. This will result in global visibility
of network traffic that can be exploited to detect botnets.
The large volume of traffic renders most of the current detection methods useless. In such a
setting only a reduced or abstracted form of the data can be effectively handled. A simple
abstraction is the construction of a graph with hosts as nodes and an edge between two hosts
if they exchange a packet. Even after this abstraction, handling these large graphs (millions
of nodes, hundreds of millions of edges) is still a challenge.
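As an illustration of this abstraction, the following minimal Python sketch builds an undirected host graph from (source, destination) pairs. The record format, the function name build_host_graph and the sample addresses are assumptions made purely for illustration; they are not taken from the traces or tools used in this thesis.

```python
from collections import defaultdict


def build_host_graph(flows):
    """Return an undirected adjacency map: host -> set of hosts it exchanged packets with.

    `flows` is assumed to be an iterable of (source, destination) pairs;
    only the endpoints of each packet or flow are inspected.
    """
    adjacency = defaultdict(set)
    for src, dst in flows:
        if src == dst:
            continue  # ignore self-loops
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return dict(adjacency)


# Hypothetical sample records (addresses are placeholders).
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.1"),  # reverse direction collapses onto the same edge
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.2"),  # a repeated flow adds nothing new
]
graph = build_host_graph(flows)

# Each undirected edge is stored twice, once per endpoint.
num_edges = sum(len(neighbours) for neighbours in graph.values()) // 2
```

Storing only the endpoints keeps the memory cost proportional to the number of distinct host pairs rather than to the number of packets, which is the property that makes the abstraction attractive at backbone scale.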
Nagaraja et al. proposed BotGrep [44], which works on such a graph constructed from network
traffic and uses the topological properties of botnet command and control (C2C) communication
graphs to separate them from benign traffic.
BotGrep is tested on synthetic botnet topologies superimposed on a graph constructed from
real world backbone traces, and is found to give high accuracy on the datasets tested.
A structured P2P botnet should have a high internal connectivity among its members so as
to achieve robustness against targeted and random failures. This provides the motivation for
using community detection algorithms in order to detect them. Nagaraja et al. [44] have also
compared the performance of BotGrep with several Community Detection Algorithms on a
scaled-down sampled graph, and conclude that:
“While these traditional techniques were not intended to scale to the large data sets we consider
here, they may be appropriate for localizing smaller botnets in contained environments (e.g.,
within a single Honeynet, or the part of a botnet contained within an enterprise network)”[44]
Community Detection Algorithms have received a great deal of attention over the last few
years, and several scalable algorithms have been proposed that are capable of handling graphs
of millions of vertices and billions of edges. They are very general and can be easily adapted
for large scale detection of structured P2P botnets.
The aim of this thesis is to study in depth the applicability of Community Detection Algorithms
and to develop a scalable, efficient and robust algorithm for the detection of structured P2P
subgraphs that draws on advances in the state of the art in community detection,
and to compare its performance with that of BotGrep.
1.4 Approach
This thesis first surveys the community detection algorithms used in complex networks. Based on
the theoretical analysis of time complexity available in the literature, only
those algorithms whose time complexity is linear in the number of edges, such as Label Propagation
[54], Infomap [56] and the Louvain Method [3], have been considered for implementation. This
exercise points to the conclusion that the Louvain method is the most suitable for the identification
of structured P2P botnets in large networks.
The application of the Louvain method as proposed in [3] resulted in performances comparable
to BotGrep for sparsely connected background networks, while for dense background graphs it
is found to be inferior. A deeper analysis brought out the need to improve the original Louvain
method particularly in situations where the background is dense.
The Louvain method from [3] considers modularity as the objective function. Stability Optimization,
proposed by Lambiotte et al. [36], is a modification of modularity that incorporates a parameter t,
which is the weight on the internal density of each community. The optimization of Stability
has also been implemented according to [36] and analysed for its applicability in botnet detection.
This technique appears to suffer from the disadvantage of the increased computational
effort needed to search for an optimal value of t.
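For reference, the modularity of a partition of an unweighted, undirected graph with m edges can be written in its standard form as Q = sum over communities c of [ L_c/m - (d_c/2m)^2 ], where L_c is the number of edges inside c and d_c is the sum of the degrees of its nodes. The sketch below evaluates this expression on a toy graph. The function name and the example graph are illustrative assumptions, and the code is only an evaluation of the objective, not the Louvain implementation used in this thesis.

```python
def modularity(edges, community):
    """Modularity of a partition of a simple, unweighted, undirected graph.

    `edges` is a list of (u, v) pairs with no duplicates or self-loops;
    `community` maps each node to its community label.
    """
    m = len(edges)
    intra_edges = {}   # L_c: number of edges with both ends in community c
    total_degree = {}  # d_c: sum of degrees of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        total_degree[cu] = total_degree.get(cu, 0) + 1
        total_degree[cv] = total_degree.get(cv, 0) + 1
        if cu == cv:
            intra_edges[cu] = intra_edges.get(cu, 0) + 1
    return sum(
        intra_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in total_degree.items()
    )


# Toy graph (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# One community per triangle versus a single community for the whole graph.
q_split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_single = modularity(edges, {n: "all" for n in range(6)})
```

On this toy graph the two-triangle partition scores 5/14 while the single-community partition scores 0, which matches the intuition that modularity rewards groups that are denser internally than expected from their degrees.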
This formed the motivation for exploring another objective function, w-log-v, proposed by
Van Laarhoven and Marchiori [72] and successfully applied to protein-protein interaction
networks. The optimization of w-log-v using the Louvain Method resulted in the identification
of small fragments of either botnets or benign communities with high precision.
The thesis then proposes a method to distinguish between the benign and botnet communities
and aggregate the latter. This has been achieved using a novel scoring function, which is based
on mean degree and degree homogeneity of the communities. Results from this combined
technique have been presented and compare favourably with BotGrep.
The overall process results in a novel technique for detecting structured P2P botnets while having
the advantage of being faster by almost 300 times for graphs of a million edges or more.
1.5 Organization of the Thesis
The rest of this thesis is organized into five chapters.
Chapter 2
This chapter describes botnets and studies the lifecycle of a typical bot. Botnet Command
and Control is described in detail and various Botnet Detection Techniques are described. Prior
art in the detection of botnets in large networks is surveyed, and the BotGrep [44] method is reviewed in depth.
Chapter 3
This chapter contains a survey of Community Detection Algorithms (CDA). Algorithms found
to be scalable in terms of theoretical time complexity have been identified and implemented.
The running times of these algorithms are then compared on standard community detection
benchmark graphs.
Chapter 4
This chapter describes the Louvain Method in detail and applies it to detect structured
P2P botnets. For the purpose of the evaluation, the method to generate datasets is described.
The optimization of Modularity as well as Stability in order to detect structured P2P botnets is
studied.
Chapter 5
This chapter considers the Qw−log−v objective function as an alternative objective function for
optimization by the Louvain Method. The behaviour of the objective function on the datasets is
studied. The novel measure mreg is proposed to differentiate between benign and bot communities.
The proposed algorithm, which combines the greedy optimization of Qw−log−v with community
filtering built around mreg, is described. A comprehensive evaluation of the method is carried out.
Chapter 6
This chapter summarises the contributions, makes concluding remarks and lays down direc-
tions for future work.
Chapter 2
Botnets and Botnet Detection
2.1 Introduction
Botnets are networks of compromised hosts (or bots) that can be remotely controlled by an
attacker (or botmaster). A bot is typically connected to the Internet, and has been infected
by some malware so as to be able to communicate with the botmaster. The botmaster can
control each bot remotely over the network, and can instruct it to update the malware installed
on the bot system, send spam e-mails, execute denial-of-service attacks and harvest private
information. The lifecycle of a typical bot is described in Section 2.2.
Bots are coordinated and controlled using a command and control (C2C) channel, which can be
centralized or peer-to-peer. The topology of the C2C determines the efficiency and resilience
of the botnet. A description of the C2C mechanisms of botnets is provided in Section 2.3.
Section 2.4 contains a survey of Botnet Detection techniques. Section 2.5 describes literature
on detection of botnets in large networks, with special emphasis on BotGrep [44].
2.2 Operation of a Typical Bot
A typical bot host has a life cycle as shown in Figure 2.1. The stages of this life cycle are
Infection, Rallying, Binary/Egg Download, Wait for Orders, Execution of Orders and Termination.
Figure 2.1: Botnet Life Cycle
Infection
The bot can enter the host through a variety of infection vectors: user-mediated or facilitated
actions such as opening malicious e-mail attachments, or downloading malicious software
from a phishing site. It can also enter through drive-by-download, where it enters
the user's system without the user's knowledge by exploiting common web browser vulnerabilities.
Network-based exploitation of vulnerabilities is another way of infection.
Rallying
After a successful infection, a bot has to communicate with its botmaster to inform him of
its infection. This is normally done through hard-coded domain names or IP addresses. More
recently, botnets have begun employing Domain Generation Algorithms [51] in order to
avoid the hard-coding step and better hide the botmaster. Domain Generation Algorithms
create pseudo-random domain names using a seed known to the botmaster. The botmaster can then
register a small fraction of the domains from the sequence of generated names. As bots try
to connect to several of these domains per day, there is a good chance they will hit a domain
name registered by the botmaster.
Binary Download/Update
In many cases, the initial binary only contains the code to enable the botmaster to rally the bots.
A back-door is typically installed to allow remote access of the system. Bots are designed to
be modular, with pluggable components to execute different actions. After rallying, the bots
may be instructed by the botmaster to download the appropriate modules [1]. For instance, a
botnet designed to send spam e-mail will be instructed to download spam mail templates and
relevant code needed to send mails [50].
Wait for Orders
After the bot has downloaded all the necessary components, it will go into a wait state. The bot
code will continue to run on the system, periodically polling for commands from the botmaster.
If it has a spyware component, it will continue to log keystrokes and try to harvest other
sensitive data from the system.
Execute Orders
On receiving orders from the botmaster, the bot may participate in a DDoS attack, send spam,
download a list of URLs to perform click-fraud or be instructed to crack passwords using brute
force techniques [27]. It may also be instructed to update itself with a newer version. Another
important action is to execute TCP scans of other Internet hosts in order to discover open and
vulnerable ports to exploit and recruit further bots.
Termination
In some cases, if the botmaster wants to dismantle his botnet, the bot may be instructed
to delete itself from the host, clearing all records of its existence.
(a) Centralized ([76]) (b) Unstructured P2P ([64]) (c) Structured P2P ([68])
Figure 2.2: Botnet Command and Control Topologies
2.3 Botnet Command and Control
What distinguishes a bot from other malware is its coordination with other bots
that are a part of the same botnet. This coordination is enabled through a command and
control channel (C2C) between the botmaster and his bots. The command and control can be
classified as centralized or decentralized depending on the communication topology among
the botmaster and the bots. A hybrid topology combining a centralized component and a
peer-to-peer component has also been proposed in the literature [76].
2.3.1 Centralized Command and Control
In this topology, there exist one or more command and control servers controlled by the
botmaster that are used to issue orders to the bots, and all the bots are aware of, and able
to contact, these servers. Early botnets used the Internet Relay Chat (IRC) protocol, with C2C
servers hosting channels that individual bots can join. The botmaster can use the channel to push
commands to bots. The popularity of the IRC protocol waned owing to the rise of Instant
Messaging. In order to blend in with the predominant Hypertext Transfer Protocol (HTTP)
based web traffic, botnets began to use HTTP, where individual bots pull
instructions from a web server.
2.3.2 Decentralized or Peer-to-Peer (P2P) Command and Control
The main weakness of centralized command and control is that it is prone to targeted attacks:
if the C2C servers are taken down, the botnet is completely paralysed. To overcome this
limitation, botmasters began to migrate to P2P command and control. A detailed survey of
Peer-to-Peer networking can be found in [61]. In this mechanism, there are no fixed command
and control servers, and the botnet can be controlled by the botmaster from any node. A bot
polls the network for orders by looking for a certain file of commands, which may be uploaded
to any node by the botmaster. Thus the botnet functions by issuing lookups for the command
files. The efficiency of this lookup is governed by the routing of the lookup message, which in
turn depends on the geometry or organization of the hosts. Based on this organization,
peer-to-peer control can be further classified into unstructured or structured.
Unstructured P2P
In this mode of P2P communication, a peer randomly selects other peers to connect to, and
lookup is carried out by flooding or random walks. Gnutella[55], a popular file-sharing service,
is an example of an unstructured P2P network.
Structured P2P
In this mode of P2P communication, the peers are a part of a distributed hash table (DHT)
that stores key-value pairs. Each peer and each data item is identified by different unique IDs,
and a hash function maps the keys to the nodes. Each node maintains a routing table, and the
connections among the peers are structured in order to provide guarantees on the maximum
number of hops required to locate the data item. Churn (the joining and leaving of
hosts) is handled with the use of replication and redundancy. Popular DHTs include CHORD,
Koorde and Kademlia.
CHORD
CHORD [65] is a simple DHT in which keys and peers are mapped into the same ID space by
consistent hashing using the SHA-1 algorithm. The peers are organized into a ring, with
each node storing a total of $\log_2 N$ peers, where $N$ is the number of nodes; this includes its
predecessor and successor in the ring. The graph is constructed using the rule that each node
$i$ connects to nodes $i - 1$ and $i + 1$ to complete the ring, and has long-range links to
nodes $(i + 2^k) \bmod N$ for $k = 1, \ldots, (\log_2 N) - 1$. This forms the routing table or
finger table of each node, as illustrated in Figure 2.3. For key lookup, the node in the finger
table whose ID is closest to the key is asked to search for the key. This happens recursively and
the lookup is done in $O(\log N)$ hops in a manner similar to binary search, where the distance
between the source node requesting the key and the node that has the key is halved at every hop.
Figure 2.3: A CHORD Graph with 16 nodes (Image from [68])
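The CHORD construction rule and the greedy lookup can be illustrated with a minimal sketch. It assumes a fully populated ring whose size is a power of two (as in the 16-node example), and the function names `finger_table` and `lookup_path` are hypothetical, introduced only for this illustration; a real CHORD deployment has sparse IDs and successor lists that are not modelled here.

```python
def finger_table(i, n):
    """Neighbours of node i in a fully populated ring of n nodes (n a power of two)."""
    bits = n.bit_length() - 1                      # log2(n) for a power of two
    links = {(i - 1) % n, (i + 1) % n}             # predecessor and successor
    links.update((i + 2 ** k) % n for k in range(1, bits))  # long-range links
    return sorted(links)


def lookup_path(src, key, n):
    """Greedy lookup: always hop to the finger with the smallest clockwise distance to the key."""
    path = [src]
    node = src
    while node != key:
        node = min(finger_table(node, n), key=lambda f: (key - f) % n)
        path.append(node)
    return path


if __name__ == "__main__":
    print(finger_table(0, 16))      # links of node 0 in a 16-node ring
    print(lookup_path(0, 11, 16))   # nodes visited while looking up key 11
```

In this sketch the remaining clockwise distance shrinks by the largest available power of two at each hop, which is the halving behaviour described above.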
KOORDE
Koorde [33] is a DHT similar to CHORD; however, peers are organised according to
a de Bruijn graph of constant degree $k$. A de Bruijn graph represents relationships between
strings. The node set $V$ comprises every possible string of length $n$ from an alphabet of
$m$ letters. The edge set $E$ contains a directed edge from a string $u$ to a string $v$ if the
former can be transformed into the latter by removing the first letter and appending a letter. An
example de Bruijn graph based on all length-3 strings over the alphabet {0, 1} is depicted in
Figure 2.4. Consistent hashing is used to map keys to nodes based on IDs. Lookup takes place by
shifting $k$ bits of the key and checking if such a node exists in the finger table; if not, it is
again shifted by $k$ until an appropriate node is found, which recursively applies the same
technique to find the file. Thus lookup is done in $O(\log_k N)$ hops.
Figure 2.4: A de Bruijn graph of length-3 strings over the alphabet {0, 1} (Image from [43])
KADEMLIA
The Storm botnet[50] was based on the Overnet/Kad network, which had a Kademlia-based DHT;
currently the TDL-4 botnet uses the Kad network[23]. In the Kademlia DHT[42], peers are organized
as the leaves of a binary tree based on their IDs. Each node views the entire tree as a partition
into subtrees (called buckets) of logarithmically decreasing sizes (Figure 2.5) and has an entry
for a node in each bucket. In the example in Figure 2.5, a graph of 8 nodes with IDs 0-7, the node
with ID $(110)_2$ has edges to $(111)_2$, $(100)_2$ and $(001)_2$. Lookup proceeds by computing
the nearest node using the XOR distance between the key and the node IDs in the routing table.
This process is then carried out recursively, halving the distance to the actual location and
allowing lookup in $O(\log N)$ hops.
Figure 2.5: A network partition for node 6 in a Kademlia network of 8 nodes ([52])
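The bucket structure and the XOR metric of Kademlia can be sketched for the 8-node example. This is only an illustration under the assumption that every ID from 0 to 7 is occupied; the helper names `bucket_index`, `buckets` and `closest_contact` are hypothetical and do not correspond to any particular Kademlia implementation.

```python
def xor_distance(a, b):
    """Kademlia distance between two IDs."""
    return a ^ b


def bucket_index(node, other):
    """Bucket of `node` that holds `other`: position of the highest differing bit."""
    return (node ^ other).bit_length() - 1


def buckets(node, id_bits):
    """Partition all other IDs into buckets of size 1, 2, 4, ... as seen from `node`."""
    result = {b: [] for b in range(id_bits)}
    for other in range(2 ** id_bits):
        if other != node:
            result[bucket_index(node, other)].append(other)
    return result


def closest_contact(contacts, key):
    """Routing-table entry that is nearest to the key in XOR distance."""
    return min(contacts, key=lambda c: xor_distance(c, key))


if __name__ == "__main__":
    print(buckets(0b110, 3))                            # view of node 6
    print(closest_contact([0b111, 0b100, 0b001], 0b011))  # next hop for key 3
```

The three contacts used in the usage example are the edges of node $(110)_2$ given in the text, one per bucket.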
2.4 Botnet Detection
The first step in containing the botnet threat is detection. Several methods have been proposed
in literature. Comprehensive surveys of these methods have been carried out in [63] and [15]. A
taxonomy of detection techniques has been proposed in [80].
2.4.1 Preliminaries
Like Intrusion Detection Systems, Botnet Detection Systems can be classified as signature-based
or anomaly-based.
Signature based: Signature based methods depend on a database of known patterns or 'signatures'
that are compiled from known instances using a certain process. Detection is performed
by repeating the same process that generated the signatures to yield a test signature, which is
then compared against the database; a match is reported as an instance.
Anomaly Detection: Anomaly detection algorithms model the normal behaviour of the system
based on a set of selected features, and flag instances that express abnormal values of one or
more of the selected features. Anomaly based detectors can detect unknown or 'zero-day'
instances, which signature based methods fail to detect.
The above classification is based on the approach to detection. On the basis of the monitoring
point, detection systems can be classified as host-based or network-based.
Host-based Detection: The detection system is deployed on the individual end hosts, and
leverages features based on system logs, system state changes, and/or system call traces to
carry out detection.
Network based Detection: The detection system is deployed at the periphery of a network,
where the communication among external and internal hosts can be passively monitored, and
extracts packet and flow based features to discriminate between benign and botnet traffic.
Network based methods are easier to deploy than host-based systems, and can observe the
important coordinated network behaviour of botnets. Most of the methods surveyed in the next
section are network based, and a method can be assumed to be network-based unless
explicitly specified to be host-based.
2.4.2 Methods of Botnet Detection
Honeypots and Honeynets
A honeypot is a set of hosts that are made intentionally vulnerable, so as to attract attackers.
Honeypots get infected with various malware, and since they are under the control of the
security community, they can be used to infiltrate a botnet and study it. Honeynets are networks
of honeypots used to observe the network behaviour of malware, and can be used to detect botnets
on a small scale.
Correlation of various behaviours
A botnet will behave according to the lifecycle in Figure 2.1, exhibiting two or more of the stages.
Detection methods can detect the presence of each stage, and these detection events can then be
correlated to identify botnet activity. Binkley et al.[2] have proposed an algorithm to detect
suspicious IRC botnet channels by identifying member hosts that exhibit TCP scan-like activity.
Scan-like activity is detected using a metric called TCP work weight, which is the fraction of
TCP packets that have the SYN, RESET or FIN flag set over the total number of TCP packets,
computed for every IRC channel detected by parsing the payload of IRC packets. The disadvantage
of this method is that it assumes that bots use the IRC protocol for C2C. The method also relies
on deep packet inspection, and thus can be easily defeated by encryption. Finally, it cannot
detect botnets that do not rely on scans to proliferate.
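The work weight metric, as described above, can be sketched in a few lines. The representation of a packet as a set of flag names and the function name `tcp_work_weight` are assumptions made for this illustration; the exact definition used by Binkley et al. may differ in detail from this reading of the description.

```python
CONTROL_FLAGS = {"SYN", "RST", "FIN"}


def tcp_work_weight(packets):
    """Fraction of TCP packets carrying SYN, RST or FIN.

    `packets` is an iterable of sets of flag names, one set per TCP packet
    (an assumed, simplified representation of parsed traffic).
    """
    packets = list(packets)
    if not packets:
        return 0.0
    control = sum(1 for flags in packets if flags & CONTROL_FLAGS)
    return control / len(packets)


if __name__ == "__main__":
    sample = [{"SYN"}, {"ACK"}, {"SYN", "ACK"}, {"ACK", "PSH"}]
    print(tcp_work_weight(sample))
```

A host that mostly opens and resets connections, as a scanner does, yields a weight close to 1, whereas a host carrying bulk data yields a weight close to 0.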
A host based approach was devised by Masud et al. [41]. In this method, multiple
log files are correlated. Features are extracted from a combination of exedump, which logs
application traces, and tcpdump, which logs the network activity of the host. Botnet command
flows are categorised into three classes, leveraging both the host level and
network level traces to extract a set of flow based features pertaining to the botnet command
flows. Several machine-learning based classifiers are trained with tagged training data, and the
resulting model is used to detect bot activity in untagged test traces.
BotHunter [25] is a detection system that models the botnet lifecycle stages: inbound scanning,
inbound infection, egg download, C2C communication and outbound scans. It comprises
a payload based anomaly detection engine based on 1-gram character distributions
and a TCP scan detection engine, implemented as extensions to the SNORT Intrusion Detection
System, along with some custom rules to detect exploits, egg download and C2C traffic. A
correlation engine is used to combine the SNORT alerts and produce a final aggregated report
of botnet activity.
A more sophisticated extension to BotHunter called BotMiner was proposed in [24], where the
correlation engine in BotHunter is replaced by a clustering algorithm in which similar
malicious flows are clustered. An additional clustering stage groups flows based on flow based
features such as flows per hour (fph), packets per flow (ppf), bytes per packet (bpp) and bytes
per second (bps) for an hourly window. The two clusterings are then correlated by checking cluster
intersections.
A graph-theoretic framework to isolate botnets was proposed by Jaikumar et al. [29]. This
work relies on the use of botnet activity detectors such as SLADE in [25] and [24]. A
weighted graph is constructed with hosts as nodes, and the existence of an edge and its weight
are determined based on the common expression of botnet activity. These edge weights are
updated temporally according to a probabilistic model of the joint activity distribution of the
nodes. This weighted graph is then partitioned using a recursive spectral bisection technique
into clusters of nodes belonging to the same botnet.
Periodicity
Bots will periodically connect to the C2C to pull commands (for centralized botnets), or
periodically ping nodes in their peer list (distributed botnets).
Botnets need to leverage the Domain Name System (DNS) to obtain the IP address of the
Command and Control servers. Dagon [10] has proposed a method of detecting abnormally
high or temporally correlated DNS query rates by employing an outlier detection algorithm
based on the Mahalanobis distance. A method devised by Schonewille and Van Helmond
[60] detects bots based on recurring Domain Not Found responses, as these could be for domain
names that have been taken down, or generated by a Domain Generation Algorithm.
Giroire et al. [21] have developed a method to identify Command and Control flows by
exploiting temporal persistence. They operate on the destination end points of packets, extracting
'atoms', which are tuples of service, port and protocol. They monitor persistence using a
sliding-window scheme, counting the presence of the atom within each window. An alarm is
raised if the persistence is above a threshold.
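One way to read the sliding-window scheme is sketched below. The discretisation of time into numbered slots, the definition of persistence as the fraction of slots in a window in which the atom appears, and the names `persistence` and `persistent_atoms` are assumptions made for this illustration rather than details taken from [21].

```python
def persistence(seen_slots, window_start, window_len):
    """Fraction of slots in [window_start, window_start + window_len) where the atom was seen."""
    hits = sum(1 for s in range(window_start, window_start + window_len) if s in seen_slots)
    return hits / window_len


def persistent_atoms(observations, num_slots, window_len, threshold):
    """Return the atoms whose persistence exceeds `threshold` in at least one window.

    `observations` is an iterable of (slot_index, atom) pairs, where an atom is a
    (service, port, protocol) tuple.
    """
    slots_by_atom = {}
    for slot, atom in observations:
        slots_by_atom.setdefault(atom, set()).add(slot)
    flagged = set()
    for atom, slots in slots_by_atom.items():
        for start in range(num_slots - window_len + 1):
            if persistence(slots, start, window_len) > threshold:
                flagged.add(atom)   # raise an alarm for this atom
                break
    return flagged


if __name__ == "__main__":
    c2 = ("c2.example", 443, "tcp")      # contacted in every slot (hypothetical atom)
    web = ("news.example", 80, "tcp")    # contacted only occasionally (hypothetical atom)
    obs = [(s, c2) for s in range(10)] + [(2, web), (7, web)]
    print(persistent_atoms(obs, num_slots=10, window_len=5, threshold=0.6))
```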
Group activity or temporal correlation
Bots that are a part of the same botnet will behave similarly. For instance, in a network that has
multiple members of the same botnet, when a command is received, all of these hosts will respond
similarly, separated by a very small time difference.
Strayer et al. [66] proposed an approach to detect IRC C2C channels. The first stage
narrows down chat traffic by filtering out unlikely traffic. The second stage employs
machine learning based classifiers to perform flow based classification of the traffic into
IRC or non-IRC, based on flow characteristics such as flow duration, congestion window size,
average and variance of bytes per packet (bpp), bits per second (bps), packets per second (pps)
and variance in packet inter-arrival times. This classifier was trained on IRC traffic. The final
stage identifies temporally correlated flows, as bots of the same botnet will exhibit similar
response times. This method is tied to IRC based botnets; furthermore, flow randomization
strategies can easily defeat this approach.
Lu and Ghorbani [39] have presented a two-stage method that relies on a payload based
classification followed by a novel cross association algorithm. The latter clusters flows on the
frequencies of the 256 possible characters in each flow using the k-means clustering algorithm,
and returns the cluster with a low standard deviation of the character frequencies, under the
intuition that human chat activity is very diverse as compared to bot activity. This approach is
tied to IRC based botnets, and the need for Deep Packet Inspection makes the method difficult to
apply to high speed traffic. Encryption can easily defeat the features exploited.
Gu et al. have proposed BotSniffer [26] which performs spatio-temporal correlation of net-
work traffic to detect botnet command and control servers and infected nodes. It focusses on
IRC and HTTP traffic, and identifies if there is communication that exhibits group activity –
where a number of hosts send traffic within a given time using a score computed by a thresh-
old random walk based algorithm. The content similarity between the flows is measured via
n-gram analysis.
Choi et al. have presented BotGAD [4], which exploits the group activity exhibited by bots in
making DNS queries. In this work the similarity, periodicity and intensity of botnet DNS query
behaviour are exploited. The similarity between querying patterns is computed by standard
measures such as the Kulczynski or Jaccard coefficient, and the periodicity by the Euclidean
distance; bot hosts are then identified by appropriate thresholding of the three metrics. The
method relies only on DNS traffic, and is agnostic to the C2C protocol used by the botnet.
However, P2P based botnets need not rely on the use of DNS and may escape detection.
P2P Botnet Detection
Detection of P2P traffic poses several challenges: P2P networks normally employ strong
encryption, use random port numbers (including ports reserved for well known services) and
try to blend in with normal traffic in order to avoid detection [63]. The methods described in
this section detect C2C flows, i.e. the flows of the P2P overlay network that the bots are a part of.
Yen and Reiter have proposed a method to differentiate between file-sharing hosts and bots
[78]. This work exploits the differences between file-sharing hosts and bots, such as the large
data volumes and rapid churn associated with file-sharing hosts and the temporal similarity of
bots. To detect temporal similarity, they construct a histogram for each host, and cluster
the histograms on the basis of the Earth Mover's Distance.
Zhang et al. have described a method to detect stealthy P2P botnets [81]. Their approach is
a multi-stage one: it first reduces the flows by retaining those of nodes that
exhibit many failed outgoing connections, then clusters using flow based features,
then filters on the basis of temporal persistence, and finally relies on the overlap of
peers among the nodes in the clusters and on traffic similarities to identify P2P bots.
Jiang et al. have proposed a method to detect P2P botnets by discovering flow dependencies in
C2C traffic [32]. Dependencies of flows are extracted by identifying pairs of flows that occur
together many times in a given observation time, and the extracted two-level dependencies are
used to obtain higher level dependencies by combining flows. Flows are then clustered based
on the Jaccard similarity of the extracted flow dependencies.
Coskun et al. have described a graph-based method to identify other members of an unstruc-
tured P2P botnet within a network, when given a known bot[9]. A mutual contacts graph is
constructed, and a dye diffusion from the source node is simulated. Other members of the
same botnet are identified by thresholding on the final dye concentrations on the nodes.
2.5 Detection of Structured P2P Botnets in Large Scale Networks
Dagon et al. [11] carried out a graph-theoretic analysis of the effectiveness, efficiency and
robustness of the above topologies, modelling them as random graphs. It was
concluded that topologies based on structured P2P systems offer good resilience. Davis et al.
[13] analysed the performance of unstructured and structured P2P topologies, especially their
behaviour under random, tree-like and global information based disinfection strategies, and
concluded that structured P2P topologies are ideal mechanisms for botnet command and control,
as they provide a good trade-off between efficiency and resilience. Most of the existing botnet
detection approaches were designed for and deployed in small networks, like a campus or an
enterprise. Botnets are a global threat to the security of the Internet, and the Internet Service
Providers (ISPs) have good reason to be concerned, as their bandwidth is being
misused; thus it is in their interest to rid the Internet of botnets.
As shown by Davis et al.[13] and with examples like the Storm Botnet and the indestructible
TDL-4 botnet, structured P2P based botnets can be a serious threat, and efforts must be made
for their detection and removal.
Jelasity et al. showed the limitations of local approaches in the detection of structured P2P
botnets[30]. They showed that the visibility of P2P botnet traffic can be made very small if
botnets adopt certain strategies. They proposed a new overlay topology based on the existing
CHORD topology, but with clusters arranged in such a way that the links touch the smallest
possible number of routers. They concluded that automated detection of P2P botnets can be achieved
only with cooperation among the major ISPs, and that future research should target the
development of large scale P2P detection algorithms.
The primary challenge in deploying a botnet detector at the infrastructure level is the high
velocity of data, for which the methods discussed earlier in this chapter will not be effective
owing to their lack of scalability.
To handle such large volumes of data, an effective detection method must rely primarily on a
reduced or abstracted form of it, such as a graph of hosts with an edge between
two hosts if there is any data communication between them. The edge can be unweighted, and
independent of the protocol or size of the communication. Such an abstraction is very easy
to construct, as very little of the packet needs to be looked at, and storage requirements are
reduced, as even header information is not stored.
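The abstraction described above can be sketched as follows. The input format (an iterable of source and destination address pairs) and the function name `build_host_graph` are assumptions for this illustration; in practice the pairs would come from flow records or packet headers.

```python
def build_host_graph(flows):
    """Build an undirected, unweighted host graph from (src, dst) pairs.

    Returns a dict mapping each host to the set of hosts it communicated with.
    Repeated or reversed pairs collapse into a single edge; protocol and
    volume are deliberately ignored.
    """
    graph = {}
    for src, dst in flows:
        if src == dst:
            continue  # ignore self-communication
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set()).add(src)
    return graph


if __name__ == "__main__":
    flows = [
        ("10.0.0.1", "10.0.0.2"),
        ("10.0.0.2", "10.0.0.1"),   # reverse direction, same edge
        ("10.0.0.1", "10.0.0.3"),
    ]
    g = build_host_graph(flows)
    print(g)
    print(sum(len(n) for n in g.values()) // 2, "edges")
```

Only the two endpoint addresses of each communication are retained, which is what keeps the storage requirement small.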
The important question is whether this data retains enough features to enable the detection of
structured P2P botnets.
BotTrack, proposed by Francois et al. [20] is a method that works on a directed traffic graph
constructed from NetFlow traces, and computes the hub and authority centrality of each host,
and clusters hosts based on these values using the DBSCAN algorithm.
2.5.1 BotGrep
Nagaraja et al. have proposed BotGrep [44]. In this work, structured P2P botnets are
differentiated from background traffic using only connectivity features. It exploits the
observation that structured P2P botnet subgraphs are fast-mixing while the subgraph of normal or
background traffic is not. The state probability vector $q^t$ associated with random walks on the
graph, whose $i$-th component represents the probability of being at vertex $i$ after $t$ steps,
will converge to the stationary distribution of the graph in a very small number of steps owing
to the expansion properties, which are absent in the topology associated with regular client-server
traffic. Detection is achieved by a two-step algorithm which includes a fast prefiltering step and
a relatively slower refinement step which aims at removing false positives. In this work, it
is assumed that honeynet nodes are available to distinguish between file-sharing traffic and
botnets.
• Prefiltering: The first stage of the algorithm runs short random walks of $\log_2(N)$
steps, where $N$ is the number of nodes in the graph. The random walks are computed
according to the standard transition probability matrix $P_{ij} = \frac{1}{d_i}$ if there is an
edge from $i$ to $j$ in the graph, where $d_i$ is the degree or number of connections of node
$i$. The state probability vector $q^t$ is computed by $q^t = q^{t-1} P$. The resulting vector is
proportional to the degree of each node, and a quantity
$s_i = \left(\frac{q^t_i}{d_i}\right)^{1/r}$ ($r$ is an input parameter, assumed
to be 100 in the paper), which penalizes the state probabilities of the high degree nodes,
is used as a feature vector for the X-means clustering algorithm, which can automatically
determine the correct number of clusters. The X-means algorithm searches for the
appropriate number of clusters between 2 and $k_{max}$ ($k_{max}$ is an input parameter,
assumed to be 20 in the paper). The prefiltering step determines the detection rate of the
algorithm.
• Refinement: The cluster from the prefiltering step containing the honeynet nodes is then
refined by a recursive application of a bisection algorithm. The bisection algorithm is
based on a probabilistic model defined on the basis of a set of traces $T$ of random walks
of $\log_2(N)$ steps. These traces are the start and end vertices obtained by performing random
walks on a special transition probability matrix
$P_{ij} = \min\left(\frac{1}{d_i}, \frac{1}{d_j}\right)$ when there is an
edge between nodes $i$ and $j$. The probabilistic model assigns a probability of generating
the current set of traces from a given set (cut) of nodes. Using Bayes' theorem, the
probability that the given set of nodes is a botnet is computed by drawing
samples from the probabilistic model using Metropolis-Hastings sampling.
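The random-walk part of the prefiltering step can be sketched as below. This is not the BotGrep implementation: the uniform starting vector, the rounding of $\log_2(N)$ up to an integer number of steps, and the function names `random_walk` and `prefilter_scores` are assumptions made for illustration, and the subsequent X-means clustering of the scores is omitted.

```python
import math


def random_walk(graph, steps):
    """Propagate a probability vector for `steps` steps using P_ij = 1/d_i.

    `graph` maps each node to the set of its neighbours (undirected, no
    isolated nodes). A uniform start vector q^0 is assumed here.
    """
    q = {v: 1.0 / len(graph) for v in graph}
    for _ in range(steps):
        nxt = dict.fromkeys(graph, 0.0)
        for i, neighbours in graph.items():
            share = q[i] / len(neighbours)   # mass sent along each edge of i
            for j in neighbours:
                nxt[j] += share
        q = nxt                              # q^t = q^{t-1} P
    return q


def prefilter_scores(graph, r=100):
    """Degree-normalised feature s_i = (q_i^t / d_i)^(1/r) after about log2(N) steps."""
    steps = max(1, math.ceil(math.log2(len(graph))))
    q = random_walk(graph, steps)
    return {v: (q[v] / len(graph[v])) ** (1.0 / r) for v in graph}


if __name__ == "__main__":
    ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
    print(prefilter_scores(ring))
```

On a regular graph such as the ring in the usage example, every node receives the same score, so the scores carry information only when the walk mixes differently in different parts of the graph.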
As discussed in Chapter 1, Nagaraja et al. compared BotGrep[44] to several Community
Detection Algorithms (CDAs) and concluded that CDAs will not scale well enough
to handle this data. Community Detection Algorithms have received a lot of recent
attention, and several algorithms have been proposed that can theoretically handle large
graphs. In the next chapter a survey of the popular Community Detection Algorithms will be
carried out with special emphasis on scalability, and experiments will be carried out in order
to identify a candidate algorithm that can be used to detect structured P2P botnets.
Chapter 3
Community Detection Algorithms
3.1 Introduction
There has been a growing interest in network science, and various real world systems have been
modelled as graphs or networks. Apart from power law degree distributions
and small-world properties, real world networks were also found to have tightly knit
clusters of nodes or communities [22]. Community Detection (or graph clustering) algorithms
aim to detect these communities/clusters of nodes, given the graph.
A structured P2P botnet should have a high internal connectivity among its members so as
to achieve robustness against targeted and random failures. This provides the motivation for
using community detection algorithms in order to detect them.
Over the years there has been tremendous activity in the field of community detection
algorithms and a number of methods have been proposed. The aim of this chapter is to survey
some of the related work in literature, and identify a candidate method for structured P2P
botnet detection. Graphs and related terminology used in the rest of the thesis are
introduced in Section 3.2. Community Structure is introduced and some popular partition
quality functions are defined in Section 3.3. Some of the popular classes of Community Detection
Algorithms are described in Section 3.4. Section 3.5 aims at identifying a candidate algorithm
that can be applied to detect structured P2P botnets.
3.2 Graphs
A graph (or network) G(V,E) consists of a set of vertices (or nodes) V and a set of edges (or
links) E.
The number of nodes (or order) of the graph is the number of elements of set V and is denoted
by |V|.
The number of edges (or size) of the graph is the number of elements of set E and is denoted
by |E|.
3.2.1 Edges, Directionality and Weights
An edge is a tuple (u, v) : u, v ∈ V , indicating that there is a connection between node u and
node v. The edge set E ⊆ V × V .
In general, edges in a graph have directionality, i.e. an edge from u to v need not imply that v
is connected to u. A graph is said to be undirected if there is no directionality in any edge, so
that there is no difference between (u, v) and (v, u).
A real valued number or weight can be associated with every edge. An unweighted graph has
no number associated with an edge, simply a boolean value of 1 or 0 indicating the presence
or absence of an edge.
Unless mentioned otherwise, all graphs in this thesis can be assumed to be undirected and
unweighted.
3.2.2 Adjacency Matrix, Degree and Transition Probability Matrix
An undirected and unweighted graph can be represented by a symmetric binary valued matrix
called the adjacency matrix:
$$A_{ij} = \begin{cases} 1 & \text{if there is an edge between } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$
The number of connections or degree of a vertex is given by $d_i = \sum_{j \in V} A_{ij}$.
The transition probability matrix $P$ associated with a graph is given by $P = D^{-1}A$, where $D$
is the diagonal matrix with elements $D_{ii} = d_i$. Each element $P_{ij} = A_{ij}/d_i$ represents
the probability of a random walker to jump from vertex $i$ to vertex $j$.
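As a small worked example, the degrees and the transition probabilities of a three-node path graph can be computed directly from its adjacency matrix. The sketch takes $P_{ij} = A_{ij}/d_i$, so that row $i$ holds the probabilities of leaving vertex $i$, in line with the verbal description above; the function name `transition_matrix` is hypothetical.

```python
def transition_matrix(adj):
    """Row-stochastic transition matrix P with P[i][j] = A[i][j] / d_i.

    `adj` is a symmetric 0/1 adjacency matrix given as a list of lists;
    every vertex is assumed to have at least one edge.
    """
    degrees = [sum(row) for row in adj]
    return [[a / d for a in row] for row, d in zip(adj, degrees)]


if __name__ == "__main__":
    path = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
    for row in transition_matrix(path):
        print(row)
```

Each row of the result sums to 1, since a walker at any vertex must move to one of its neighbours.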
3.2.3 Degree Distributions and Random Graph Models and Assortativity
The degree distribution of a graph, $P(d_i = k)$, is the probability that a randomly chosen node $i$ has degree $k$.
3.2.4 Power Laws or Scale Free Networks
The degree distributions of many real world networks follow a power law or Pareto distribution,
where the probability that a node has degree $k$ is given by
$$P(d_i = k) = C k^{-\gamma}$$
where $C$ is a constant and $\gamma$ is an exponent that controls the 'skewness' of the distribution. The
skewness of the degree distribution results in the presence of hubs which account for a large
share of the edges of the graph. Such networks are also called scale-free.
3.2.5 Random Graph Models
There have been models proposed which aim to generate graphs via a random process. The
popular models are described here.
Erdos-Renyi(ER) Graph
The Erdos-Renyi Model or the ER model attaches a uniform probability $p$ to every edge, thus
the probability of existence of an edge is $P_{ij} = p$.
The degree distribution of an ER graph is binomial:
$$P(d_i = k) = \binom{|V| - 1}{k} p^k (1 - p)^{|V| - k - 1}$$
Configuration Model
The configuration model generates a graph given the degree sequence, i.e. the degrees of each
node. It is a process by which an equivalent random graph for a given graph can be created by
rewiring the edges. This rewiring process is done by considering each edge as two end stubs;
each free stub is then randomly connected to another free stub. The probability of an edge is
given by $P_{ij} = \frac{d_i d_j}{2|E|}$.
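The stub-matching process can be sketched as follows. The function name `configuration_model` and the `seed` parameter are introduced only for this illustration; the sketch keeps self-loops and repeated edges, which plain stub matching can produce and which some formulations reject or rewire.

```python
import random


def configuration_model(degrees, seed=None):
    """Randomly pair up stubs so that node i ends up with degrees[i] edge ends.

    Returns a list of (u, v) edges; self-loops and multi-edges are possible.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    rng = random.Random(seed)
    # One stub per edge end: node i appears degrees[i] times.
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list are joined into an edge.
    return [(stubs[k], stubs[k + 1]) for k in range(0, len(stubs), 2)]


if __name__ == "__main__":
    print(configuration_model([3, 2, 2, 1], seed=1))
```

Whatever the random pairing, every node keeps exactly its prescribed degree, which is why the model serves as a degree-preserving null model.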
3.2.6 Assortativity
The degree assortativity coefficient $r$ of a graph was proposed by Newman[45] to study the
degree-degree mixing patterns in complex networks. The degree-degree mixing patterns
indicate whether an average node in the graph connects to other nodes of similar degree (assortative
mixing) or whether it connects to nodes of dissimilar degrees (disassortative mixing). It is the
Pearson correlation coefficient between the degrees of the endpoints of each edge in the graph,
and is given by
$$r = \frac{1}{\sigma_q^2} \sum_{jk} jk \left(e_{jk} - q_j q_k\right) \qquad (3.1)$$
where $j$ and $k$ are the remaining degrees of the vertices at either end of an edge, $q_k$ is the
distribution of excess (remaining) degrees, given as $q_k = (k + 1) p_{k+1} / \sum_j j p_j$, $p_k$
is the probability of a randomly chosen vertex having degree $k$, $\sigma_q^2$ is the variance of
the distribution $q_k$, and $e_{jk}$ is the joint probability distribution of the remaining degrees
of the two vertices at either end of a randomly chosen edge.
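Since the coefficient is a Pearson correlation over edge endpoints, it can be computed directly from an edge list, as in the following sketch. It correlates full degrees rather than remaining degrees, which gives the same value because a Pearson correlation is unchanged when both variables are shifted by one. The function name `degree_assortativity` is hypothetical, and the coefficient is undefined (division by zero) for regular graphs, where the variance vanishes.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge.

    `edges` is a list of (u, v) pairs of an undirected simple graph. Each
    edge is counted in both orientations so that the measure is symmetric.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    m = len(xs)
    mean = sum(xs) / m                      # same mean for xs and ys by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return cov / var


if __name__ == "__main__":
    star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
    print(degree_assortativity(star))
```

A star graph is the extreme disassortative case, since every edge joins the highest-degree node to a node of degree one.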
3.2.7 Paths, Connected Components, and Betweenness Centrality
A path is a sequence of edges. A shortest path or geodesic path between two vertices $s$ and $t$
is a path that traverses the smallest number of edges needed to reach $t$ from $s$.
A set of nodes $C \subseteq V$ of a graph $G(V,E)$ is a connected component if there is a path from
every node in the set to every other node in the set.
The betweenness centrality of a node is given by
$$B(i) = \sum_{u,v \in V} \frac{\sigma_{u,i,v}}{\sigma_{u,v}}$$
where $\sigma_{u,v}$ represents the number of geodesic paths from node $u$ to node $v$ and
$\sigma_{u,i,v}$ represents the number of geodesic paths from node $u$ to node $v$ through node $i$.
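The definition can be evaluated by brute force with one breadth-first search per node, as sketched below. The sketch assumes that the sum runs over ordered pairs of distinct nodes $u, v$ other than $i$ (so each unordered pair is counted twice), and uses the fact that the number of geodesics from $u$ to $v$ through $i$ is $\sigma_{u,i}\,\sigma_{i,v}$ whenever $i$ lies on a geodesic. The function names are hypothetical, and faster algorithms exist for large graphs.

```python
from collections import deque


def bfs_counts(graph, source):
    """Return (dist, sigma): hop distances and numbers of geodesic paths from `source`."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:               # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:      # u precedes w on a geodesic
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(graph, i):
    """Sum over ordered pairs (u, v) of the fraction of u-v geodesics passing through i."""
    info = {v: bfs_counts(graph, v) for v in graph}
    dist_i, sigma_i = info[i]
    total = 0.0
    for u in graph:
        if u == i or i not in info[u][0]:
            continue
        dist_u, sigma_u = info[u]
        for v in graph:
            if v == u or v == i or v not in dist_u or v not in dist_i:
                continue
            if dist_u[i] + dist_i[v] == dist_u[v]:   # i lies on a u-v geodesic
                total += sigma_u[i] * sigma_i[v] / sigma_u[v]
    return total


if __name__ == "__main__":
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(betweenness(path, "b"), betweenness(path, "a"))
```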
3.2.8 Subgraphs, Covers and Partitions
A subgraph $G_s(C, E_s)$ corresponding to a set of nodes $C \subseteq V$ of a graph $G(V,E)$ is a
graph consisting of the edges $E_s = \{(u, v) : u, v \in C, (u, v) \in E\}$.
A cover $P$ is a set of subsets $\{C_i \subseteq V\}_{i=1 \cdots k}$ of $V$ such that $\bigcup_{i=1 \cdots k} C_i = V$.
A partition $P$ is a set of subsets $\{C_i \subseteq V\}_{i=1 \cdots k}$ of $V$ such that
$C_i \cap C_j = \emptyset$ for all $i \neq j$ and $\bigcup_{i=1 \cdots k} C_i = V$. It is thus a
set of mutually disjoint subsets of $V$ whose union gives the entire set $V$.
3.3 Community and Community Structure
3.3.1 Definitions
Community: A community (or cluster or group) can be defined in several ways, and there is
no universally accepted definition. However it can be intuitively understood as a set of nodes
that are densely connected to each other, and relatively sparsely connected to the rest of the
graph.
Community Structure: The community structure of a graph is the set of communities in the
graph. It can be represented as a partition, where each node belongs to only one community. A
discussion of overlapping communities, where a cover of the graph is desired, is out of the scope
of this thesis.
A community detection algorithm thus looks to identify a partition P = {C1 · · ·Ck} of
the graph such that the nodes of each community are densely connected to each other, and
relatively sparsely connected to the rest of the graph.
3.3.2 Partition Quality Functions
A partition of the graph P has to be scored so as to quantify its quality. As there is no
universally accepted definition of a community, there have been several measures to quantify the
quality of partitions. Some popular quantifications include Cut, Ratio Cut, Normalized Cut
and Modularity.
Figure 3.1: Communities in a Graph (Image from [31])
Cut: The number of inter-community edges in a partition of a graph.
$$\mathrm{Cut}(P) = \sum_{C \in P} \sum_{i \in C, j \notin C} A_{ij} \qquad (3.2)$$
The partition of a graph into two communities that minimizes the cut (the min-cut problem) can be
solved in polynomial time by computing the max-flow [17]. The problem with this measure
is that it takes no account of the internal density of the clusters, leading to imbalanced
trivial partitions of one node in one cluster and the other nodes in the other cluster.
Ratio Cut: The ratio cut addresses the tendency of Cut to favour unbalanced partitions by
dividing the number of inter-community edges of each community by the product of the
size of the community and the size of the rest of the graph.
\[ \mathrm{RatioCut}(P) = \sum_{C \in P} \frac{\sum_{i \in C,\, j \notin C} A_{ij}}{|C|\,|V \setminus C|} \tag{3.3} \]
The disadvantage of this is it does not consider the internal density of the community.
Normalized Cut [62]: Normalized Cut was proposed to account for the density of the cluster
as well by normalizing the number of inter-community edges of each community by the total
degree or volume of the community.
\[ \mathrm{NormalizedCut}(P) = \sum_{C \in P} \frac{\sum_{i \in C,\, j \notin C} A_{ij}}{Vol(C)} \tag{3.4} \]
where $Vol(C) = \sum_{i \in C} d_i$.
The disadvantage of the above quality functions Cut, RCut, and NCut is that they are all
minimized, and they attain their best value of 0 when the partition consists of the whole graph as a single community.
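The behaviour of the three measures can be illustrated on a toy graph. The sketch below evaluates Equations 3.2 to 3.4 exactly as written (the double sum counts each inter-community edge once from each side). The graph, two triangles joined by a bridge plus one pendant node, and all function names are assumptions made for illustration.

```python
def build_adjacency(edges):
    """Undirected, unweighted adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def boundary(adj, community):
    """Sum of A_ij over i in C and j not in C."""
    return sum(1 for i in community for j in adj[i] if j not in community)

def cut(adj, partition):
    return sum(boundary(adj, c) for c in partition)

def ratio_cut(adj, partition):
    n = len(adj)
    return sum(boundary(adj, c) / (len(c) * (n - len(c))) for c in partition)

def normalized_cut(adj, partition):
    # Vol(C) is the sum of the degrees of the nodes in C.
    return sum(boundary(adj, c) / sum(len(adj[i]) for i in c) for c in partition)

# Two triangles joined by the bridge (2, 3), with a pendant node 6 attached to node 5.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (5, 6)]
adj = build_adjacency(EDGES)
balanced = [{0, 1, 2}, {3, 4, 5, 6}]
unbalanced = [{6}, {0, 1, 2, 3, 4, 5}]

for name, part in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(name, cut(adj, part), ratio_cut(adj, part), normalized_cut(adj, part))
```

On this graph Cut gives the same score to both bisections, while Ratio Cut and Normalized Cut both prefer the balanced one.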
Modularity [48]: This quality function compares the community structure of a graph to a
random graph that is not expected to have any community structure. It compares the number
of internal edges in the graph to the expected number of edges in an equivalent null model. In
its most general form it can be written for a partition P as
\[ \mathrm{Modularity}(P) = \sum_{C \in P} \frac{1}{2|E|} \sum_{i,j \in C} (A_{ij} - P_{ij}) \tag{3.5} \]
where $P_{ij}$ is the expected number of edges between nodes i and j under a null model. The null
model typically used is the configuration model, which is a random graph with the same degree
sequence. This leads to the standard definition of modularity
\[ \mathrm{Modularity}(P) = \sum_{C \in P} \frac{1}{2|E|} \sum_{i,j \in C} \left( A_{ij} - \frac{d_i d_j}{2|E|} \right) \tag{3.6} \]
The advantage of modularity is that it considers both the internal density as well as the external
sparsity of the community.
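As a worked example of the standard definition, the sketch below evaluates modularity with the configuration null model directly from the double sum over ordered node pairs within each community. The toy graph (two triangles joined by one edge) and the function names are illustrative assumptions.

```python
def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def modularity(adj, partition):
    """Modularity with the configuration null model P_ij = d_i * d_j / (2|E|)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # 2|E|
    total = 0.0
    for community in partition:
        for i in community:
            for j in community:                       # ordered pairs, including i == j
                a_ij = 1.0 if j in adj[i] else 0.0
                total += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return total / two_m

adj = build_adjacency([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
triangles = [{0, 1, 2}, {3, 4, 5}]
whole_graph = [{0, 1, 2, 3, 4, 5}]
singletons = [{v} for v in adj]

print(modularity(adj, triangles))     # the two natural communities
print(modularity(adj, whole_graph))   # trivial partition
print(modularity(adj, singletons))    # every node on its own
```

Unlike the cut-based measures, the trivial partition with the whole graph as one community scores 0 here, which is lower than the score of the two-triangle partition (5/14).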
3.4 Community Detection Algorithms
A large number of community detection algorithms (CDA) exist. Several of them have been
surveyed in [58, 18, 8].
The important classes of CDA are surveyed in this section, with an emphasis on the efficiency
of the algorithms for large networks. The treatment of algorithms that deal with overlapping
communities (i.e. when a cover of the graph is desired and a node can belong to more than
one cluster) is beyond the scope of this thesis.
Graph Partitioning Algorithms
These classes of algorithms primarily focus on min-cuts, i.e. partitioning the graph into two or
more groups with the smallest number of edges between clusters. Another property of graph
partitioning algorithms is that they are generally designed to yield balanced clusters.
The Kernighan-Lin algorithm [35] is such an algorithm. It takes a random bisection of
the graph into two equal sized sets, and swaps vertices between these sets so as to minimize the
number of edges between them. The procedure takes $O(|V|^2 \log |V|)$ time to yield a bisection of the graph.
To break the graph into more than two clusters, the procedure has to be applied recursively.
The FM-heuristic proposed by Fiduccia et al. [16] reduces this to O(|E|) by considering the
movement of a single node to neighbouring communities instead of node swaps.
METIS[34] is a popular graph partitioning algorithm that builds on the FM heuristic[16], and
integrates it into a multi-level algorithm. A multi-level algorithm works on creating a sequence
of coarse graphs obtained by grouping nodes and edges. The METIS algorithm coarsens the
graph by performing edge matching, followed by the FM-heuristic at the coarsest graph recur-
sively to obtain the required number of clusters, followed by expansion of the edges at each
level in the uncoarsening phase, to obtain the final cut.
The main drawbacks of the graph partitioning algorithms are the need to set the number of
clusters and a parameter to tweak the balance or the sizes of the clusters. This is difficult to
do in practice, as multiple runs with different values of the number of clusters will potentially
increase the running time. Thus METIS and other such algorithms are not suitable candidates
for automatic community detection in large graphs.
Divisive Algorithms
Divisive algorithms begin with all the nodes of the graph in one community and iteratively
divide the communities by edge removal. Girvan et al. proposed the GN algorithm [22], which is
a seminal work in Community Detection Algorithms. The algorithm involves iterative removal
of edges with high edge betweenness. The edge betweenness is defined analogously to the
betweenness centrality of a node: it scores an edge high if a large number of shortest paths pass
through it. Such edges with high edge betweenness are likely to be bridge or inter-community edges.
A hierarchical clustering is obtained that can be pictorially represented as a dendrogram. The
appropriate partition can be obtained by cutting the dendrogram at some level, based on some
partition scoring function such as modularity [48]. This method runs in time $O(|E|^2 |V|)$ for
unweighted graphs due to recomputation of the edge betweenness O(|E|) times, with each
computation involving an evaluation of shortest paths from all sources. This is
much too slow for even medium sized graphs.
Radicchi et al. [53] proposed a divisive algorithm that removes edges with a low edge clustering
coefficient, which quantifies the number of triangles an edge is a part of, with the intuition
that inter-community edges will participate in fewer triangles as compared to intra-community
edges. The edge clustering coefficient is a local measure and, unlike all-pairs shortest paths,
can be computed more efficiently. The algorithm again produces a hierarchical clustering.
Owing to their high time complexity, divisive algorithms are not good candidates to detect
communities in large graphs.
Spectral Algorithms
Spectral algorithms rely on the projection of the graph into an eigenspace, and on the
optimization of a relaxed version of the combinatorial optimization problem. A detailed
introduction to Spectral Clustering is found in [74].
Optimizing the Ratio Cut exactly is NP-hard. An approximate solution can be obtained using
spectral methods. The eigenvector corresponding to the second smallest eigenvalue (also known
as the Fiedler vector) of the graph Laplacian L = D − A, where D is the diagonal matrix of node
degrees, can be used to partition the graph into two groups such that the ratio cut is approximately
minimized. The eigenvector can be computed in $O(|V|^2 \log |V|)$ time using the Lanczos [38] method.
The bipartition can be obtained by assigning all the nodes corresponding to positive components of
the eigenvector to one group, and those corresponding to negative components to the other. For more
than two clusters, the procedure can be repeated recursively on each cluster.
The problem of optimizing the Normalized Cut exactly in order to obtain communities is NP-
complete [62]. A spectral method based on the first k eigenvectors of the transition probability
matrix P can be used to obtain k communities that have minimum normalized cut. The k eigen-
vectors once obtained are used as feature vectors for the k-means clustering algorithm[40],
which provides the final communities.
The main drawback of the above spectral methods is the time complexity, as the computation of
the eigenvectors is slow. Another important drawback is the need to set the number of clusters,
though some heuristics that consider the largest differences between successive eigenvalues
can be used [74].
Dhillon et al. [14] proved the equivalence of spectral clustering and weighted kernel k-means,
and proposed a multi-level algorithm called GraClus to produce a clustering that optimizes
normalized cut. This brings down the time complexity of the algorithm to O(|E|). However, a key
limitation of this and the other spectral algorithms is the need to specify the number of clusters.
Modularity can be optimized by a spectral relaxation as well. This was proposed by Newman [47].
Similar to the existing spectral methods, the leading eigenvector of the modularity matrix,
given by
\[ B_{ij} = A_{ij} - \frac{d_i d_j}{2|E|} \]
can be used to partition the graph into two communities based on the signs of the components.
The communities are further recursively divided applying the same method, stopping when
there is a decrease in the overall modularity of the partition. In the original work, a method to
improve the bisection was proposed by a modified version of the Kernighan-Lin method[35]
to optimize modularity. The spectral method has been demonstrated to produce partitions of
very high modularity. This method scales as O(|V | |E|), and can be used only for medium
sized graphs.
Random Walk Based Algorithms
Intuitively, owing to the large number of paths through internal vertices, a random walker will
spend more time within a community than outside it. This fact can be used to partition a graph
into communities.
Early random walk based algorithms were modifications of the divisive edge removal
algorithms. An edge betweenness measure based on random walks was defined by Newman
et al. [46]. It takes $O(|V|^3)$ time to compute, which is prohibitive for large graphs.
Pons and Latapy proposed Walktrap [49], an agglomerative hierarchical clustering algorithm
based on a distance measure that is computed by performing short random walks. The distance
between a pair of nodes is the Euclidean distance between the rows of the transition probability
matrix corresponding to the two nodes. The hierarchical agglomeration is done on
the basis of Ward's method, and the dendrogram is cut at the level that has maximum
modularity. The method scales as $O(|V|^2 \log |V|)$.
Van Dongen proposed the Markov Clustering Algorithm (MCL) [71], which operates on the
transition probability matrix $P = AD^{-1}$ of a graph. It relies on the iterative application of two
operators on P: expansion and inflation. The expansion operation raises P to a
specified power t, so that the matrix gives the transition probabilities of a random
walker to reach a vertex j from i in t steps. The intuition is that for small values of t, the
random walker is likely to still be within the community. The inflation operation squares each
element of the matrix, and renormalizes each column to obtain a stochastic matrix again (the
columns of $AD^{-1}$ sum to one). The inflation step is meant to enhance the probabilities to
intra-community vertices, allowing the random walker to spend a longer time within the community.
Intuitively the whole process iteratively spreads flow on the graph, enhancing the probability
for flow to accumulate within communities, and diminishing the probability of the flow travelling
along inter-community edges. On convergence, the connected components of the graph corresponding
to the final matrix are the communities. A naive implementation of the algorithm takes $O(|V|^3)$
time; an optimization that retains only the top k non-zero elements at each step reduces this to
$O(|V| k^2)$ time. A multilevel version of this algorithm, MLR-MCL, was recently proposed by
Satuluri et al. [57]. It further speeds up the computation by performing MCL only on the coarsest
graph, allowing MCL to be used for larger graphs.
Rosvall and Bergstrom have proposed Infomap [56], which partitions nodes by finding an optimal
compression of random walk paths on the network. This is done by minimising the
average description length of random walks. The authors originally use a simulated annealing
based algorithm, which is too slow to handle large graphs, but a recursive greedy
method was also proposed to optimize this objective in time O(|E|).
Label Propagation
Raghavan et al. have proposed a simple algorithm that detects community structure by
propagating labels iteratively [54]. Initially all nodes have different labels. Then, in each iteration,
each node takes the label that the majority of its neighbours have. For
each iteration the order of processing of nodes is randomized. The algorithm scales as O(|E|).
Label Propagation finds a partition that is a local optimum of
\[ \sum_{C \in P} \frac{1}{2|E|} \sum_{i,j \in C} A_{ij}, \]
which is modularity with a null model of $P_{ij} = 0\ \forall i, j$, and whose global optimum
corresponds to all vertices in one community [69].
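The label update rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation of [54]: a node keeps its current label whenever that label is already among the most frequent ones in its neighbourhood, other ties are broken at random, and the loop stops after a full pass without changes. The function name and the toy graph are hypothetical.

```python
import random
from collections import Counter

def label_propagation(adj, seed=0):
    """Asynchronous label propagation on an adjacency dict {node: set(neighbours)}."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}            # every node starts with a unique label
    nodes = list(adj)
    changed = True
    while changed:
        changed = False
        rng.shuffle(nodes)                  # random processing order in each iteration
        for v in nodes:
            if not adj[v]:
                continue                    # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            if counts[labels[v]] == best:
                continue                    # assumption: keep the current label on ties
            candidates = sorted(l for l, c in counts.items() if c == best)
            labels[v] = rng.choice(candidates)
            changed = True
    return labels

# Two 4-cliques joined by a single edge (3, 4).
adj = {v: set() for v in range(8)}
for group in (range(0, 4), range(4, 8)):
    for a in group:
        for b in group:
            if a != b:
                adj[a].add(b)
adj[3].add(4)
adj[4].add(3)

labels = label_propagation(adj)
print(labels)
```

With this tie rule every label change strictly increases the number of edges whose endpoints share a label, so the loop terminates. The partition found depends on the random processing order.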
Greedy Modularity Optimization Algorithms
Modularity and other objective functions can also be optimized using methods such as Simulated
Annealing, Genetic Algorithms and Extremal Optimization [18], which achieve solutions very
close to the global optimum. However, these methods are very slow, and unsuitable for large
graphs. For handling large graphs, greedy algorithms that can quickly provide approximate
solutions have been proposed.
The CNM algorithm proposed by Clauset et al. [6] performs a hierarchical optimization of
modularity. All nodes start in isolated communities, and in each stage the pair of neighbouring
communities whose merger gives the best increase in modularity is merged. The algorithm scales as
$O(|E| \log^2 |V|)$.
A faster multi-level algorithm, popularly known as the Louvain method, was proposed by
Blondel et al. [3]. It begins with each node in its own community and, instead of merging
communities, iteratively moves each node from its community to the one that gives the best increase in
modularity. On convergence, each community is collapsed to form a node, leading to a smaller
graph. This smaller graph is then subjected to the same method to obtain communities for the
next stage. The algorithm runs in O(|E|) time and scales to very large graphs.
3.5 Efficiency Comparison of Community Detection Algorithms
In order to handle the large graphs typically associated with traffic, the algorithm must scale
as O(|E|) or similar. Among the algorithms discussed in Section 3.4, the Louvain method,
Label Propagation, Infomap and MLR-MCL are suitable candidates.
In order to test and compare the running times of the algorithms, the benchmark graphs proposed
by Lancichinetti et al. [37] are generated and used to test the CDA.
3.5.1 Dataset - LFR Benchmarks
The LFR benchmark graphs involve generating a network whose degrees follow a scale-free
or a Pareto distribution like most real world networks[7]. Edges are added according to the
configuration model. In the next step, nodes are assigned to communities randomly such that
the community sizes also follow a Pareto/Power Law distribution. To create the final bench-
mark each inter-community edge is rewired to form an intra-community edge with probability
1 − µ, where µ is the mixing parameter. We generate LFR graphs of increasing sizes
and compare the running times of the selected algorithms in Figure 3.2.
Figure 3.2: Comparison of CDA on LFR Benchmarks
Figure 3.3: Comparison of CDA on LFR Benchmarks - Louvain vs CNM
3.5.2 Discussion and Algorithm Selection
It can be seen in Figure 3.2 that though Label Propagation (LP) scales as O(|E|), its constants are
not as good as those of the Louvain method, which is about 50 times faster. The
MLR-MCL algorithm works faster than Label Propagation, but is still slower than the Louvain
method by about 15 times. Though the Infomap method is loosely based on a method similar
to Louvain, it has an additional step where it recursively applies the technique to the clusters
obtained, thus making it slower than the original method by about 10 times.
The CNM algorithm (discussed in Section 3.4) was the fastest CDA tested by BotGrep [44]. Figure
3.3 compares the running time of the Louvain method to that of the CNM method. The Louvain method is
found to be over 200 times faster than the CNM method. This further reaffirms the motivation
of finding scalable CDA for Structured P2P Botnet Detection.
Thus the results indicate that the Louvain method is the most efficient among the candidate
algorithms, and it is potentially a good candidate for application to detect structured P2P bot-
nets.
3.5.3 Conclusion
In this chapter various terminologies related to graphs have been defined. Community Structure
was described, and several partition quality functions were reviewed. A brief survey of
the various classes of Community Detection Algorithms was provided. A candidate set of
scalable Community Detection Algorithms for detection of Structured P2P Botnets was selected,
and their efficiency was compared on the standard community detection benchmarks.
The Louvain method was found to be the fastest among this set, and identified
as a potentially viable algorithm to detect Structured P2P botnets in terms of efficiency. The
next chapter further explores the suitability of the Louvain method by applying it to datasets
for Botnet Detection generated according to [44].
Chapter 4
Identifying Structured P2P Botnets using
the Louvain Method
4.1 Introduction
Community Detection Algorithms were introduced in Chapter 3 which surveyed various pop-
ular Community Detection Algorithms (CDA). A set of candidate algorithms that could the-
oretically scale well enough to be able to handle the large traffic graphs constructed from
backbone level traces were identified. These candidate algorithms were then tested on the
standard benchmarks for Community Detection Algorithms and the Louvain method emerged
as the fastest algorithm among them. The aim of this chapter is to apply the Louvain method
in order to identify structured P2P botnets. This will be done using only the topological infor-
mation contained in an undirected, unweighted graph constructed from packet traces.
The rest of this chapter is organized as follows. Section 4.2 describes the Louvain
method in detail. Section 4.3 describes the procedure that is used to generate graphs with
embedded structured P2P botnets for the purpose of experimentation. Section 4.4 describes
an application of the Louvain method on the generated data and discusses the applicability
of the Louvain method to detect structured P2P botnets. Section 4.5 describes the resolution
limit of modularity optimization, and describes the objective function Stability. The results of
the application of Stability optimization by the Louvain method on the generated datasets are
discussed in Section 4.6.
4.2 The Louvain Method
The Louvain method is a greedy modularity optimization algorithm proposed by Blondel et
al. [3]. Modularity, as already discussed in Section 3.3, is given by
\[ Q_{Modularity}(P) = \frac{1}{2|E|} \sum_{C \in P} \sum_{i,j \in C} \left( A_{ij} - \frac{d_i d_j}{2|E|} \right) \]
The method is a multi-level/multi-stage iterative process. It consists of two main steps: a
modularity optimization step, and a community aggregation step. The algorithm is defined
in Algorithm 1 and illustrated in Figure 4.1. In each level or stage, the objective function is
greedily optimized by iteratively moving single nodes to communities that yield the highest
gain. The greedy optimization will result in a local maximum of the objective function. The
communities obtained are collapsed into a graph on which the same greedy optimization is run
to merge communities. This allows the algorithm to climb out of the local maximum.
Figure 4.1: The Louvain Method (Image from [3])
Algorithm 1: The Louvain Method
Input : The graph G
Output: The partition P of G into communities
begin
    OldGraphSize ← |V|
    Gw ← G
    P ← ∅
    while OldGraphSize ≠ |V_Gw| or |V_Gw| = |V| do
        OldGraphSize ← |V_Gw|
        Ptemp ← GreedyOptimizationQModularity(Gw)
        P ← PartitionExpansion(G, Gw, Ptemp)
        Gw ← CommunityAggregation(G, P)
    end
end
4.2.1 Greedy Modularity Optimization
Modularity (discussed in Section 3.3) is optimized by a greedy procedure in each stage of the
Louvain method. In this step, each node of the graph is initially assigned to its own community.
Then, iteratively, each node is removed from its community and moved to the community that
provides the maximum gain in modularity. This is outlined in Algorithm 2.
The time complexity of the algorithm is determined by the gain computation step. If the
computation of gain can be done in constant time, the time complexity of this step is O(|E|).
Following [3], the gain in modularity for the movement of one node i from its community $C_i$
to the community $C_j$ of a neighbouring node j, with $Vol(C_i \setminus \{i\})$ denoting the
volume of $C_i$ once node i has been removed, is given by
\[ \Delta Q_{Modularity}(i, j) = \frac{1}{|E|} \left( IntEdgesNode(C_j, i) - IntEdgesNode(C_i, i) - \frac{d_i}{2|E|} \left( Vol(C_j) - Vol(C_i \setminus \{i\}) \right) \right) \tag{4.1} \]
Algorithm 2: GreedyOptimizationQModularity
Input : The graph G
Output: The partition P of G into communities
begin
    initialize P such that each node is in its own community
    Qprev ← −∞
    Qcurrent ← Modularity(P)
    while Qcurrent > Qprev do
        Qprev ← Qcurrent
        for node i ∈ V do
            MaxGain ← 0
            for node j adjacent to i do
                MaxGain ← max(MaxGain, ∆QModularity(i, j))
            end
            move the node i to the community that results in MaxGain (i stays in place if MaxGain = 0)
        end
        Qcurrent ← Modularity(P)
    end
end
where $Vol(C) = \sum_{i \in C} d_i$ is the sum of degrees of the nodes in the community C and
$IntEdgesNode(C, i) = \sum_{k \in C} A_{ik}$ is the number of edges node i has to other nodes in
community C. The Vol(C) term can be maintained for each community in O(|P|) space, and updated
in constant time, while IntEdgesNode(C, i) can be computed during the pass along the neighbours
of the node i, thus enabling the gain to be computed in O(1) time. This implies that the Greedy
step will scale as O(|E|) when used to optimize modularity.
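The gain can be checked numerically against a full recomputation of modularity. The sketch below assumes the closed form obtained by expanding the definition of modularity before and after the move, in which the volume of the source community is taken after node i has been removed. For clarity it recomputes volumes by scanning all nodes rather than maintaining them incrementally, so it does not reflect the O(1) update described above. All names and the toy graph are illustrative.

```python
def modularity(adj, comm):
    """Modularity of the partition given by the node -> community map `comm`."""
    two_m = sum(len(n) for n in adj.values())
    internal, vol = {}, {}
    for i, nbrs in adj.items():
        c = comm[i]
        vol[c] = vol.get(c, 0) + len(nbrs)
        internal[c] = internal.get(c, 0) + sum(1 for j in nbrs if comm[j] == c)
    return sum(internal[c] / two_m - (vol[c] / two_m) ** 2 for c in vol)

def gain(adj, comm, i, target):
    """Modularity gain of moving node i from its community into `target`."""
    m = sum(len(n) for n in adj.values()) / 2
    d_i = len(adj[i])
    k_target = sum(1 for j in adj[i] if comm[j] == target)     # IntEdgesNode(C_j, i)
    k_own = sum(1 for j in adj[i] if comm[j] == comm[i])       # IntEdgesNode(C_i, i)
    vol_target = sum(len(adj[j]) for j in adj if comm[j] == target)
    vol_own = sum(len(adj[j]) for j in adj if comm[j] == comm[i] and j != i)
    return (k_target - k_own - d_i * (vol_target - vol_own) / (2 * m)) / m

# Two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
comm = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}

moved = dict(comm)
moved[2] = "b"                                   # move node 2 across the bridge
print(gain(adj, comm, 2, "b"))
print(modularity(adj, moved) - modularity(adj, comm))
```

Both printed values agree (the move lowers modularity by 23/98), which is the property the greedy step relies on when it compares candidate moves without recomputing the full objective.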
4.2.2 Community Aggregation
Applying only the objective optimization step yields a large number of
small communities once a local optimum is reached. In order to climb out of the local optimum,
the small communities must be merged.
In this step, each community in the partition is collapsed into a node. A weighted graph is then
constructed with the collapsed communities, with self loops added to account for the internal
edges of each community. Edges connecting several nodes of the same community to the same
neighbouring community are collapsed by setting the edge weight as the number of such edges.
This step can be implemented efficiently in O(|E|) time, as all the edges have to be traversed
only once. Thus the overall time complexity of the Louvain method is O(|E|), as the number
of iterations is usually small.
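The aggregation step amounts to a single pass over the edge list. The following sketch is illustrative: the function name `aggregate` and the choice to store a self loop with weight equal to the number of internal edges are assumptions, and other implementations may store twice that value so that weighted degrees are preserved directly.

```python
from collections import defaultdict

def aggregate(edges, comm):
    """Collapse each community into one node; returns {(a, b): weight} with a <= b."""
    weights = defaultdict(int)
    for u, v in edges:                        # every edge is visited exactly once
        a, b = sorted((comm[u], comm[v]))
        weights[(a, b)] += 1                  # a == b is a self loop for internal edges
    return dict(weights)

# Two triangles joined by the edge (2, 3), already split into communities 0 and 1.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

coarse = aggregate(edges, comm)
print(coarse)
```

The total weight of the coarse graph equals the number of edges of the original graph, so modularity can be evaluated on the coarse graph in the next stage.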
4.3 Dataset Generation
In order to evaluate the applicability of the Louvain method for structured P2P botnet detection,
it is tested on graphs constructed from real world traffic traces of different durations with
synthetic botnet topologies embedded in them. The background graph construction, botnet graph
construction and embedding are done according to the procedure followed by Nagaraja et al. [44] in their
evaluation of BotGrep. The description of this dataset generation process is provided in this
section.
4.3.1 Network Traces
Network traces either include dumps or logs of every packet sent, or are in the format of
NetFlow. NetFlow [5] is a format for exporting network data. A router capable of exporting
data in the NetFlow format aggregates all the packets with the same 5-tuple of Source IP
Address, Destination IP Address, Source Port, Destination Port and Transport Layer Protocol
(TCP or UDP), i.e. a flow, and returns this along with the start and end timestamps of the flow,
the number of packets and the number of bytes sent. NetFlow traces captured at the core routers
of the Abilene ISP, and packet traces captured at an Internet Point of Presence (PoP) provided
by CAIDA are used for evaluation.
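The aggregation of packets into flows can be sketched as below. This is a simplified illustration, not the NetFlow export format itself: the `Packet` record, its field names and the example addresses are hypothetical, and real exporters additionally expire flows using timeouts.

```python
from collections import namedtuple

# Hypothetical packet record; the field names are illustrative.
Packet = namedtuple("Packet", "ts src dst sport dport proto size")

def aggregate_flows(packets):
    """Group packets by their 5-tuple and keep per-flow timestamps and counters."""
    flows = {}
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flow = flows.setdefault(
            key, {"start": p.ts, "end": p.ts, "packets": 0, "bytes": 0})
        flow["start"] = min(flow["start"], p.ts)
        flow["end"] = max(flow["end"], p.ts)
        flow["packets"] += 1
        flow["bytes"] += p.size
    return flows

packets = [
    Packet(10.0, "192.0.2.1", "198.51.100.7", 40000, 80, "TCP", 60),
    Packet(10.5, "192.0.2.1", "198.51.100.7", 40000, 80, "TCP", 1500),
    Packet(11.0, "192.0.2.9", "198.51.100.7", 53000, 53, "UDP", 80),
]
flows = aggregate_flows(packets)
for key, flow in flows.items():
    print(key, flow)
```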
Abilene NetFlow Traces
These are NetFlow traces captured at three core routers of the Abilene ISP (now the Internet2 Network)
located at Salt Lake City UT, Washington D.C., and Chicago IL. The data was captured
over 1 day on the 1st of December 2008 and has no payload information. The three traces will
be referred to as SALT, WASH and CHIC. Apart from the entire 1 day trace, a 1 hour slice
is also extracted. The data is aggregated as /24, i.e. the last 8 bits of each Internet Protocol (IP)
address have been zeroed out for anonymization purposes.
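Zeroing the last 8 bits of an IPv4 address can be done with the standard `ipaddress` module. The helper name `mask_to_slash24` and the example addresses are illustrative.

```python
import ipaddress

def mask_to_slash24(address):
    """Zero the last 8 bits of an IPv4 address, as in the /24 aggregation."""
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)

print(mask_to_slash24("192.0.2.77"))
print(mask_to_slash24("192.0.2.200"))
```

Both example hosts map to the same node after aggregation, which is why the aggregated traces contain fewer distinct hosts.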
CAIDA Traces
This is a larger set of packet traces captured at an OC192 (10 Gbps) Internet Point of Presence (PoP).
The data was captured for 1 hour between 13:00 and 14:00 on the 17th of February 2011 on a
backbone link between Chicago and San Jose City as a part of the 2011 Anonymized Internet
Traces Dataset [75] and contains no payload information.
4.3.2 Background Graph Construction and Properties
The graph is constructed from only the information extracted from the Internet Protocol (IP)
layer of the network traces. Only the source address and destination address fields of each
packet of the trace are parsed. The nodes are IP addresses, and an edge is added between
nodes A and B if there is a packet sent from IP address A to IP address B or vice-versa.
The directionality, the number of packets, the number of bytes, the port numbers etc. are
ignored to create an undirected and unweighted graph with no self loops and no multiple edges.
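A minimal sketch of this construction is given below. The function names and the example endpoints are illustrative, and the mean degree is computed as the sum of the degrees divided by the number of nodes, i.e. 2|E|/|V|.

```python
def build_graph(packet_endpoints):
    """Undirected, unweighted graph from (source, destination) address pairs."""
    adj = {}
    for src, dst in packet_endpoints:
        if src == dst:
            continue                         # no self loops
        adj.setdefault(src, set()).add(dst)  # sets discard repeated edges
        adj.setdefault(dst, set()).add(src)
    return adj

def mean_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

pairs = [("A", "B"), ("B", "A"), ("A", "C"), ("A", "B"), ("C", "C")]
graph = build_graph(pairs)
print(graph)
print(mean_degree(graph))
```

Repeated packets and packets in the reverse direction collapse to one edge, so the five example packets give a graph with three nodes and two edges.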
Table 4.1 contains the details of the graphs extracted from the network traces
described in Section 4.3.1. These graphs will be used as the background graphs for the rest of
the chapter and the thesis. The mean degree of the graph, $\mu(G) = \frac{1}{|V|} \sum_{i \in V} d_i$, is also computed.
Background Graph   |V|       |E|        Mean Degree
WA-1H              119203    967002     16.22
CH-1H              206390    1639800    15.89
WA-1D              217504    5188343    47.71
CH-1D              297732    12156830   81.66
CA-CH-1H           2716915   8867080    6.53
CA-SJ-1H           8427335   28109401   6.67
Table 4.1: Properties of Graphs extracted from the Network Traces
Densities of the Background Graphs
The Abilene graphs (mean degree > 15) are much denser than the CAIDA graphs (mean
degree ≈ 6). This is due to the /24 aggregation, which results in a smaller number of hosts.
This density difference is important, as Community Detection Algorithms roughly depend
on the fraction of internal edges of a community relative to the fraction of external edges. The
above datasets thus provide background graphs with a wide range of densities for evaluation of the
applicability of the Louvain method. Among the Abilene graphs, the 1 hour and the 1 day graphs
also differ in density: the 1 hour traces have mean degree < 17, while the 1 day traces
have mean degree > 40.
4.3.3 Structured P2P Graph Generation
Since the focus is on modelling and exploiting the connectivity features of only the Command
and Control (C2C) communication of a structured P2P botnet, we generate graphs of the popular
Distributed Hash Tables (DHT) KADEMLIA, CHORD and KOORDE (discussed in Section
2.3.2) based on the routing table stored at each peer, i.e. an edge is added between two hosts A
and B if A has an entry for B in its routing table, as done in [44]. Graphs of the three topologies
of sizes 1000, 10000 and 100000 nodes are generated for the analysis.
CHORD: For a graph of N nodes, links are added using the rule that each node i will connect
to nodes i − 1 and i + 1 to complete the ring, and have long-range links to nodes
$(i + 2^k) \bmod N$ for $k = 1, \dots, (\log_2 N) - 1$. Thus each node has a degree of $\log_2 N$ peers.
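The CHORD rule can be sketched as follows. The function name is illustrative, and using the floor of log2 N when N is not a power of two is an assumption that the text does not specify. The link to node i − 1 does not need to be added explicitly, because it is the ring link of node i − 1 to its successor.

```python
import math

def chord_edges(n):
    """Undirected edge set of a CHORD-like overlay on nodes 0 .. n-1."""
    fingers = int(math.log2(n))     # assumption: floor(log2 n) for general n
    edges = set()
    for i in range(n):
        # Ring successor plus long-range links at distance 2^k, k = 1 .. fingers-1.
        targets = [(i + 1) % n] + [(i + 2 ** k) % n for k in range(1, fingers)]
        for j in targets:
            if i != j:
                edges.add((min(i, j), max(i, j)))
    return edges

edges = chord_edges(16)
degree = {v: 0 for v in range(16)}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
print(len(edges), set(degree.values()))
```

For N = 16 every node initiates log2 N = 4 links (distances 1, 2, 4 and 8). Counting incoming links as well, each node has 7 distinct neighbours in the undirected graph.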
KOORDE: A de Bruijn graph of constant degree 10 is generated to represent the KOORDE
topology. A graph of $10^L$ nodes consists of all strings of length L over an alphabet of 10 characters,
and an edge is added from one string to another if the second string is obtained by shifting
the first by one position and appending a single character.
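A de Bruijn graph, in its usual definition, links each string to the strings obtained by dropping its first character and appending any one character of the alphabet, which gives every node a constant out-degree equal to the alphabet size. The sketch below follows that definition; the function name is illustrative.

```python
from itertools import product

def de_bruijn_edges(length, alphabet="0123456789"):
    """Directed edges of the de Bruijn graph on strings of the given length."""
    edges = []
    for chars in product(alphabet, repeat=length):
        node = "".join(chars)
        for ch in alphabet:
            edges.append((node, node[1:] + ch))   # shift left by one, append ch
    return edges

edges = de_bruijn_edges(2)
nodes = {u for u, _ in edges}
print(len(nodes), len(edges))
```

For L = 2 this yields 100 nodes with 10 outgoing links each. Strings made of a single repeated character link to themselves, and such self loops would be dropped when building a simple graph.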
KADEMLIA: In order to construct the topology graph, for every node, the partition of the
space into buckets is generated according to Section 2.3.2, and an edge is added between
the node and a randomly chosen node in each bucket.
4.3.4 Embedding the Botnet
In this analysis, a single botnet graph is embedded in one of the background graphs to create
a single dataset. The botnet graph is embedded in the background graph by mapping each
botnet node to a node in the background graph uniformly at random. Then the edges of the botnet
graph are added, in addition to the edges of the background graph, in order to create
the final superimposed graph. Thus each bot node will have edges to benign nodes
in addition to other bots. The mapping of bot nodes to the background serves as ground truth
for validation. An example superimposed graph is depicted in Figure 4.2. For our experiments,
botnet graphs of the different sizes described in Section 4.3.3 are embedded in the different
background graphs described in Section 4.3.2 to create the final datasets for analysis. The number
of distinct datasets (one botnet graph per background graph) is equal to 6 (Number of
Background Graphs) × 6 (Number of Botnet Topologies) = 36.
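The embedding step can be sketched as below. Two details are assumptions made for illustration: bot nodes are mapped to distinct background nodes (sampling without replacement), and a fixed seed is used for reproducibility. The function name and the toy graphs are hypothetical.

```python
import random

def embed_botnet(background_adj, botnet_edges, seed=0):
    """Superimpose a botnet graph on a background graph; returns (graph, mapping)."""
    rng = random.Random(seed)
    bots = sorted({v for edge in botnet_edges for v in edge})
    # Assumption: each bot is mapped to a distinct background node.
    hosts = rng.sample(sorted(background_adj), len(bots))
    mapping = dict(zip(bots, hosts))
    merged = {v: set(nbrs) for v, nbrs in background_adj.items()}
    for u, v in botnet_edges:
        a, b = mapping[u], mapping[v]
        merged[a].add(b)                  # botnet edges are added on top of
        merged[b].add(a)                  # the existing background edges
    return merged, mapping

# Background: a ring of 10 hosts. Botnet: a triangle of three bots.
background = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
botnet = [("b0", "b1"), ("b1", "b2"), ("b0", "b2")]

merged, mapping = embed_botnet(background, botnet)
ground_truth = set(mapping.values())      # background nodes that host a bot
print(mapping)
```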
Dataset Naming Convention
A single dataset or instance consists of a background graph described in Table 4.1, and a single
botnet topology from Section 4.3.3, with the embedding done as described in the previous
section. The naming convention followed for the rest of the thesis is
<Background Graph>-<Botnet Topology (CHO|KOO|KAD)>-<Botnet Size>
Figure 4.2: An example of a graph with a superimposed botnet (Image from [44])
Thus a dataset named as CH-1H-CHO-1000 implies that the dataset is a 1000 node CHORD
graph embedded in a background graph extracted from the 1 hour trace captured at the CHIC
router from the Abilene ISP.
4.4 Application of the Louvain method to identify structured
P2P botnets
In this section the Louvain method is applied to detect communities in the graphs generated
according to Section 4.3. BotGrep [44] (described in Section 2.5.1) is also run on the same
datasets with the input parameters set according to the reference.
4.4.1 Datasets and Evaluation
Datasets
The Abilene 1 Hour and 1 Day Traces are considered for this experiment. Thus the WASH-1H,
CHIC-1H, WASH-1D and CHIC-1D graphs are used as the background (Section 4.3.2). Botnets
of each of the three topologies of sizes 1000 and 10000 (Section 4.3.3) are embedded, which
corresponds to the percentage of botnet nodes being within 0.1%-8%. Thus a total of 24 datasets
(4 background graphs × 6 botnet graphs) are generated and are named according to the
convention in Section 4.3.4.
Metrics
The Louvain method is an unsupervised method and outputs a set of communities. In order
to identify the botnet community, as in [44] it is assumed that honeypot information, i.e. a set of
known bots, is available, and that the honeypot nodes belong to the cluster with the largest number of bots.
Thus the purity of the cluster C with the largest number of bot nodes is evaluated. Since
there is only one botnet graph embedded in the background graph, it becomes a binary
classification problem. Let B be the set of the actual bots (obtained from ground truth), and
C be the detected cluster. Then the accuracy is quantified by Precision, Recall and FScore.
\[ \mathrm{Precision}\ P = \frac{|B \cap C|}{|C|} \tag{4.2} \]
Precision represents the purity of the community as the fraction of botnet nodes
among the total nodes in the community.
\[ \mathrm{Recall}\ R = \frac{|B \cap C|}{|B|} \tag{4.3} \]
Recall quantifies the detection rate, i.e. the fraction of all bots that are identified.
\[ \mathrm{FScore}\ F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4.4} \]
FScore is the harmonic mean of Precision and Recall and summarises both values.
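The three metrics follow directly from set operations on the ground truth B and the selected cluster C. In the sketch below the function names and the example sets are illustrative, and the cluster is chosen as the one containing the largest number of known bots.

```python
def select_cluster(partition, bots):
    """Cluster that contains the largest number of bots."""
    return max(partition, key=lambda cluster: len(cluster & bots))

def precision_recall_fscore(bots, cluster):
    hits = len(bots & cluster)                     # |B ∩ C|
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(bots) if bots else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

bots = set(range(10))                              # ground truth: nodes 0..9
partition = [set(range(9)) | {20, 21, 22}, {9, 30}, {40, 41}]
cluster = select_cluster(partition, bots)
precision, recall, fscore = precision_recall_fscore(bots, cluster)
print(precision, recall, fscore)
```

Here the selected cluster holds 9 of the 10 bots together with 3 benign nodes, so Precision is 9/12 and Recall is 9/10.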
4.4.2 Results and Discussion
The Louvain method is run on each of the 24 graphs considered, and the Precision, Recall
and FScore for the cluster with the largest number of bots are computed. BotGrep is also
implemented and run on the same datasets, with its input parameters set according to the
associated publication [44]. The cases of the 1 hour and 1 day background graphs are discussed
separately.
Abilene 1 Hour Trace Graphs
Tables 4.2a and 4.2b show the values of Precision and Recall of the Louvain Method and
BotGrep for the different botnet graphs embedded in the background graphs WA-1H and CH-
1H extracted from the 1 Hour Abilene Traces. Figures 4.3a and 4.3b plot the FScore of the
Louvain Method and BotGrep. The measures Precision, Recall and FScore are all computed
according to Section 4.4.1.
For the WASH-1H datasets, Figure 4.3a indicates that the Louvain method performs only
slightly worse than BotGrep in terms of FScore: BotGrep achieves an FScore of about 0.95, while
the Louvain method achieves about 0.9 in most cases. Table 4.2a indicates that the Recall of
the Louvain method (0.92-0.96) is comparable to that of BotGrep (Recall between 0.94-
0.98). The lower FScore achieved by the Louvain Method is due to its lower Precision
(0.9-0.95) as compared to BotGrep, which has Precision 1.0 in the majority of the datasets.
For the CHIC-1H datasets, Figure 4.3b indicates that BotGrep achieves poor FScores, less
than 0.1 in some cases. Compared with its performance on WASH-1H (Figure 4.3a), this
indicates that the algorithm is very sensitive to the input parameters. Figure 4.3b also indicates
that the Louvain method again achieves FScores of about 0.9 in most cases, as in WASH-1H.
It can be observed from Table 4.2b that BotGrep performs very poorly owing to small values
of Recall (< 0.5 in most cases). In the case of the Louvain method, both Precision and
Recall are > 0.9 in most cases.
Thus in the WASH-1H datasets, the performance of the Louvain method is comparable to that
of BotGrep, and in the CHIC-1H datasets, the performance of the Louvain method is found to
be better than that of BotGrep. In the case of the Abilene 1 Hour background graphs, the
Louvain method is therefore able to identify the embedded structured P2P botnets, and its
performance is found to be comparable to or better than that of BotGrep.
Figure 4.3: Performance of the Louvain Method on Abilene Trace Graphs. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.
Abilene 1 Day Trace Graphs
For the WASH-1D datasets, Figure 4.3c indicates that BotGrep outperforms the Louvain
Method. BotGrep achieves an FScore > 0.85 in most cases, while the Louvain method performs
very poorly, achieving only about 0.2 or less. Table 4.2c indicates that the poor performance
of the Louvain method is due to its Precision values (< 0.2 in all cases), as opposed
Botnet      %Bots   BotGrep P   BotGrep R   Louvain P   Louvain R
CHO-1000    0.84    1.00        0.95        0.90        0.96
CHO-10000   8.39    1.00        0.95        0.95        0.95
KOO-1000    0.84    1.00        0.97        0.90        0.95
KOO-10000   8.39    0.99        0.94        0.96        0.93
KAD-1000    0.84    1.00        0.96        0.90        0.93
KAD-10000   8.39    0.99        0.95        0.84        0.92
(a) WA-1H

Botnet      %Bots   BotGrep P   BotGrep R   Louvain P   Louvain R
CHO-1000    0.48    1.00        0.34        0.99        0.94
CHO-10000   4.85    1.00        0.03        0.98        0.95
KOO-1000    0.48    0.00        0.00        0.99        0.94
KOO-10000   4.85    1.00        0.08        0.80        0.94
KAD-1000    0.48    1.00        0.54        0.99        0.94
KAD-10000   4.85    1.00        0.34        0.93        0.92
(b) CH-1H

Botnet      %Bots   BotGrep P   BotGrep R   Louvain P   Louvain R
CHO-1000    0.46    1.00        0.77        0.01        0.87
CHO-10000   4.60    0.98        0.79        0.14        0.88
KOO-1000    0.46    1.00        0.61        0.01        0.88
KOO-10000   4.60    1.00        0.76        0.13        0.85
KAD-1000    0.46    1.00        0.77        0.01        0.87
KAD-10000   4.60    0.98        0.80        0.14        0.89
(c) WA-1D

Botnet      %Bots   BotGrep P   BotGrep R   Louvain P   Louvain R
CHO-1000    0.34    1.00        0.96        0.01        0.82
CHO-10000   3.36    1.00        0.35        0.09        0.80
KOO-1000    0.34    1.00        0.97        0.01        0.82
KOO-10000   3.36    1.00        0.40        0.08        0.76
KAD-1000    0.34    1.00        0.54        0.01        0.82
KAD-10000   3.36    1.00        0.43        0.07        0.70
(d) CH-1D

Table 4.2: Comparison of Louvain Modularity and BotGrep on Abilene Traces (P = Precision, R = Recall)
to BotGrep, which has Precision 1.0 in most cases. According to Table 4.2c, the Recall of the
Louvain method (0.85-0.89) is in fact higher than that of BotGrep (0.61-0.80).
For the CHIC-1D datasets, Figure 4.3d indicates that BotGrep again outperforms the Louvain
method, but still performs poorly on this dataset. BotGrep achieves FScores between 0.4-0.7 in
most cases. The Louvain method achieves FScores < 0.2 in all cases. It can be observed
from Table 4.2d that BotGrep performs poorly again owing to the low values of Recall (< 0.5
in most cases), indicating parameter sensitivity, although its Precision is 1 in all cases. The
Louvain method achieves Recall between 0.7-0.85, but shows poor FScores owing to its
lower Precision values (< 0.2 in all cases).
Thus in both the WASH-1D and the CHIC-1D datasets, the Louvain method detects a large
number of benign nodes apart from the bots, yielding larger communities than desired. The
performance is found to be much poorer than BotGrep.
Summary
The results of the application of the Louvain method on the Abilene background graphs are
summarised in Table 4.3. From the contrasting results of the application of the Louvain Method on
Density   Background   Modularity vs BotGrep
Sparse    WA-1H        Comparable
Sparse    CH-1H        Better
Dense     WA-1D        Worse
Dense     CH-1D        Worse
Table 4.3: Performance Summary of Modularity Optimization using the Louvain Method, with reference to BotGrep
the Abilene 1 Hour and the denser Abilene 1 Day graphs, it can be concluded that the Louvain
Method is strongly affected by the density of the Background Graph. This is owing to the
resolution limit that affects Modularity Optimization methods in general. The resolution
limit, and a countermeasure proposed in the literature, are discussed in the next section.
4.5 Community Detection at different resolutions and Multiresolution Modularity
4.5.1 Resolution Limit and Multiresolution Modularity
Modularity Optimization has been shown to suffer from a resolution limit by Fortunato and
Barthelemy[19]. Fortunato and Barthelemy show that the communities detected by modular-
ity optimization methods such as the Louvain Method may be large and made up of loosely
connected smaller but denser communities. These smaller communities thus will not be iden-
tified by the Modularity optimization method. It was also shown that a subgraph is considered
as a good community according to modularity if
$$\text{IntEdges}(C) > \frac{\text{Vol}(C)^2}{2|E|} \tag{4.5}$$
where $\text{Vol}(C) = \sum_{i \in C} d_i$ is the sum of degrees of the nodes in the subgraph C and
$\text{IntEdges}(C) = \sum_{i,j \in C} A_{ij}$ is the number of internal edges in subgraph C.
The above condition is therefore a function of the Internal number of edges of the subgraph,
as well as the number of connections the subgraph has to the rest of the nodes in the graph.
Thus the size can be controlled by modifying the definition of Modularity to incorporate a pa-
rameter that controls the internal density of the community, allowing it to detect communities
at multiple resolutions or scales.
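Condition 4.5 can be checked directly on a small graph. The sketch below is illustrative only: the adjacency-dictionary representation, the helper names and the toy graph (two triangles joined by one edge) are assumptions made for the example. IntEdges(C) is computed as the sum of A_ij over ordered pairs, so each undirected internal edge contributes twice.

```python
def int_edges(adj, community):
    """Sum of A_ij over ordered pairs i, j in the community."""
    members = set(community)
    return sum(1 for i in members for j in adj[i] if j in members)


def volume(adj, community):
    """Vol(C): sum of the degrees of the nodes in the community."""
    return sum(len(adj[i]) for i in community)


def satisfies_modularity_condition(adj, community):
    """Check IntEdges(C) > Vol(C)^2 / (2|E|), i.e. Equation 4.5."""
    two_e = sum(len(nbrs) for nbrs in adj.values())  # 2|E| for a symmetric adjacency
    return int_edges(adj, community) > volume(adj, community) ** 2 / two_e


# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3 (|E| = 7).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
triangle_ok = satisfies_modularity_condition(adj, [0, 1, 2])
whole_ok = satisfies_modularity_condition(adj, list(adj))
print(triangle_ok, whole_ok)
```

For the triangle, IntEdges = 6 and Vol = 7, so the threshold is 49/14 = 3.5 and the condition holds. For the whole graph both sides equal 14, so the strict inequality fails.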
4.5.2 Stability and Stability Optimization
Several equivalent definitions of multiresolution modifications to modularity have been intro-
duced in order to explore communities at various scales or sizes and have been surveyed by
Traag et al. [70]. This chapter will focus on the Stability proposed by Lambiotte et al. [36],
which incorporated the resolution parameter t into the definition of Modularity to obtain
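$$Q_{Stability}(P, t) = \frac{1}{2|E|} \sum_{C \in P} \sum_{i,j \in C} \left( t A_{ij} - \frac{d_i d_j}{2|E|} \right) \tag{4.6}$$
With the introduction of the weighting parameter t in Equation 4.6, the condition 4.5 becomes
$$\text{IntEdges}(C) > \frac{\text{Vol}(C)^2}{2|E|\,t} \tag{4.7}$$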
which is the condition that a subgraph C is a community according to Stability at a given value
of t. Thus as the value of t is increased, larger communities are given preference by Stability,
as the threshold $\frac{\text{Vol}(C)^2}{2|E|t}$ that IntEdges(C) must exceed becomes smaller. Conversely, as t is
decreased, smaller communities are preferred, as the number of internal edges relative to the
total degree of the community must be higher. At t = 1, the original definition of Modularity is
obtained. Stability can be optimized using the Louvain method [59] with trivial modifications
to Equation 4.1, keeping the time complexity unaltered.
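The effect of the resolution parameter can be illustrated by evaluating Stability for a few partitions of a toy graph. The sketch below is a direct, unoptimized evaluation of Equation 4.6 and not the Louvain implementation used in this work; the function name, the adjacency-dictionary representation and the toy graph are assumptions made for the example. It uses the identity that the sum of d_i d_j over all i, j in C equals Vol(C) squared.

```python
def stability(adj, partition, t=1.0):
    """Evaluate Q_Stability(P, t) for an undirected, unweighted graph."""
    two_e = sum(len(nbrs) for nbrs in adj.values())  # 2|E|
    total = 0.0
    for community in partition:
        members = set(community)
        # Sum of A_ij over ordered pairs inside the community.
        internal = sum(1 for i in members for j in adj[i] if j in members)
        vol = sum(len(adj[i]) for i in members)  # sum of d_i
        # sum_{i,j in C} (t*A_ij - d_i*d_j / 2|E|) = t*internal - vol^2 / 2|E|
        total += t * internal - vol * vol / two_e
    return total / two_e


# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
triangles = [[0, 1, 2], [3, 4, 5]]
whole = [list(adj)]
singletons = [[v] for v in adj]

for t in (1.0, 0.25):
    print(t, stability(adj, triangles, t), stability(adj, whole, t),
          stability(adj, singletons, t))
```

At t = 1 the two-triangle partition scores highest (5/14), which is ordinary Modularity. At t = 0.25 the ordering changes and the partition into single nodes scores highest, illustrating the preference for smaller communities as t decreases.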
4.6 Optimization of Stability using the Louvain Method to
identify Structured P2P Botnets
It is concluded in Section 4.4.2 that the density of the background affects the performance
of Modularity Optimization using the Louvain Method to detect structured P2P botnets. In
Equation 4.5, the total degree or Vol(C) of a subgraph C is a function of the density of the
background graph; thus as the background graph gets denser, the number of internal
edges needed by a subgraph to be a valid community by modularity will increase. Thus if the
botnet graph is kept constant and the density of the background increases, the botnet graph will
tend to be detected inside a much larger community containing many benign nodes as well.
This provides an explanation for the failure of the Louvain method in detecting the embedded
structured P2P botnet subgraphs in the case of the denser Abilene 1 Day Traces.
Therefore Stability optimization by the Louvain Method could be used to detect structured
P2P botnets in the case of the denser 1 Day Abilene Graphs by setting the value of t
< 1, which will force Stability to prefer smaller, denser communities (Equation 4.7). Thus the
experiments in Section 4.4 can be repeated by replacing modularity as the objective function
with Stability.
In this section the Louvain method is used to optimize Stability at different values of
t in order to identify structured P2P botnets.
4.6.1 Datasets and Evaluation
As in Section 4.4.1, the same datasets used for the experiments in Section 4.4 are considered.
The Abilene 1 Hour and 1 Day Traces are considered for this experiment; thus the WA-1H,
CH-1H, WA-1D and CH-1D graphs are used as the background (Section 4.3.2). Botnets of each
of the three topologies of sizes 1000 and 10000 (Section 4.3.3) are embedded. The accuracy is
quantified by Precision, Recall and FScore as per Section 4.4.1.
4.6.2 Results and Discussion
The Louvain method, used to optimize Stability at different values of t, is run on each
of the 24 graphs considered, and the Precision, Recall and FScore for the cluster with the
largest number of bots are computed. The values of t chosen range from $2^{-6} = 0.015625$ to
$2^{-2} = 0.25$, increasing by a factor of 4 at each step. Figure 4.4 shows the FScore achieved
for the different values of t, and Table 4.4 shows the Precision and Recall values.
Precision
From Table 4.4 a global pattern can be observed: for all background graphs (WA-1H, CH-1H,
WA-1D and CH-1D), for all topologies (CHORD, KOORDE and KADEMLIA), for all sizes
of botnets (1000 and 10000) and for all values of t (0.25, 0.0625, 0.015625), the Precision is
high (0.85 or above in nearly all cases). This indicates that Stability Optimization produces
pure communities at values of t ≤ 0.25.
Recall and FScore
From Table 4.4 and Figure 4.4 it can be observed that there are large differences in the Recall
achieved by Stability Optimization. The discussion is broken up according to the value of t
and further broken up according to the size of the Botnet.
• t=0.015625: For this value of t, it can be observed that the Recall is low (< 0.5 in most
cases) across backgrounds, botnet topologies and sizes. Consequently the FScore for this
value of t is also low (< 0.5 in most cases).
Figure 4.4: Performance of Stability Optimization on Abilene Trace Graphs. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.
• t=0.0625: For this value of t, it can be observed that there are differences in Recall
depending on the size of the botnet
– Small Botnets: It can be observed that the Recall is high (> 0.9 in most cases) for
Botnet      %Bots   t=2^-6 P   t=2^-6 R   t=2^-4 P   t=2^-4 R   t=2^-2 P   t=2^-2 R
CHO-1000    0.84    0.98       0.16       0.93       0.54       0.93       0.98
CHO-10000   8.39    0.98       0.01       0.98       0.04       0.89       0.24
KOO-1000    0.84    0.98       0.18       0.97       0.96       0.93       0.98
KOO-10000   8.39    0.87       0.01       0.85       0.02       0.98       0.41
KAD-1000    0.84    0.97       0.20       0.93       0.57       0.85       0.97
KAD-10000   8.39    0.98       0.01       0.92       0.04       0.86       0.17
(a) WA-1H

Botnet      %Bots   t=2^-6 P   t=2^-6 R   t=2^-4 P   t=2^-4 R   t=2^-2 P   t=2^-2 R
CHO-1000    0.48    0.99       0.27       0.97       0.97       0.94       0.97
CHO-10000   4.85    0.97       0.02       0.96       0.07       0.97       0.28
KOO-1000    0.48    0.99       0.35       0.98       0.98       0.93       0.97
KOO-10000   4.85    0.99       0.02       0.96       0.03       0.96       0.66
KAD-1000    0.48    1.00       0.22       0.90       0.98       0.73       0.97
KAD-10000   4.85    0.99       0.02       0.95       0.08       0.86       0.32
(b) CH-1H

Botnet      %Bots   t=2^-6 P   t=2^-6 R   t=2^-4 P   t=2^-4 R   t=2^-2 P   t=2^-2 R
CHO-1000    0.46    0.97       0.37       0.97       0.97       0.98       0.93
CHO-10000   4.60    0.98       0.03       0.99       0.08       0.98       0.39
KOO-1000    0.46    0.99       0.68       0.99       0.98       0.99       0.95
KOO-10000   4.60    0.98       0.02       0.93       0.04       0.98       0.89
KAD-1000    0.46    0.98       0.22       0.86       0.98       0.97       0.93
KAD-10000   4.60    0.99       0.02       0.96       0.09       0.97       0.31
(c) WA-1D

Botnet      %Bots   t=2^-6 P   t=2^-6 R   t=2^-4 P   t=2^-4 R   t=2^-2 P   t=2^-2 R
CHO-1000    0.34    0.98       0.35       0.94       0.95       0.90       0.85
CHO-10000   3.36    1.00       0.05       1.00       0.13       0.99       0.91
KOO-1000    0.34    0.99       0.84       0.96       0.95       0.99       0.93
KOO-10000   3.36    0.96       0.02       0.98       0.07       0.38       0.87
KAD-1000    0.34    0.99       0.39       0.97       0.95       0.96       0.86
KAD-10000   3.36    0.99       0.04       0.94       0.09       0.96       0.58
(d) CH-1D

Table 4.4: Optimizing Stability at different values of t on Abilene Traces using the Louvain Method (P = Precision, R = Recall)
the 1000 node botnet graphs of all topologies (CHORD, KOORDE and KADEMLIA).
Correspondingly, the FScores are high for the 1000 node botnet graphs (> 0.9 in
most cases).
– Large Botnets: It can be observed that the Recall is very low for the 10000 node
botnet graphs (< 0.15). Correspondingly, the FScores are low for the 10000 node
botnet graphs (< 0.2).
• t=0.25: For this value of t, it can be observed that the trend is very similar to that
observed with t = 0.0625, with low Recall and FScores for the 10000 node botnets, and
high Recall and FScores for the 1000 node botnets. However, Stability Optimization at
t = 0.25 outperforms the case of t = 0.0625 for the 10000 node botnets
(Recall > 0.3 and FScore > 0.4 in most cases) and performs only slightly worse than the
case of t = 0.0625 for the 1000 node botnets, and is thus the most consistent among the
values experimented with.
Comparison with BotGrep
It was observed that Stability Optimization at t = 0.25 using the Louvain Method provided
the most consistent results among the values of t tested. In Figure 4.5, the FScore achieved by
Stability Optimization (t=0.25) is compared to the FScore achieved by BotGrep on the Abilene
Trace Graphs. From Figure 4.5 it can be observed that Stability Optimization (t=0.25) is
comparable to BotGrep in terms of FScore in the CH-1H and CH-1D datasets. However, in the
case of WA-1H (Figure 4.5a), Stability Optimization (t=0.25) performs poorly for the 10000
node botnets and comparably for the 1000 node botnets. In the case of WA-1D (Figure 4.5c),
Stability Optimization (t=0.25) outperforms BotGrep on the 1000 node botnet graphs, and
performs worse than BotGrep on the 10000 node botnet graphs.
Conclusion
Thus from the results discussed above, it can be concluded that Stability Optimization is more
sensitive to the size of the botnet graph, and less sensitive to the density of the background
graph. As it is sensitive to the size of the botnet graph, it will be difficult to identify a single
value of t that is able to detect all sizes of botnets. Thus the Louvain method to optimize
Stability has to be run at different values of t, increasing the computational complexity of the
process. Further, some method is needed to terminate the parameter search for t by identifying
that a detected community consists predominantly of bots.
4.7 Summary
In this Chapter, the Louvain Method is discussed in detail. The dataset generation procedure is
described, which included the embedding of a botnet graph on to a background graph. Undirected
unweighted background graphs were constructed from real world Network Traces with nodes
Figure 4.5: Comparison of Stability Optimization(t=0.25) and BotGrep on Abilene Trace Graphs. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.
as IP Addresses, and the presence of an edge between two nodes indicating that a packet has
been sent. The background graphs were of different densities based on duration of capture of
Botnet Size   Density   Background   Modularity vs BotGrep   Stability(t=0.25) vs BotGrep
Small         Sparse    WA-1H        Comparable              Comparable
Small         Sparse    CH-1H        Better                  Better
Small         Dense     WA-1D        Worse                   Better
Small         Dense     CH-1D        Worse                   Comparable
Large         Sparse    WA-1H        Comparable              Worse
Large         Sparse    CH-1H        Better                  Worse
Large         Dense     WA-1D        Worse                   Worse
Large         Dense     CH-1D        Worse                   Worse
Table 4.5: Performance Summary of Modularity Optimization vs Stability Optimization(t=0.25) using the Louvain Method, with reference to BotGrep
the Network Trace. The botnet graphs were generated to model structured P2P based Command
and Control Flow. Edges were added according to the routing table at each node. The
Louvain method was used to optimize Modularity in order to detect the embedded structured
P2P botnet graphs from the created dataset. It was then found that Modularity optimization
was sensitive to the density of the background. In order to overcome this dependence, the
Louvain method was then used to optimize Stability at various values of the resolution parameter
t < 1 on the created datasets. It was found that the value of t = 0.25 was the most consistent
across the datasets, and that Stability Optimization is sensitive to the size of the Botnet.
It is concluded that several runs of the method at different values of t are needed in order to
detect all sizes, increasing the computational complexity. Both the Modularity Optimization
and the Stability Optimization methods were compared with BotGrep on the same datasets;
the results are summarised in Table 4.5.
It is found that neither Modularity Optimization nor Stability Optimization is robust and
general. However, the optimization of Stability at values of t < 1 is found to result in small
but homogeneous communities. This property is further explored in the next chapter in order
to arrive at an alternative approach that is robust and efficient in the detection of structured P2P
botnets.
Chapter 5
A robust algorithm for identification of
Structured P2P Botnets
5.1 Introduction
In Chapter 4 the Louvain method is used to optimize Modularity and Stability in order to
detect structured P2P Botnets. It is observed that both methods had shortcomings in terms
of sensitivity to the density of the background or the size of the botnet. It is also observed
that Stability Optimization at values of the resolution parameter t < 1 resulted in small and
homogeneous communities of either only botnet nodes or only benign nodes. This property
can be exploited if a technique is developed in order to filter out the benign communities. In
this chapter a robust algorithm to detect structured P2P botnets based on the filtering of the
benign communities is proposed.
The method of obtaining the homogeneous communities robustly and efficiently is discussed
in Section 5.2. In order to identify the bot communities, they must be differentiated from
the communities of benign nodes. Section 5.3 defines a novel measure, mean regular degree
or mreg which is able to exploit the properties of structured P2P botnet graphs in order to
distinguish them from benign communities. Section 5.4 describes the proposed algorithm that
combines the use of greedy community detection and community filtering using mreg in order
to identify nodes that are a part of a structured P2P botnet. The performance and robustness
of the proposed algorithm are comprehensively validated in Section 5.5 and found to be comparable
in accuracy to BotGrep [44] in most cases, and significantly faster in runtime.
5.2 Obtaining Small and Homogeneous Communities
As discussed in Chapter 4, it is observed that Stability Optimization at values of the resolution
parameter t < 1 resulted in small and homogeneous communities; however, there were
inconsistencies in the performance of Stability Optimization across different values of t for
different sizes of botnets. The search for the correct value of t would increase the computational
time, as the Louvain method would have to be run several times at different values of
t. Thus a non-parametric objective function which behaves similarly to Stability at values of
t < 1 is desired.
5.2.1 An alternative Objective Function - Qw−log−v
In a recent paper, Van Laarhoven and Marchiori [73] compare the performance of various
objective functions optimized by the Louvain method in order to study their resolution biases.
In this work they also introduce a new objective function Qw−log−v. The Louvain method is a
simple, general algorithm that can be used to optimize objective functions other than Modularity
and Stability. The authors observed that optimizing the Qw−log−v objective function using the
Louvain method produces a partition with communities smaller than those obtained by the
optimization of modularity. This objective function is defined as
$$Q_{w-\log-v}(P) = -\sum_{C \in P} \left( \sum_{i,j \in C} \frac{A_{ij}}{2|E|} \right) \log \left( \sum_{i \in C} \frac{d_i}{2|E|} \right) \tag{5.1}$$
Van Laarhoven and Marchiori [72] have used the Louvain method to optimize Qw−log−v suc-
cessfully in the area of Protein Interaction Networks and found that it outperforms Modularity
Optimization.
This objective function is so named by the authors owing to its dependence on
$w = \sum_{i,j \in C} \frac{A_{ij}}{2|E|}$ and $\log v = \log\left(\sum_{i \in C} \frac{d_i}{2|E|}\right)$.
The w term takes values between zero and one, and the log v
term takes negative values as it is the logarithm of v, which varies between zero and one. As
the negative value of the function is maximised, the objective function will score high for
communities that have a small total degree, and have most of the total degree of the community
accounted for by the internal edges. When all the nodes of the graph are considered as one
community, Qw−log−v takes the value 0.
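The behaviour of the objective function can be checked numerically on a small graph. The sketch below is illustrative only: the function name `q_w_log_v`, the adjacency-dictionary representation and the toy graph are assumptions made for the example, and the natural logarithm is assumed. The w term is computed as the sum of A_ij over ordered pairs, following Equation 5.1.

```python
import math


def q_w_log_v(adj, partition):
    """Evaluate Q_{w-log-v}(P) for an undirected, unweighted graph."""
    two_e = sum(len(nbrs) for nbrs in adj.values())  # 2|E|
    q = 0.0
    for community in partition:
        members = set(community)
        w = sum(1 for i in members for j in adj[i] if j in members) / two_e
        v = sum(len(adj[i]) for i in members) / two_e
        if w > 0:  # a community without internal edges contributes 0
            q -= w * math.log(v)
    return q


# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_whole = q_w_log_v(adj, [list(adj)])
q_triangles = q_w_log_v(adj, [[0, 1, 2], [3, 4, 5]])
print(q_whole, q_triangles)
```

With all nodes in one community, w = v = 1 and the value is 0, as stated above. Splitting the graph into its two triangles gives w = 6/14 and v = 1/2 for each community, so the objective rises to (12/14) log 2.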
The objective function can be optimized by the Louvain method with appropriate changes to
the Greedy Optimization Step described in Algorithm 2 in Chapter 4. In this step, after initial-
ization of each node in its own community, each node is iteratively moved to the community
that causes the greatest gain in Qw−log−v. This yields Algorithm 3.
The gain obtained in moving a node i from its community to that of node j in the case of
Algorithm 3: GreedyOptimizationQw−log−v
Input : The graph G
Output: The partition P of G into communities
begin
    initialize P such that each node is in its own community
    Qprev ←− −∞
    Qcurrent ←− Qw−log−v(P)
    while Qcurrent > Qprev do
        Qprev ←− Qcurrent
        for node i ∈ V do
            MaxGain ←− 0
            BestCommunity ←− community of i
            for node j adjacent to i do
                if ∆Qw−log−v(i, j) > MaxGain then
                    MaxGain ←− ∆Qw−log−v(i, j)
                    BestCommunity ←− community of j
                end
            end
            move the node i to BestCommunity
        end
        Qcurrent ←− Qw−log−v(P)
    end
end
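Qw−log−v is written as
$$
\begin{aligned}
\Delta Q_{w-\log-v}(i, j) = -\Bigg( &\frac{\text{IntEdges}(C_j) + \text{IntEdgesNode}(C_j, i)}{2|E|} \log\left(\frac{\text{Vol}(C_j) + d_i}{2|E|}\right) \\
&+ \frac{\text{IntEdges}(C_i) - \text{IntEdgesNode}(C_i, i)}{2|E|} \log\left(\frac{\text{Vol}(C_i) - d_i}{2|E|}\right) \\
&- \frac{\text{IntEdges}(C_j)}{2|E|} \log\left(\frac{\text{Vol}(C_j)}{2|E|}\right) - \frac{\text{IntEdges}(C_i)}{2|E|} \log\left(\frac{\text{Vol}(C_i)}{2|E|}\right) \Bigg)
\end{aligned}
\tag{5.2}
$$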
where $\text{Vol}(C) = \sum_{i \in C} d_i$ is the sum of degrees of the nodes in the community C,
$\text{IntEdgesNode}(C, i) = \sum_{k \in C} A_{ik}$ is the number of edges node i has to other nodes in
community C, and $\text{IntEdges}(C) = \sum_{i,j \in C} A_{ij}$ is the number of internal edges in community C.
This gain can be computed in constant time by additionally pre-computing and maintaining
IntEdges(C) and Vol(C) for each community and updating them when a node is moved.
IntEdgesNode(C, i) is computed during the iteration over the neighbours of a node, thus
the time complexity of GreedyOptimizationQw−log−v is O(|E|). As the Louvain method
is a multi-level method whose time complexity depends on the greedy optimization step, its
complexity when used to optimize Qw−log−v is also O(|E|).
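The greedy step and its bookkeeping can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this thesis: the function name, the adjacency-dictionary input and the tolerance `1e-12` are hypothetical, and the graph is assumed to be simple and undirected. The sketch stores IntEdges(C) as the sum of A_ij over ordered pairs, so a move changes it by twice the number of links between the node and the community; this counting convention is an assumption of the sketch.

```python
import math


def greedy_q_w_log_v(adj):
    """Single-level greedy optimisation of Q_{w-log-v}; returns {node: community id}."""
    two_e = sum(len(nbrs) for nbrs in adj.values())  # 2|E|
    comm = {v: v for v in adj}             # each node starts in its own community
    vol = {v: len(adj[v]) for v in adj}    # Vol(C) per community
    internal = {v: 0 for v in adj}         # sum of A_ij inside C (ordered pairs)

    def term(w_sum, v_sum):
        # Contribution -w*log(v) of one community; empty or edgeless ones give 0.
        if w_sum <= 0 or v_sum <= 0:
            return 0.0
        return -(w_sum / two_e) * math.log(v_sum / two_e)

    improved = True
    while improved:
        improved = False
        for i in adj:
            src, d_i = comm[i], len(adj[i])
            # Number of links from i into each neighbouring community.
            links = {}
            for j in adj[i]:
                if j != i:
                    links[comm[j]] = links.get(comm[j], 0) + 1
            k_src = links.get(src, 0)
            # Change in the objective caused by removing i from its community.
            leave = (term(internal[src] - 2 * k_src, vol[src] - d_i)
                     - term(internal[src], vol[src]))
            best_gain, best = 0.0, src
            for dst, k_dst in links.items():
                if dst == src:
                    continue
                join = (term(internal[dst] + 2 * k_dst, vol[dst] + d_i)
                        - term(internal[dst], vol[dst]))
                if leave + join > best_gain + 1e-12:
                    best_gain, best = leave + join, dst
            if best != src:
                # Constant-time update of the maintained quantities.
                internal[src] -= 2 * k_src
                vol[src] -= d_i
                internal[best] += 2 * links[best]
                vol[best] += d_i
                comm[i] = best
                improved = True
    return comm


# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
comm = greedy_q_w_log_v(adj)
print(comm)
```

Every accepted move strictly increases the objective, so the loop terminates. Nodes only ever join existing neighbouring communities, so the number of communities never grows during the run.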
5.2.2 Optimizing Qw−log−v - Louvain method vs Single Step Greedy Optimization
Qw−log−v can be optimized using the greedy procedure outlined in Algorithm 3 to obtain a
partition of the graph into communities. As discussed in Chapter 4, the Louvain method is
a multi-level optimization procedure that uses the greedy optimization at every level. The
partition obtained at a level is used to generate a weighted graph for the next level by collapsing
the communities into nodes. This effectively results in the merging of small communities that
are formed at each level.
It is of interest to obtain small and homogeneous communities, thus the use of the multi-level
Louvain method may reduce the homogeneity of the communities by the merging of the small
communities at higher levels. An experimental validation of this is performed here.
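The collapsing step that produces the next-level graph can be sketched as below. This is an illustration under stated assumptions rather than the code used in the experiments: the function name `collapse`, the nested-dictionary weighted adjacency and the community labels are hypothetical. Each community becomes one node, internal edges become a self-loop, and edges between communities are summed.

```python
from collections import defaultdict


def collapse(weighted_adj, comm):
    """Build the next-level weighted graph with one node per community."""
    coarse = defaultdict(lambda: defaultdict(float))
    for u, nbrs in weighted_adj.items():
        coarse[comm[u]]  # make sure communities of isolated nodes are kept
        for v, w in nbrs.items():
            coarse[comm[u]][comm[v]] += w
    return {c: dict(nbrs) for c, nbrs in coarse.items()}


# Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3, unit weights.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
weighted_adj = defaultdict(dict)
for u, v in edges:
    weighted_adj[u][v] = 1.0
    weighted_adj[v][u] = 1.0

comm = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
coarse = collapse(weighted_adj, comm)
print(coarse)
```

Because the adjacency is stored symmetrically, each internal edge is seen from both endpoints, so the self-loop weight of a triangle is 6. Any subsequent level can only merge these collapsed nodes, which is why the multi-level procedure cannot recover the smaller communities found at the first level.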
Dataset and Evaluation
The procedure for generating the Datasets is described in Section 4.3. The Abilene WASH
1 Hour and 1 Day Trace Graphs are superimposed separately with a 1000 or 10000 node CHORD
Botnet to create 4 Datasets: WA-1H-CHO-1000, WA-1H-CHO-10000, WA-1D-CHO-1000 and
WA-1D-CHO-10000.
The objective is to study the homogeneity of the communities produced by the optimization
of Qw−log−v by the Louvain and the Greedy Optimization methods. In order to capture the
homogeneity, the average precision over the communities containing at least one bot is
computed. Let this set of communities be B.
$$\mu\text{Precision}_B(B) = \frac{1}{|B|} \sum_{C \in B} P(C) \tag{5.3}$$
where
$$\text{Precision}\quad P(C) = \frac{\text{\# of bots in community } C}{|C|}$$
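Equation 5.3 can be sketched in Python as follows. The function name and the toy node ids are assumptions made for the example; communities without any bot are skipped, as in the definition of the set B.

```python
def mean_bot_precision(communities, bots):
    """Average precision over the communities that contain at least one bot."""
    bots = set(bots)
    precisions = []
    for community in communities:
        members = set(community)
        n_bots = len(members & bots)
        if n_bots > 0:  # only communities in the set B are counted
            precisions.append(n_bots / len(members))
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example (made-up node ids): two communities contain bots, one does not.
communities = [[1, 2, 10], [3, 4], [11, 12]]
bots = [1, 2, 3, 4]
mu = mean_bot_precision(communities, bots)
print(mu)
```

Here the two bot communities have precision 2/3 and 1, so the mean is 5/6; the purely benign community does not affect the result.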
Results and Discussion
The Louvain method to optimize Qw−log−v and the GreedyOptimizationQw−log−v method
are applied to the WA-1H-CHO-1000, WA-1H-CHO-10000, WA-1D-CHO-1000 and
WA-1D-CHO-10000 datasets, and the results are tabulated in Table 5.1. P represents the set of
communities, whose size is denoted by |P|, and B ⊂ P represents the communities with at least
1 bot, whose size is denoted by |B|. The mean precision is computed for the set B according to
Equation 5.3.
It can be observed from Table 5.1a that the average precision µPrecisionB of the botnet com-
munities is less than 0.6 in all cases when Qw−log−v is optimized using the Louvain method.
From Table 5.1b it can be seen that µPrecisionB is greater than 0.6 in all cases, confirming the
hypothesis that the merging procedure reduces the homogeneity of the botnet communities.
Thus when homogeneous communities are desired, GreedyOptimizationQw−log−v (Algo-
rithm 3) is the better candidate. This also offers an additional improvement in computational
Dataset            |P|    |B|   µPrecisionB
WA-1H-CHO-1000     1720   24    0.37
WA-1H-CHO-10000    1429   185   0.59
WA-1D-CHO-1000     109    13    0.08
WA-1D-CHO-10000    84     31    0.45
(a) Optimization using Louvain

Dataset            |P|    |B|   µPrecisionB
WA-1H-CHO-1000     8753   74    0.67
WA-1H-CHO-10000    8165   735   0.74
WA-1D-CHO-1000     2809   65    0.74
WA-1D-CHO-10000    3150   559   0.85
(b) Single Step Greedy Optimization

Table 5.1: Community Structure obtained by optimization of Qw−log−v using the Louvain method on CHORD Botnets embedded in Abilene WASH Router Trace Graphs
speed when compared to the original Louvain method.
5.3 Differentiating between bot and benign communities
5.3.1 Properties of Structured P2P Botnets vs Properties of the Back-
ground
The subgraph corresponding to Structured P2P nodes is near regular, i.e. the nodes have
similar degrees, which also implies that it is assortative (Section 3.2.6), i.e. nodes connect to
other nodes with similar degrees. It has been shown by Yen and Reiter [79]
that assortativity makes the botnet robust and resistant to take down. This is in contrast with the
background graph, which is mostly dominated by nodes participating in client-server traffic.
The background graphs exhibit power law degree distributions (Section 3.2.4), indicating the
presence of hubs that attract a lot of connections. The hubs or high degree nodes are
popular servers that attract a lot of connections from Internet hosts, and they rarely connect to
each other. This results in the graph being disassortative, as the low degree nodes connect to
dissimilar higher degree nodes.
The subgraph corresponding to Structured P2P nodes should have a relatively higher value of
density, when compared to the rest of the graph as they have to be robust against node failures
by adding redundant paths.
5.3.2 Properties of the small and homogeneous communities obtained by
the greedy optimization of Qw−log−v
As discussed in Section 5.2.2, the single step greedy optimization of Qw−log−v results in a large
number of small communities. This can be viewed as the result of an incomplete or partial
optimization of the function. Owing to this incomplete optimization, the structured P2P botnet
will fragment into smaller pieces, and the high degree nodes or hubs tend to form small
communities, dragging in their immediate neighbourhood consisting of the adjacent nodes.
The nodes adjacent to hubs may be benign or bots as computers infected with bots also partic-
ipate in client server traffic that is not malicious. However the bot nodes, which are a part of
a structured P2P subgraph will behave differently from the benign nodes. The benign nodes
will attach to the hubs forming star-like communities. The subgraph corresponding to such
communities will be disassortative as well as relatively sparse.
The bot nodes will be pulled in to a community with other bot nodes owing to the high internal
connectivity and resist being moved into the community of the hub. At the same time the hub
will resist joining the community of the bot nodes as it will tend to increase the total degree of
the community owing to its high degree. Due to the symmetry associated with the near regular
nature of the whole structured P2P graph, the pieces of the structured P2P botnet will tend to
break into fragments that are also near regular. This makes the subgraph corresponding to the
pieces or fragments of the botnet assortative, and relatively denser than the star-like
communities owing to the presence of redundant paths between the nodes.
Thus in order to differentiate between communities of bots and communities of benign nodes,
the differences between the density and the assortativity of the corresponding subgraphs can
be exploited.
5.3.3 Mean Regular Degree mreg
A novel measure called mean regular degree or mreg is defined in this thesis, which measures
the mean number of connections of the node to other nodes with similar degrees.
$$\text{mreg}(G) = \frac{1}{|V|} \sum_{i,j \in E} \frac{1}{1 + |d_i - d_j|} \tag{5.4}$$
The measure mreg accounts for the degree correlation or assortativity (Section 3.2.6) of a
graph through the denominator, which depends on the difference between the degrees of the two
endpoints of an edge. Because the sum runs over all the edges, the measure also grows with the
total degree of the community, thereby accounting for the density. Evaluating the metric takes
O(|E|) time, as each edge is considered only once during the computation.
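As an illustration of Equation 5.4, the following is a minimal Python sketch, not the implementation used in this thesis. It assumes an undirected simple graph supplied as a list of (u, v) pairs with each edge listed once; the function name `mean_regular_degree` and the optional `nodes` argument (for including isolated nodes in |V|) are hypothetical. Since the sum in Equation 5.4 visits every edge in both directions, each listed edge contributes twice.

```python
from collections import defaultdict


def mean_regular_degree(edges, nodes=None):
    """Sketch of mreg (Equation 5.4) for an undirected simple graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    nodes: optional iterable of all nodes (assumption: lets isolated
           nodes count towards |V|); defaults to the edge endpoints.
    """
    edges = list(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    n = len(set(nodes)) if nodes is not None else len(degree)
    if n == 0:
        return 0.0

    total = 0.0
    for u, v in edges:
        # The sum over i,j in E counts (i, j) and (j, i), hence the factor 2.
        total += 2.0 / (1 + abs(degree[u] - degree[v]))
    return total / n


ring = [(i, (i + 1) % 6) for i in range(6)]      # 2-regular graph
star = [(0, i) for i in range(1, 6)]             # hub 0 with 5 leaves
print(mean_regular_degree(ring), mean_regular_degree(star))
```

Both passes over the edge list are linear, which matches the O(|E|) cost stated above.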
Properties of mean regular degree
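For a k-regular graph (the degree of each node is k), every term of the sum equals 1 and each
undirected edge is counted once in each direction, so
\[ m_{reg} = \frac{1}{|V|} \sum_{i,j \in E} 1 = \frac{2|E|}{|V|} = \frac{k|V|}{|V|} = k \]
Thus a ring graph (k = 2) of any size will have mreg = 2, and a clique or complete graph of
size |V|, which is (|V| − 1)-regular, will have mreg = |V| − 1. Thus the value of mreg
increases as the degree of each node increases.
For a star graph of size |V|, the hub has degree |V| − 1 and every other node has degree 1, so
\[ m_{reg} = \frac{1}{|V|} \sum_{i,j \in E} \frac{1}{1 + |V| - 2} = \frac{1}{|V|} \cdot \frac{2|E|}{|V| - 1} = \frac{2(|V| - 1)}{|V|(|V| - 1)} = \frac{2}{|V|} \]
Thus for a star graph, the value of mreg decreases as the size of the graph increases.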
A dyad (two nodes with a single edge) which is both a 1-regular graph as well as a star graph
of 2 nodes will have mreg = 1. A graph that is larger than 2 nodes, and is star-like will have
small values of mreg < 1.
5.4 Robust and efficient method to identify nodes that are
part of structured P2P Botnet
A recent paper by Iliofotou et al. [28] proposed a method to classify different types of traffic
flows based only on the topological characteristics of the IP-IP traffic graph. That work
exploits the homophily (like-attracts-like) among flows of the same traffic class. Given a set
of flows that have been labelled, they classify traffic using a method based on community
detection and a homophily-based link classifier. They apply the Louvain method for community
detection recursively to obtain smaller, homogeneous clusters containing predominantly a
single type of seed. Their algorithm is also able to classify P2P traffic given seed flows
labelled as P2P. Their aim is traffic flow classification.
In this thesis the aim is to specialize their approach to detect structured P2P botnets by
exploiting the topological characteristics of structured P2P subgraphs instead of relying on
labelled flows. Further, the recursive application of the Louvain method is avoided by using
greedy optimization of Qw−log−v. This motivates a method that relies on the homogeneous
communities thus obtained and uses mreg to discard unlikely communities and arrive at a set of
nodes that are part of a structured P2P botnet.
The proposed algorithm consists of two stages. In the first stage, communities are obtained
by greedy optimization of Qw−log−v followed by the discarding of benign communities using
mreg. In the second stage, the communities retained from the first stage are collapsed to form
a weighted graph, and greedy modularity optimization is carried out followed by another fil-
tering of communities in order to obtain a final set of nodes that are a part of a structured P2P
Botnet.
5.4.1 Stage 1
Stage 1 of the proposed algorithm runs GreedyOptimizationQw−log−v. This results in
homogeneous communities (Section 5.3.2). The mreg for each community is obtained by
extracting its corresponding subgraph and computing it according to Equation 5.4. Directly
retaining only the communities with mreg > 1 (based on the properties in Section 5.3.3) would
result in poor recall: owing to the incomplete optimization (Section 5.3.2), some botnet
communities may be too small, with values of mreg < 1, and thus indistinguishable from benign
communities. These communities therefore have to be given another chance to merge with other
bot communities and increase their mreg value. Accordingly, communities with mreg greater than
the median value of mreg are selected for the next stage. This median value will be less than
1 as the structured P2P bots make up less than 50% of the total nodes of the graph. The median
value of mreg is preferred over the mean over all community subgraphs because the sizes of the
communities are skewed, so the mean mreg would be affected strongly by the larger and more
regular communities. Stage 1 is outlined in Algorithm 4.
Algorithm 4: Proposed Algorithm - Stage 1
Input : The graph G
Output: A set of selected candidate bot communities P_selected
begin
    P ← GreedyOptimizationQw−log−v(G)
    for community C_i ∈ P do
        G_Ci = SubGraph(G, C_i)
        m_Ci = mreg(G_Ci)
    end
    m_med = Median(m_C1 ... m_C|P|)
    P_selected = {C_i : m_Ci > m_med}
end
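The selection step of Algorithm 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the partition is taken as given (the greedy optimization of Qw−log−v is not reproduced here), the graph is an undirected edge list, and the helper names `mreg_of_community` and `select_candidates` are hypothetical.

```python
from statistics import median


def mreg_of_community(edges, community):
    """mreg (Equation 5.4) of the subgraph induced by `community` (a set)."""
    if not community:
        return 0.0
    internal = [(u, v) for u, v in edges if u in community and v in community]
    degree = {node: 0 for node in community}
    for u, v in internal:
        degree[u] += 1
        degree[v] += 1
    # Each undirected edge is counted in both directions, hence the factor 2.
    total = sum(2.0 / (1 + abs(degree[u] - degree[v])) for u, v in internal)
    return total / len(community)


def select_candidates(edges, partition):
    """Keep the communities whose mreg is strictly above the median mreg."""
    scores = [mreg_of_community(edges, c) for c in partition]
    if not scores:
        return []
    m_med = median(scores)
    return [c for c, s in zip(partition, scores) if s > m_med]


# Toy example: one ring community and two star communities.
ring = [(i, (i + 1) % 6) for i in range(6)]
star_a = [(10, i) for i in range(11, 16)]
star_b = [(20, i) for i in range(21, 25)]
partition = [set(range(6)), set(range(10, 16)), set(range(20, 25))]
print(select_candidates(ring + star_a + star_b, partition))
```

In the toy example only the ring community lies above the median, so the star-like communities are dropped before Stage 2.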
5.4.2 Stage 2
In order to allow some small botnet communities to merge, GreedyOptimizationQModularity
is run on the weighted graph obtained by collapsing the communities selected in Stage 1.
Here QModularity is used as the objective function as it allows for larger communities than
Qw−log−v. The aggressive merging caused by the optimization of modularity is reduced by the
removal in Stage 1 of many hubs, which connect several communities together and play a role in
the merging of communities. The communities thus obtained are communities of the weighted
graph, and are converted back to communities of the original graph by expanding each node of
the weighted graph.
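The collapse and expansion steps can be illustrated with the sketch below. The actual Community Aggregation and Partition Expansion procedures are those of Section 4.2 and are not reproduced here; this sketch assumes a Louvain-style aggregation in which the weight between two super-nodes is the number of original edges between the two communities, and internal edges become a self-loop. The function names `aggregate` and `expand` are hypothetical.

```python
from collections import defaultdict


def aggregate(edges, partition):
    """Collapse each community into one super-node.

    Returns {(a, b): weight} with a <= b, where a and b index communities
    in `partition`. Assumption: weight = number of original edges between
    the two communities; (a, a) is a self-loop holding internal edges.
    """
    node_to_comm = {node: idx for idx, comm in enumerate(partition) for node in comm}
    weights = defaultdict(int)
    for u, v in edges:
        if u not in node_to_comm or v not in node_to_comm:
            continue  # an endpoint belongs to a community discarded in Stage 1
        a, b = sorted((node_to_comm[u], node_to_comm[v]))
        weights[(a, b)] += 1
    return dict(weights)


def expand(partition, super_partition):
    """Map groups of super-nodes back to communities of original nodes."""
    return [set().union(*(partition[idx] for idx in group)) for group in super_partition]


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (9, 0)]
partition = [{0, 1}, {2, 3, 4}]
print(aggregate(edges, partition))
print(expand(partition, [[0, 1]]))
```

Edges touching nodes outside the retained communities, such as (9, 0) in the toy input, are simply ignored by this sketch.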
Finally, the filtering step needs to be applied to separate out communities that are not very
regular. As discussed in Section 5.3.3, communities of bots, being assortative and denser, will
have values of mreg > 1, whereas the star-like communities of benign nodes will have values
of mreg < 1, making this a good rule for filtering out benign communities. The communities with
mreg ≤ 1 can be discarded at this stage.
Algorithm 5: Proposed Algorithm - Stage 2
Input : The graph G, the set of selected communities from Stage 1 P_selected
Output: A set C_final consisting of structured P2P botnet nodes
begin
    G_w ← CommunityAggregation(G, P_selected)
    P_w ← GreedyOptimizationQModularity(G_w)
    P ← PartitionExpansion(G, G_w, P_w)
    for community C_i ∈ P do
        G_Ci = SubGraph(G, C_i)
        if mreg(G_Ci) > 1 and σdeg(C_i) < µdeg(C_i) then
            Add all nodes in C_i to C_final
        end
    end
end
Certain non-regular communities may nevertheless have a value of mreg ≥ 1, because mreg is
also proportional to the internal degree of the community (Section 4.4.1). As structured P2P
botnets are near regular, the mean internal degree
\[ \mu_{deg}(G_C) = \frac{1}{|V_C|} \sum_{i=1}^{|V_C|} d_i \]
and the standard deviation
\[ \sigma_{deg}(G_C) = \sqrt{\left( \frac{1}{|V_C|} \sum_{i=1}^{|V_C|} d_i^2 \right) - \mu_{deg}(G_C)^2} \]
of the degrees in the subgraph corresponding to each community are computed, and communities
with σdeg(GC) ≥ µdeg(GC) are discarded. Stage 2 is outlined in Algorithm 5. The Community
Aggregation and the Partition Expansion steps are discussed in Section 4.2.
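The final filter of Stage 2 can be sketched as below. This is an illustrative sketch, assuming an undirected edge list and a community given as a set of nodes; the names `degree_stats` and `passes_stage2` are hypothetical, and the mreg value of the community is assumed to have been computed separately according to Equation 5.4.

```python
from math import sqrt


def degree_stats(edges, community):
    """Mean and standard deviation of internal degrees of a community."""
    degree = {node: 0 for node in community}
    for u, v in edges:
        if u in degree and v in degree:
            degree[u] += 1
            degree[v] += 1
    n = len(degree)
    if n == 0:
        return 0.0, 0.0
    mu = sum(degree.values()) / n
    variance = sum(d * d for d in degree.values()) / n - mu * mu
    return mu, sqrt(max(variance, 0.0))  # clamp tiny negative rounding errors


def passes_stage2(edges, community, mreg_value):
    """Retain a community only if mreg > 1 and sigma_deg < mu_deg."""
    mu, sigma = degree_stats(edges, community)
    return mreg_value > 1 and sigma < mu


ring = [(i, (i + 1) % 6) for i in range(6)]
star = [(0, i) for i in range(1, 10)]
print(degree_stats(ring, set(range(6))), degree_stats(star, set(range(10))))
```

For the 10-node star the degree spread exceeds the mean degree, so such a community would be rejected by the second condition whatever its mreg value.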
5.5 Evaluation
In this section a comprehensive evaluation of the proposed method will be carried out and its
performance will be compared to that of BotGrep[44]. The dataset generation procedure is
described in Section 4.3.
5.5.1 Metrics
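The metrics used for evaluation consider the set Cfinal obtained from the proposed algorithm
and are given by
\[ \text{Precision} = \frac{\#\text{ of bots in } C_{final}}{|C_{final}|} \tag{5.5} \]
\[ \text{Recall} = \frac{\#\text{ of bots in } C_{final}}{\text{Total } \#\text{ of bots}} \tag{5.6} \]
\[ \text{FScore} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5.7} \]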
Precision measures the purity of the set Cfinal, Recall measures the detection rate, and the
FScore, being the harmonic mean of Precision and Recall, provides a single summary statistic to
evaluate performance.
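Equations 5.5 to 5.7 translate directly into code. The sketch below assumes that the detected set and the ground-truth set of bots are available as collections of node identifiers; the function name `precision_recall_fscore` and the convention of returning 0 for empty sets are assumptions made for illustration.

```python
def precision_recall_fscore(detected, bots):
    """Precision, Recall and FScore of a detected node set (Eqs. 5.5-5.7)."""
    detected, bots = set(detected), set(bots)
    true_positives = len(detected & bots)
    # Assumption: empty denominators yield 0 rather than raising an error.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(bots) if bots else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Four nodes flagged, three of them among the five real bots.
print(precision_recall_fscore({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```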
5.5.2 Performance on Abilene Trace Graphs
The proposed method and BotGrep are tested on the Abilene 1 hour and 1 day traces. The
CHORD, KOORDE and KADEMLIA topologies of 1000 and 10000 nodes are considered
(Section 4.3.3), and each of the 6 botnet graphs generated is embedded in each of the 4
background graphs WA-1H, CH-1H, WA-1D and CH-1D (Section 4.3.2) to create 24 distinct datasets
of one embedded botnet graph per background. The proposed method and BotGrep are run on
these datasets. The FScores (computed using Equation 5.7) achieved by the two algorithms
are compared in Figure 5.1. The Precision (Equation 5.5) and Recall (Equation 5.6) values are
compared in Table 5.2.
1 Hour Traces
It can be observed from Figure 5.1a that the proposed method achieves FScores of around 0.9,
which is slightly inferior yet comparable to BotGrep, which achieves FScores > 0.95. The
slightly lower performance of the proposed method is due to its precision values (Table 5.2a),
which are between 0.8 and 0.9 in most cases, whereas BotGrep achieves a precision of 1 in most
cases. The recall values of both methods are almost the same, at around 0.95 in most cases.
Figure 5.1: Performance comparison of the proposed method and BotGrep on Abilene Trace
Graphs - FScore. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.84    1.00        0.95        0.88         0.96
CHO-10000   8.39    1.00        0.95        0.91         0.95
KOO-1000    0.84    1.00        0.97        0.87         0.94
KOO-10000   8.39    0.99        0.94        0.91         0.93
KAD-1000    0.84    1.00        0.96        0.81         0.96
KAD-10000   8.39    0.99        0.95        0.90         0.96
(a) WA-1H
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.48    1.00        0.34        0.97         0.96
CHO-10000   4.85    1.00        0.03        0.95         0.96
KOO-1000    0.48    0.00        0.00        0.96         0.96
KOO-10000   4.85    1.00        0.08        0.97         0.95
KAD-1000    0.48    1.00        0.54        0.83         0.96
KAD-10000   4.85    1.00        0.34        0.97         0.96
(b) CH-1H
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.46    1.00        0.77        0.87         0.94
CHO-10000   4.60    0.98        0.79        0.97         0.92
KOO-1000    0.46    1.00        0.61        0.96         0.94
KOO-10000   4.60    1.00        0.76        0.98         0.90
KAD-1000    0.46    1.00        0.77        0.96         0.94
KAD-10000   4.60    0.98        0.80        0.97         0.93
(c) WA-1D
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.34    1.00        0.96        0.70         0.86
CHO-10000   3.36    1.00        0.35        0.97         0.87
KOO-1000    0.34    1.00        0.97        0.85         0.87
KOO-10000   3.36    1.00        0.40        0.95         0.80
KAD-1000    0.34    1.00        0.54        0.74         0.88
KAD-10000   3.36    1.00        0.43        0.97         0.89
(d) CH-1D
Table 5.2: Performance comparison of the proposed method and BotGrep on Abilene Trace
Graphs - Precision (P) and Recall (R)
Figure 5.1b indicates that in the case of the CHIC-1H dataset, the proposed method outperforms
BotGrep, with FScores above 0.95 in most cases as compared to BotGrep, which achieves
FScores less than 0.6 in most cases. Table 5.2b indicates that the poor performance of BotGrep
is due to its poor recall (< 0.5 in most cases).
1 Day Traces
Figure 5.1c indicates that in the case of the WASH-1D dataset, the FScores achieved by the
proposed algorithm (above 0.9 in all cases) are comparable to those of BotGrep (just under 0.9
in most cases). Table 5.2c indicates that the proposed algorithm outperforms BotGrep in recall
(at least 0.9 in all cases versus at most 0.8 in all cases). The precision of the two
algorithms is comparable, with BotGrep having slightly better precision (0.98-1) than the
proposed method (0.87-0.98).
It can be observed from Figure 5.1d that the proposed method achieves higher FScores (above
0.8) than BotGrep (0.7 or below) in most cases. Table 5.2d indicates that this is owing to the
lower recall achieved by BotGrep (below 0.6 in most cases) as compared to the proposed
algorithm (above 0.85 in most cases). In the case of precision, BotGrep achieves 1, while the
proposed method is comparable for the larger botnets (0.95 or above) but lower for the smaller
botnets (0.70 to 0.85).
Summary
The performance on the Abilene datasets is summarised in Table 5.3. On the Abilene trace
graphs the proposed method performs comparably to or better than BotGrep. Thus the limitations
of Stability Optimization and Modularity Optimization in detecting structured P2P botnets
(Chapter 4) have been overcome.
Botnet Size   Background Density   Background   Proposed Algorithm vs BotGrep
Small         Sparse               WA-1H        Comparable
Small         Sparse               CH-1H        Better
Small         Dense                WA-1D        Comparable
Small         Dense                CH-1D        Better
Large         Sparse               WA-1H        Comparable
Large         Sparse               CH-1H        Better
Large         Dense                WA-1D        Comparable
Large         Dense                CH-1D        Better
Table 5.3: Summary of the proposed method with reference to BotGrep on the Abilene traces
(FScore)
In the next section the robustness of the algorithm under conditions of reduced visibility, as
well as its performance on a sparse, harder to detect topology, is studied.
5.6 Robustness of the proposed algorithm
5.6.1 Robustness under conditions of partial visibility
In order to test the method’s robustness under conditions of partial visibility, 40% of the edges
of the botnet graph are removed before the embedding stage. The figure of 40% follows from the
finding that 60% of the botnet flows can be observed if monitors are deployed at each of the
Tier 1 Internet Service Providers, which is based on Storm Botnet IP address distributions
in [44] and simulations in [77]. The FScores achieved by the two algorithms are compared in
Figure 5.2. The Precision and Recall values are compared in Table 5.4. The datasets adopted
are otherwise the same as in Section 5.5.2. The metrics are evaluated as per Section 5.5.1.
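The edge-removal step can be sketched as follows. The text does not state how the removed edges are chosen, so this sketch assumes they are sampled uniformly at random; the function name `drop_edges` and the fixed seed are likewise hypothetical.

```python
import random


def drop_edges(edges, fraction=0.4, seed=0):
    """Return the botnet edges left after removing `fraction` of them.

    Assumption: edges to remove are chosen uniformly at random.
    """
    edges = list(edges)
    rng = random.Random(seed)  # fixed seed for repeatable experiments
    keep = len(edges) - round(fraction * len(edges))
    return rng.sample(edges, keep)


# Toy example: a 100-node ring loses 40 of its 100 edges.
ring = [(i, (i + 1) % 100) for i in range(100)]
visible = drop_edges(ring)
print(len(visible))
```

The reduced edge list would then be embedded in the background graph in the same way as the full botnet graph.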
1 Hour Traces
It can be observed from Figure 5.2a that the proposed method achieves FScores around 0.9,
which is slightly inferior yet comparable to BotGrep, which achieves FScores > 0.95, as in the
case of complete visibility. The slightly lower performance of the proposed method is due to
its precision values (Table 5.4a), which are between 0.85 and 0.95 compared to BotGrep, which
achieves a precision of 0.99 in most cases, as well as its recall values (around 0.9 versus
> 0.9). The partial visibility thus affects the recall in this case.
Figure 5.2b indicates that for the CHIC-1H dataset, as in the case of complete visibility, the
proposed method outperforms BotGrep, with FScores above 0.9 in most cases as compared to
BotGrep, which achieves FScores less than 0.5 in most cases. Table 5.4b indicates that the
poor performance of BotGrep is due to its poor recall (< 0.3 in most cases).
1 Day Traces
Figure 5.2c indicates that for the WASH-1D dataset, the performance is again similar to the
case of complete visibility. The FScores achieved by the proposed algorithm (around 0.9 in
most cases) are comparable to those of BotGrep (about 0.8 in most cases). Table 5.4c indicates
that the proposed algorithm outperforms BotGrep in recall (> 0.8 in most cases versus about
0.7 in all cases), although the recall of both algorithms has been affected by the partial
visibility. The precision of the two algorithms is found to be comparable.
It can be observed from Figure 5.2d that the proposed method achieves lower FScores (between
0.6 and 0.8 in most cases) than BotGrep (above 0.9 in most cases). Table 5.4d indicates that
this is owing to the lower recall achieved by the proposed algorithm (below 0.8 in all cases)
as compared to BotGrep (> 0.85 in most cases). In the case of precision, BotGrep achieves at
least 0.94 in all cases, whereas the proposed method achieves lower precision for the smaller
botnets. This is expected, as the method depends on the density differences between the botnet
and the background: CHIC-1D is the densest background graph among the datasets considered
(Section 4.3.2), and the removal of 40% of the edges of the small botnets sharply reduces the
density of the botnet, making it harder to detect. For the larger botnets it achieves
precision > 0.9.
Figure 5.2: Performance comparison of the proposed method and BotGrep on Abilene Trace
Graphs under conditions of partial visibility - FScore. Panels: (a) WASH-1H, (b) CHIC-1H,
(c) WASH-1D, (d) CHIC-1D
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.84    0.98        0.94        0.87         0.91
CHO-10000   8.39    0.99        0.93        0.93         0.90
KOO-1000    0.84    0.99        0.95        0.85         0.92
KOO-10000   8.39    0.99        0.92        0.95         0.88
KAD-1000    0.84    0.99        0.93        0.89         0.88
KAD-10000   8.39    0.99        0.93        0.93         0.90
(a) WA-1H
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.48    1.00        0.13        0.83         0.94
CHO-10000   4.85    1.00        0.01        0.96         0.93
KOO-1000    0.48    1.00        0.24        0.93         0.95
KOO-10000   4.85    1.00        0.05        0.96         0.93
KAD-1000    0.48    0.00        0.00        0.92         0.94
KAD-10000   4.85    1.00        0.01        0.96         0.93
(b) CH-1H
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.46    1.00        0.69        0.96         0.80
CHO-10000   4.60    0.95        0.71        0.99         0.77
KOO-1000    0.46    1.00        0.69        0.97         0.85
KOO-10000   4.60    0.95        0.68        0.99         0.83
KAD-1000    0.46    0.97        0.69        0.85         0.81
KAD-10000   4.60    0.98        0.72        0.98         0.88
(c) WA-1D
Botnet      %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000    0.34    0.94        0.87        0.76         0.62
CHO-10000   3.36    0.96        0.89        0.98         0.78
KOO-1000    0.34    0.97        0.88        0.70         0.74
KOO-10000   3.36    0.97        0.84        0.94         0.64
KAD-1000    0.34    1.00        0.25        0.62         0.51
KAD-10000   3.36    0.97        0.89        0.94         0.64
(d) CH-1D
Table 5.4: Performance comparison of the proposed method and BotGrep on Abilene Trace
Graphs under conditions of partial visibility - Precision (P) and Recall (R)
Summary
Table 5.5 summarises the performance of the algorithm under conditions of partial visibility
on the Abilene Trace Graphs. Thus in cases of partial visibility the algorithm is still
reasonably robust, but as the background becomes denser the performance suffers.
Botnet Size   Background Density   Background   Proposed Algorithm vs BotGrep
Small         Sparse               WA-1H        Comparable
Small         Sparse               CH-1H        Better
Small         Dense                WA-1D        Comparable
Small         Dense                CH-1D        Worse
Large         Sparse               WA-1H        Comparable
Large         Sparse               CH-1H        Better
Large         Dense                WA-1D        Comparable
Large         Dense                CH-1D        Worse
Table 5.5: Performance (FScore) summary of the proposed method with reference to BotGrep
on the Abilene traces under conditions of partial visibility
5.6.2 Performance on the LEET-Chord Topology
The LEET-Chord structured P2P topology was proposed by Jelasity and Bilicki [30] to
demonstrate that there are techniques to make structured P2P traffic harder to detect. The
topology is a modification of the CHORD topology, consisting of clusters of CHORD graphs of
size $\log_2 |V|$ connected to each other, with the restriction that only one link (the long
range link) exists between any two clusters. They also propose a method of clustering so as to
minimize the number of different networks touched by the clusters, highlighting the limited
effectiveness of local approaches. This topology is significantly sparser than the other P2P
topologies considered, and should therefore be harder to detect. The performance of the
proposed method is evaluated by embedding LEET-Chord graphs of 1000 and 10000 nodes in the
same Abilene background graphs: WASH-1H, CHIC-1H, WASH-1D and CHIC-1D. The metrics are
evaluated as per Section 5.5.1.
From Figure 5.3 and Table 5.6 it can be observed that in all datasets the proposed method is
able to detect the LEET-Chord botnet with an FScore above 0.85, with recall above 0.85 and
precision over 0.9 in most cases. This is in sharp contrast to BotGrep. When using the default
parameters according to the original BotGrep paper, very low accuracies were obtained. However,
the original paper reports very high accuracies (> 90%) for the LEET-Chord topology. It may be
possible to tweak the parameters in order to obtain better results on the datasets considered.
Figure 5.3: Performance comparison of the proposed method and BotGrep on LEET-Chord
graphs embedded in Abilene Trace Graphs - FScore. Panels: (a) WASH-1H, (b) CHIC-1H,
(c) WASH-1D, (d) CHIC-1D
Botnet            %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
WA-1H-LC-1000     0.84    0.29        0.86        0.82         0.91
WA-1H-LC-10000    8.39    0.75        0.49        0.89         0.94
CH-1H-LC-1000     0.48    0.00        0.00        0.91         0.92
CH-1H-LC-10000    4.85    0.77        0.05        0.94         0.95
WA-1D-LC-1000     0.46    0.51        0.13        0.98         0.86
WA-1D-LC-10000    4.60    1.00        0.02        0.98         0.88
CH-1D-LC-1000     0.34    0.15        0.22        0.98         0.79
CH-1D-LC-10000    3.36    0.89        0.36        0.95         0.87
Table 5.6: Performance comparison of the proposed method and BotGrep on LEET-Chord
graphs embedded in Abilene Trace Graphs - Precision (P) and Recall (R)
5.6.3 Efficiency and Scalability
In this section the runtimes of the proposed method and BotGrep are compared. The runtime
of the proposed method is dominated by the first call to GreedyOptimizationQw−log−v
(Algorithm 3), which takes O(|E|) time as the number of iterations t is typically not a
function of the size of the graph.
In order to test the running time of the proposed method and compare it with BotGrep, the
Abilene background datasets are considered, and a CHORD graph of size 1000 is embedded in each
of them. All experiments are done on a machine with a 32-core AMD Opteron 2.4 GHz processor
and 128 GB of RAM; only a single core is used. In Figure 5.4, the runtime is plotted against
the size of the graph. It can be clearly observed from the figure that the proposed method
scales significantly better than BotGrep, being about 350 times faster on graphs with more than
10 million edges.
Figure 5.4: Runtime comparison of proposed method and BotGrep: Abilene 1 Day Traces and
1000 Node CHORD Botnet
Performance on the CAIDA Datasets
In the previous section, it was shown that the proposed method outperforms BotGrep in terms of
runtime. In this section the method is tested on the CAIDA-CH and CAIDA-SJ datasets of 2.7
million and 8.2 million nodes respectively (Section 4.3.2). CHORD, KADEMLIA, KOORDE
and LEET-CHORD graphs of 10000 and 100000 nodes are considered for the following
experiments. The Precision, Recall and FScores (computed as per Section 5.5.1) are shown in
Table 5.7. The X-Means algorithm used by BotGrep in the prefiltering step is not able to scale
to handle these graphs on the tested machine. In the case of the proposed method, the time
taken to process the largest graph (56 million directed edges) is less than 20 minutes on a
single core.
Table 5.7 indicates that the proposed method achieves good FScores, above 0.9 in all but one
case, on the CAIDA-CH trace graph, and FScores above 0.85 for the 100000 node botnets on the
CAIDA-SJ trace graph. The lower accuracy observed for the 10000 node botnets in the latter
(FScores between 0.58 and 0.88) is due to the very small percentage of bots (about 0.1% of the
dataset), which affects the precision of the algorithm (about 0.5 in the worst cases), but the
number of false positives is still very small compared to the size of the whole graph (0.1%).
CAIDA CHICAGO Graphs
Dataset                 %Bots   Precision   Recall   FScore
CAIDA-CH-CHO-10000      0.37    0.94        0.93     0.93
CAIDA-CH-CHO-100000     3.7     1           0.92     0.96
CAIDA-CH-KOO-10000      0.37    0.9         0.95     0.92
CAIDA-CH-KOO-100000     3.7     0.97        0.94     0.96
CAIDA-CH-KAD-10000      0.37    0.62        0.94     0.75
CAIDA-CH-KAD-100000     3.7     0.95        0.93     0.94
CAIDA-CH-LC-10000       0.37    1           0.95     0.97
CAIDA-CH-LC-100000      3.7     1           0.93     0.96
CAIDA SANJOSE Graphs
Dataset                 %Bots   Precision   Recall   FScore
CAIDA-SJ-CHO-10000      0.12    0.56        0.95     0.7
CAIDA-SJ-CHO-100000     1.2     0.97        0.95     0.96
CAIDA-SJ-KOO-10000      0.12    0.42        0.94     0.58
CAIDA-SJ-KOO-100000     1.2     0.83        0.93     0.88
CAIDA-SJ-KAD-10000      0.12    0.83        0.93     0.88
CAIDA-SJ-KAD-100000     1.2     1           0.92     0.96
CAIDA-SJ-LC-10000       0.12    0.72        0.96     0.82
CAIDA-SJ-LC-100000      1.2     0.97        0.94     0.95
Table 5.7: Performance of the Proposed Method on CAIDA Datasets - Precision, Recall and
FScore
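The relation between low precision and the absolute number of false positives can be checked with a short calculation. Since Precision = TP / (TP + FP), it follows that FP = TP (1 − Precision) / Precision, with TP = Recall × (number of bots). The sketch below applies this to the CAIDA-SJ-CHO-10000 row of Table 5.7, using the 8.2 million node count quoted for CAIDA-SJ; the function name `false_positive_share` is hypothetical, and the result is only an estimate because the table values are rounded.

```python
def false_positive_share(precision, recall, num_bots, num_nodes):
    """Estimate the false positive count and its share of all nodes."""
    true_positives = recall * num_bots
    # From precision = TP / (TP + FP): FP = TP * (1 - precision) / precision
    false_positives = true_positives * (1 - precision) / precision
    return false_positives, false_positives / num_nodes


# CAIDA-SJ-CHO-10000: precision 0.56, recall 0.95, 10000 bots, 8.2M nodes.
fp, share = false_positive_share(0.56, 0.95, 10_000, 8_200_000)
print(round(fp), f"{share:.2%}")
```

Under these rounded inputs the estimate is on the order of a few thousand false positives, which is roughly a tenth of a percent of the graph.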
5.7 Conclusion
In this chapter a robust and efficient method to detect nodes that are part of a structured
P2P botnet, given a traffic graph, was proposed. The proposed algorithm exploits the small but
homogeneous communities obtained by a greedy optimization of the function Qw−log−v, proposed
and successfully applied to Protein Interaction Networks by Van Laarhoven and Marchiori [73].
The greedy optimization is shown to produce more homogeneous communities than the optimization
of Qw−log−v using the Louvain method, and is thus chosen for use in the proposed algorithm.
The differences in the topological properties, namely assortativity and density, of structured
P2P botnet communities and benign communities were discussed. In order to exploit these
differences, a novel measure, mean regular degree mreg, was proposed, which captures the
assortativity and the density of a graph, and the properties of mreg were studied.
The proposed method is a two stage algorithm that uses greedy optimization of Qw−log−v
followed by a filtering of communities with low values of mreg in the first stage. The second
stage involves greedy optimization of Modularity and a second filtering of communities in
order to detect the set of nodes likely to be structured P2P bots.
The proposed method is extensively validated, and found to be comparable in performance
with BotGrep. It is found to be reasonably robust to conditions of partial visibility, barring
the case of very dense background graphs, and achieves good performance on the harder to detect
LEET-Chord topology.
The runtime of the algorithm is found to be 300 times lower than that of BotGrep on graphs of
tens of millions of edges. The algorithm is able to handle an 8 million node and 50 million
edge graph in roughly 20 minutes on a single core.
Chapter 6
Summary and Conclusions
6.1 Summary of Contributions
The summary of the major contributions and conclusions from this thesis are provided below.
6.1.1 Efficiency Comparison of Community Detection Algorithms
This thesis surveys the popular community detection algorithms proposed in literature. Al-
gorithms with low theoretical time complexities such as Label Propagation [54], Infomap
[56] and Louvain Method [3] have been implemented and compared on large LFR benchmark
graphs to study their efficiency.
6.1.2 Detection of Structured P2P Botnets using the Louvain Method
From the efficiency comparison of Community Detection Algorithms, the Louvain method is
selected to detect structured P2P botnets, as it was found to perform the fastest.
The Louvain method allows multiple objective functions to be optimized in a multi-level
greedy process. The method originally used Modularity as the objective function. This has
been applied in this thesis to detect structured P2P botnets on datasets generated
synthetically as per [44]. The dataset generation process involves generating topology graphs
of CHORD [65], KOORDE [33] and KADEMLIA [42] and embedding them in a background graph
constructed from real-world network traces. The traces were from two different sources. The
first set of traces comprises NetFlow data captured at core routers of the Abilene ISP. The
second set comprises packet traces captured at an Internet Point of Presence (PoP). It is found
that the performance of Modularity maximization using the Louvain method is comparable to that
of BotGrep in the cases of sparse background graphs. In the cases where the density of the
background is high, this method resulted in a large number of benign nodes being detected
as bots. This is due to the resolution limit of modularity and its preference for large,
loosely connected communities.
In order to overcome this limitation the Louvain method is then used to optimize Stability, a
multiresolution objective function proposed by Lambiotte et al. [36]. This requires the setting
of a resolution parameter t to control the size of the communities. This is then applied to
detect structured P2P botnets. Although for certain values of t the performance is comparable
to that of BotGrep for small botnets, several runs of the method at different values of t were
needed in order to detect all sizes of botnets, increasing the computational cost.
6.1.3 Robust and Efficient method to detect Structured P2P Botnets
It is found that neither Modularity Optimization nor Stability Optimization is robust or
general. However, the optimization of Stability at values of t < 1 is found to result in small
but homogeneous communities. In order to overcome the limitations of setting the parameter t, a
third objective function, Qw−log−v, proposed by Van Laarhoven and Marchiori [73], is
considered. This objective function has previously been used successfully in the case of
Protein Interaction Networks, and is used in this thesis to detect structured P2P botnets for
the first time. It is also shown that a single-step greedy optimization of Qw−log−v results in
more homogeneous communities than the optimization of the same function using a multi-level
method such as Louvain.
The differences in the topological properties, namely assortativity and density, of structured
P2P botnet communities and benign communities were discussed. In order to exploit these
differences, a novel measure, mean regular degree mreg, is proposed, which captures the
assortativity and the density of a graph, and the properties of mreg were studied. The proposed
algorithm combines the use of greedy community detection by optimizing Qw−log−v and community
filtering using mreg in order to identify nodes that are part of a structured P2P botnet.
Accuracy
The algorithm is tested extensively on a large number of datasets and found to be comparable
in performance with BotGrep in most cases. In the case of the Abilene datasets, the proposed
method achieves FScores of about 0.9 in the majority of the cases. The proposed method
is reasonably robust under conditions of partial visibility: in the case of sparser background
graphs it achieves FScores around 0.9 and is comparable to BotGrep. The performance is
affected when datasets with denser backgrounds are considered, but it still achieves FScores of
about 0.7. The proposed method achieves FScores of around 0.9 in the case of the sparser and
harder to detect LEET-Chord botnet topology [30].
Efficiency
The runtime of the algorithm is found to be 300 times lower than that of BotGrep on graphs of
tens of millions of edges. The algorithm is able to handle an 8 million node and 50 million
edge graph in roughly 20 minutes on a single core, which is less than the time duration of the
captured traces, indicating that the method can be applied in real time.
6.2 Directions for Future Work
The accuracy of the Louvain method may be increased by considering the packet level features
over the graph topology. These features need to be captured only for the hosts not discarded
at the end of Stage 1 of the algorithm, which may be feasible even for backbone level traffic.
These features can then be used to weight the graph edges based on feature vector similarity
functions.
The scalability of the Louvain method can be further improved by implementing it in a
distributed environment, allowing it to handle graphs with very high memory requirements.
The field of botnet detection in general can benefit greatly from the rapid improvements tak-
ing place in the field of Community Detection Algorithms. Dynamic Community Detection
algorithms can incorporate temporal features of the traffic graph to perform incremental detec-
tion/tracking of botnets. Overlapping Community Detection Algorithms can be used to handle
cases where a node is infected with more than one bot. Local Community Identification Algo-
rithms can be used to detect other members of the botnet given a few seed nodes.
References
[1] Paul Barford and Vinod Yegneswaran. An inside look at Botnets. In Mihai Christodorescu,
Somesh Jha, Douglas Maughan, Dawn Song, and Cliff Wang, editors, Malware
Detection, volume 27 of Advances in Information Security, pages 171–191. Springer US,
Boston, MA, 2007.
[2] James R. Binkley and Suresh Singh. An algorithm for anomaly-based botnet detection.
page 7, July 2006.
[3] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast
unfolding of communities in large networks. Journal of Statistical Mechanics: Theory
and Experiment, 2008(10):6, March 2008.
[4] Hyunsang Choi, Heejo Lee, and Hyogon Kim. BotGAD. In COMSWARE ’09 Proceedings
of the Fourth International ICST Conference on COMmunication System softWAre
and middlewaRE, page 1, New York, New York, USA, June 2009. ACM Press.
[5] Benoit Claise. Cisco systems netflow services export version 9. 2004.
[6] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure
in very large networks. Physical review E, 70(6):066111, 2004.
[7] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions
in empirical data. SIAM review, 51(4):661–703, 2009.
[8] Michele Coscia, Fosca Giannotti, and Dino Pedreschi. A Classification for Community
Discovery Methods in Complex Networks. 2012.
[9] Baris Coskun, Sven Dietrich, and Nasir Memon. Friends of an enemy. In ACSAC ’10 Pro-
ceedings of the 26th Annual Computer Security Applications Annual Conference, page
131, New York, New York, USA, December 2010. ACM Press.
[10] David Dagon. Botnet detection and response. In OARC workshop, volume 2005, 2005.
[11] David Dagon, Guofei Gu, Christopher P. Lee, and Wenke Lee. A Taxonomy of Botnet
Structures. In ACSAC ’07 Proceedings of the 23rd Annual Computer Security Applica-
tions Annual Conference, pages 325–339. IEEE, December 2007.
[12] Inc. Damballa. Household Botnet Infections, 2012.
[13] Carlton R. Davis, Stephen Neville, Jose M. Fernandez, Jean-Marc Robert, and John
Mchugh. Structured Peer-to-Peer Overlay Networks: Ideal Botnets Command and Con-
trol Infrastructures? In Sushil Jajodia and Javier Lopez, editors, ESORICS ’08 Pro-
ceedings of the 13th European Symposium on Research in Computer Security: Computer
Security, volume 5283 of Lecture Notes in Computer Science, pages 461–480, Berlin,
Heidelberg, October 2008. Springer Berlin Heidelberg.
[14] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigen-
vectors: a multilevel approach. Pattern Analysis and Machine Intelligence, IEEE Trans-
actions on, 29(11):1944–1957, 2007.
[15] Maryam Feily, Alireza Shahrestani, and Sureswaran Ramadass. A Survey of Botnet and
Botnet Detection. In ICESIST ’09 Third International Conference on Emerging Security
Information, Systems and Technologies, pages 268–273. IEEE, June 2009.
[16] Charles M Fiduccia and Robert M Mattheyses. A linear-time heuristic for improving
network partitions. In Design Automation, 1982. 19th Conference on, pages 175–181.
IEEE, 1982.
[17] Lester R Ford and Delbert R Fulkerson. Maximal flow through a network. Canadian
Journal of Mathematics, 8(3):399–404, 1956.
[18] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174,
February 2010.
[19] Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Pro-
ceedings of the National Academy of Sciences, 104(1):36–41, 2007.
[20] Jerome Francois, Shaonan Wang, Radu State, and Thomas Engel. BotTrack: tracking
botnets using NetFlow and PageRank. In NETWORKING ’11 Proceedings of the 10th
international IFIP TC 6 conference on Networking, pages 1–14, May 2011.
[21] Frederic Giroire, Jaideep Chandrashekar, Nina Taft, Eve Schooler, and Dina Papagian-
naki. Exploiting Temporal Persistence to Detect Covert Botnet Channels. In RAID ’09
Proceedings of the 12th International Symposium on Recent Advances in Intrusion De-
tection, pages 326–345, 2009.
[22] M. Girvan and M. E. J. Newman. Community structure in social and biological net-
works. Proceedings of the National Academy of Sciences of the United States of America,
99(12):7821–6, June 2002.
[23] Sergey Golovanov and Igor Soumenkov. TDL4 – Top Bot, 2011.
[24] Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. BotMiner: clustering analy-
sis of network traffic for protocol- and structure-independent botnet detection. In USENIX
Security ’08 Proceedings of the 17th USENIX Security Symposium, pages 139–154, July
2008.
[25] Guofei Gu, Phillip Porras, Vinod Yegneswaran, Martin Fong, and Wenke Lee. BotH-
unter: detecting malware infection through IDS-driven dialog correlation. In USENIX
Security ’07 Proceedings of the 16th USENIX Security Symposium, page 12, August
2007.
[26] Guofei Gu, Junjie Zhang, and Wenke Lee. BotSniffer: Detecting botnet command and
control channels in network traffic. In NDSS ’08 Proceedings of the 15th Annual Network
and Distributed System Security Symposium, page 18. Citeseer, 2008.
[27] Nicholas Ianelli and Aaron Hackworth. Botnets as a vehicle for online crime. Technical
report, CERT Coordination Center, 2005.
[28] Marios Iliofotou, Brian Gallagher, Tina Eliassi-Rad, Guowu Xie, and Michalis Falout-
sos. Profiling-By-Association. In Co-NEXT ’10 Proceedings of the 6th International
COnference on emerging Networking EXperiments and Technologies, page 1, New York,
New York, USA, November 2010. ACM Press.
[29] Padmini Jaikumar and Avinash C. Kak. A graph-theoretic framework for isolating botnets
in a network. Security and Communication Networks, pages n/a–n/a, February 2012.
[30] Mark Jelasity and Vilmos Bilicki. Towards automated detection of peer-to-peer botnets:
on the limits of local approaches. In LEET ’09 Proceedings of the 2nd USENIX confer-
ence on Large-scale exploits and emergent threats: botnets, spyware, worms, and more,
page 3, April 2009.
[31] Jham3. File:Network Community Structure.svg, 2011.
[32] Hongling Jiang and Xiuli Shao. Detecting P2P botnets by discovering flow dependency
in C&C traffic. Peer-to-Peer Networking and Applications, June 2012.
[33] M Frans Kaashoek and David R Karger. Koorde: A simple degree-optimal distributed
hash table. In Peer-to-Peer Systems II, pages 98–107. Springer, 2003.
[34] George Karypis and Vipin Kumar. Metis-unstructured graph partitioning and sparse ma-
trix ordering system, version 2.0. 1995.
[35] BW Kernighan and S Lin. An efficient heuristic procedure for partitioning graphs. Bell
system technical journal, 1970.
[36] R Lambiotte, J. C. Delvenne, and M Barahona. Laplacian Dynamics and Multiscale
Modular Structure in Networks. arXiv preprint arXiv:0812.1770, pages 1–29, December
2008.
[37] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for
testing community detection algorithms. page 6, May 2008.
[38] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of
linear differential and integral operators. United States Governm. Press Office, 1950.
[39] Wei Lu, Mahbod Tavallaee, and Ali A. Ghorbani. Automatic discovery of botnet com-
munities on large-scale communication networks. In ASIACCS ’09 Proceedings of the
4th International Symposium on Information, Computer, and Communications Security,
page 1, New York, New York, USA, March 2009. ACM Press.
[40] James MacQueen et al. Some methods for classification and analysis of multivariate
observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics
and probability, volume 1, page 14. California, USA, 1967.
[41] Mohammad M. Masud, Tahseen Al-khateeb, Latifur Khan, Bhavani Thuraisingham, and
Kevin W. Hamlen. Flow-based identification of botnet traffic by mining multiple log
files. In 2008 First International Conference on Distributed Framework and Applica-
tions, pages 200–206. IEEE, October 2008.
[42] Petar Maymounkov and David Mazieres. Kademlia: A peer-to-peer information system
based on the xor metric. In Peer-to-Peer Systems, pages 53–65. Springer, 2002.
[43] Hardy Michael. File:De bruijn graph-for binary sequence of order 4.svg, 2006.
[44] Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, and Nikita Borisov.
BotGrep: finding P2P bots with structured graph analysis. In USENIX Security’10 Pro-
ceedings of the 19th USENIX Security Symposium, pages 7–7, August 2010.
[45] Mark EJ Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.
[46] Mark EJ Newman. A measure of betweenness centrality based on random walks. Social
networks, 27(1):39–54, 2005.
[47] Mark EJ Newman. Finding community structure in networks using the eigenvectors of
matrices. Physical review E, 74(3):036104, 2006.
[48] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the
National Academy of Sciences, 103(23):8577–8582, 2006.
[49] Pascal Pons and Matthieu Latapy. Computing communities in large networks using
random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293.
Springer, 2005.
[50] Phillip Porras, Hassen Saidi, and Vinod Yegneswaran. A Multi-perspective Analysis of
the Storm (Peacomm) Worm, 2007.
[51] Phillip Porras, Hassen Saïdi, and Vinod Yegneswaran. A foray into Conficker’s logic
and rendezvous points. In 2nd Usenix Workshop on Large-Scale Exploits and Emergent
Threats (LEET’09), page 7, April 2009.
[52] Bill Pringlemeir. File:Dht example.png, 2007.
[53] Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico
Parisi. Defining and identifying communities in networks. Proceedings of the National
Academy of Sciences of the United States of America, 101(9):2658–2663, 2004.
[54] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time al-
gorithm to detect community structures in large-scale networks. Physical Review E,
76(3):036106, 2007.
[55] Matei Ripeanu. Peer-to-peer architecture case study: Gnutella network. In Peer-to-Peer
Computing, 2001. Proceedings. First International Conference on, pages 99–100. IEEE,
2001.
[56] Martin Rosvall, Daniel Axelsson, and Carl T Bergstrom. The map equation. The Euro-
pean Physical Journal Special Topics, 178(1):13–23, 2009.
[57] Venu Satuluri and Srinivasan Parthasarathy. Scalable graph clustering using stochastic
flows: applications to community discovery. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 737–746.
ACM, 2009.
[58] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, August
2007.
[59] Michael T Schaub, Jean-Charles Delvenne, Sophia N Yaliraki, and Mauricio Barahona.
Markov dynamics as a zooming lens for multiscale community detection: non clique-like
communities and the field-of-view limit. PloS one, 7(2):e32210, 2012.
[60] Antoine Schonewille and Dirk-Jan van Helmond. The domain name service as an IDS.
Research Project for the Master System-and Network Engineering at the University of
Amsterdam, 2006.
[61] Xuemin Shen, Heather Yu, John Buford, and Mursalin Akon. Handbook of peer-to-peer
networking, volume 1. Springer Heidelberg, 2010.
[62] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Anal-
ysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.
[63] Sergio S.C. Silva, Rodrigo M.P. Silva, Raquel C.G. Pinto, and Ronaldo M. Salles. Bot-
nets: A survey. Computer Networks, October 2012.
[64] Stefan Ortloff. FAQ: Disabling the new Hlux/Kelihos Botnet, 2013.
[65] Ion Stoica, Robert Morris, David Karger, M Frans Kaashoek, and Hari Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications. In ACM SIG-
COMM Computer Communication Review, volume 31, pages 149–160. ACM, 2001.
[66] W. Strayer, Robert Walsh, Carl Livadas, and David Lapsley. Detecting Botnets with Tight
Command and Control. In Proceedings. 2006 31st IEEE Conference on Local Computer
Networks, pages 195–202. IEEE, November 2006.
[67] Inc. Symantec. Internet Security Threat Report, 2013.
[68] Seth Terashima. File:Chord network.png, 2010.
[69] Gergely Tibely and Janos Kertesz. On the equivalence of the label propagation method
of community detection and a potts model approach. Physica A: Statistical Mechanics
and its Applications, 387(19):4982–4984, 2008.
[70] Vincent A Traag, Paul Van Dooren, and Y Nesterov. Narrow scope for resolution-limit-
free community detection. Physical Review E, 84(1):016114, 2011.
[71] Stijn Marinus van Dongen. Graph clustering by flow simulation. 2000.
[72] Twan van Laarhoven and Elena Marchiori. Robust community detection methods with
resolution parameter for complex detection in protein protein interaction networks. In
Pattern Recognition in Bioinformatics, pages 1–13. Springer, 2012.
[73] Twan van Laarhoven and Elena Marchiori. Graph clustering with local search optimiza-
tion: the resolution bias of the objective function matters most. Physical review. E,
Statistical, nonlinear, and soft matter physics, 87(1):012812, January 2013.
[74] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing,
17(4):395–416, 2007.
[75] C Walsworth, E Aben, K Claffy, and D Andersen. The ucsd caida anonymized 2011
internet traces, 2011.
[76] Ping Wang, Sherri Sparks, and Cliff C Zou. An advanced hybrid peer-to-peer botnet.
Dependable and Secure Computing, IEEE Transactions on, 7(2):113–127, 2010.
[77] Guanhua Yan, Stephan Eidenbenz, Sunil Thulasidasan, Pallab Datta, and Venkatesh
Ramaswamy. Criticality analysis of Internet infrastructure. Computer Networks,
54(7):1169–1182, May 2010.
[78] Ting-Fang Yen and Michael K. Reiter. Are Your Hosts Trading or Plotting? Telling
P2P File-Sharing and Bots Apart. In ICDCS ’10 IEEE 30th International Conference on
Distributed Computing Systems, pages 241–252. IEEE, June 2010.
[79] Ting-Fang Yen and Michael K. Reiter. Revisiting botnet models and their implications
for takedown strategies. In Pierpaolo Degano and Joshua D. Guttman, editors, POST’12
Proceedings of the First international conference on Principles of Security and Trust,
volume 7215 of Lecture Notes in Computer Science, pages 249–268, Berlin, Heidelberg,
March 2012. Springer Berlin Heidelberg.
[80] Hossein Rouhani Zeidanloo, M. Safari, and Mazdak Zamani. A taxonomy of Botnet
detection techniques. In ICCSIT ’10 3rd International Conference on Computer Science
and Information Technology, pages 158–162. IEEE, July 2010.
[81] Junjie Zhang, Roberto Perdisci, Wenke Lee, Unum Sarfraz, and Xiapu Luo. Detecting
stealthy P2P botnets using statistical traffic fingerprints. In 2011 IEEE/IFIP 41st Inter-
national Conference on Dependable Systems & Networks (DSN), pages 121–132. IEEE,
June 2011.