
Fast Identification of Structured P2P Botnets using Community Detection Algorithms

A THESIS

SUBMITTED FOR THE DEGREE OF

Master of Science (Engineering)

IN THE FACULTY OF ENGINEERING

by

Bharath Venkatesh

Supercomputer Education and Research Centre

Indian Institute of Science

BANGALORE – 560 012

July 2013


©Bharath Venkatesh

July 2013

All rights reserved

TO

My Parents

Prof N Balakrishnan

Sudip and Naimisha

Acknowledgements

I wish to express my sincerest gratitude to my research supervisor Prof. N. Balakrishnan. His

mastery of diverse subjects, thoughtful guidance and ideas opened up a completely new vista

of knowledge for me and reshaped my way of thinking. His hard work and dedication to science

make him my role model, whom I will always look up to, and I will cherish the invaluable time

spent with him. I am also indebted for life to him for his utmost support, encouragement and

inspiration throughout the period. It is because of him that I was able to move to the exciting

world of Computer Science.

I thank Prof. R. Govindarajan, the chairman of SERC, and my course advisors who have

helped me immensely during the entire course of my stay in IISc. I always feel fortunate for

the lifetime opportunity to work in this institute alongside all eminent scientists and stay in

this wonderful campus.

I thank Shishir Nagaraja for providing datasets used in this thesis and implementation guide-

lines and source code for the BotGrep Algorithm. I also thank him for helpful discussions

related to this work. I am also grateful to Prof. Virgilio Almeida, Ponnurangam K and all others who visited our

lab; they enhanced my general exposure and provided suggestions to improve the quality and

presentation of my work.

I am very grateful to Ms. Nagarathna, Ms. Swarna and Mr. Ravi for all their support through-

out my tenure. I also thank SERC and the Information Systems Lab for providing us the best

computing facilities. I am very thankful

I always feel very fortunate to have been a part of a wonderful family during my stay in the lab,

which has been my home away from home. I would like to especially thank Sudip who has

had a great influence on my approach to research and life. My heartfelt thanks to Naimisha,


who has been the big sister I have always wanted. Without her, this thesis would have never

taken shape. Special thanks to Saradha who completes this amazing gang of four which was

one of the most important parts of my life in IISc. Nikhil has also helped me in several areas

of my work and has been great company. I would also like to thank Pritam, Prashant, Venkat,

Indira Ma’am, Negi, Nivedita and all other members of Information Systems Lab who have

continuously supported me. I am indebted to all of them for providing a healthy atmosphere,

stimulating and fun environment to learn and grow. I will always cherish the moments spent

with them.

My heartfelt appreciation to Aravind, Kamala, Gopal, Hari K, Ashwin and all other friends

from IISc for extending a helping hand and making my stay at IISc memorable for a lifetime.

I would also like to acknowledge the tremendous support received from Satyaki, Abhilash,

Prasanth, Lavanya and many other friends outside IISc.

Lastly, but most importantly, I greatly thank my parents for their unconditional support and

love throughout. They mean the world to me. I would also like to thank Mrs Padmavathy

Sundarrajan, a new found friend who helped keep my spirits up especially when it was most

needed.

Abstract

Botnets are a global problem, and effective botnet detection requires cooperation of large In-

ternet Service Providers, allowing near global visibility of traffic that can be exploited to detect

them. The global visibility comes with huge challenges, especially in the amount of data that

has to be analysed. To handle such large volumes of data, a robust and effective detection

method is the need of the hour and it must rely primarily on a reduced or abstracted form of

data such as a graph of hosts, with the presence of an edge between two hosts if there is any

data communication between them. Such an abstraction would be easy to construct and store,

as very little of the packet needs to be looked at.

Structured P2P command and control has been shown to be robust against targeted and random node failures, and is therefore an ideal mechanism for botmasters to organize and command their botnets effectively. This thesis therefore develops a scalable, efficient and robust algorithm for the detection of structured P2P botnets in large traffic graphs. It draws on advances in the state of the art in Community Detection, which aims to partition a graph into dense communities.

Popular Community Detection Algorithms with low theoretical time complexities, such as Label Propagation, Infomap and the Louvain Method, have been implemented and compared on large LFR benchmark graphs to study their efficiency. The Louvain method is found to be capable of handling graphs of millions of vertices and billions of edges. This thesis analyses the performance of this method with two objective functions, Modularity and Stability, and finds that neither of them is robust and general.


In order to overcome the limitations of these objective functions, a third objective function proposed in the literature is considered. This objective function has previously been used successfully on Protein Interaction Networks, and is used in this thesis to detect structured P2P botnets for the first time. Further, the differences in two topological properties, assortativity and density, between structured P2P botnet communities and benign communities are discussed. In order to exploit these differences, a novel measure based on mean regular degree is proposed, which captures both the assortativity and the density of a graph, and its properties are studied.

This thesis proposes a robust and efficient algorithm that combines greedy community detection with community filtering using the proposed measure, mean regular degree. The proposed algorithm is tested extensively on a large number of datasets and is found to be comparable in performance, in most cases, to an existing botnet detection algorithm called BotGrep, while being significantly faster.

Contents

Acknowledgements
Abstract
List of Tables
List of Figures

1 Introduction
    1.1 Botnets and Botnet Detection
    1.2 Complex Networks and Community Detection
    1.3 Motivation and Objective
    1.4 Approach
    1.5 Organization of the Thesis

2 Botnets and Botnet Detection
    2.1 Introduction
    2.2 Operation of a Typical Bot
    2.3 Botnet Command and Control
        2.3.1 Centralized Command and Control
        2.3.2 Decentralized or Peer-to-Peer (P2P) Command and Control
    2.4 Botnet Detection
        2.4.1 Preliminaries
        2.4.2 Methods of Botnet Detection
    2.5 Detection of Structured P2P Botnets in Large Scale Networks
        2.5.1 BotGrep

3 Community Detection Algorithms
    3.1 Introduction
    3.2 Graphs
        3.2.1 Edges, Directionality and Weights
        3.2.2 Adjacency Matrix, Degree and Transition Probability Matrix
        3.2.3 Degree Distributions and Random Graph Models and Assortativity
        3.2.4 Power Laws or Scale Free Networks
        3.2.5 Random Graph Models
        3.2.6 Assortativity
        3.2.7 Paths, Connected Components, and Betweenness Centrality
        3.2.8 Subgraphs, Covers and Partitions
    3.3 Community and Community Structure
        3.3.1 Definitions
        3.3.2 Partition Quality Functions
    3.4 Community Detection Algorithms
    3.5 Efficiency Comparison of Community Detection Algorithms
        3.5.1 Dataset - LFR Benchmarks
        3.5.2 Discussion and Algorithm Selection
        3.5.3 Conclusion

4 Identifying Structured P2P Botnets using the Louvain Method
    4.1 Introduction
    4.2 The Louvain Method
        4.2.1 Greedy Modularity Optimization
        4.2.2 Community Aggregation
    4.3 Dataset Generation
        4.3.1 Network Traces
        4.3.2 Background Graph Construction and Properties
        4.3.3 Structured P2P Graph Generation
        4.3.4 Embedding the Botnet
    4.4 Application of the Louvain method to identify structured P2P botnets
        4.4.1 Datasets and Evaluation
        4.4.2 Results and Discussion
    4.5 Community Detection at different resolutions and Multiresolution Modularity
        4.5.1 Resolution Limit and Multiresolution Modularity
        4.5.2 Stability and Stability Optimization
    4.6 Optimization of Stability using the Louvain Method to identify Structured P2P Botnets
        4.6.1 Datasets and Evaluation
        4.6.2 Results and Discussion
    4.7 Summary

5 A robust algorithm for identification of Structured P2P Botnets
    5.1 Introduction
    5.2 Obtaining Small and Homogeneous Communities
        5.2.1 An alternative Objective Function - Q_{w-log-v}
        5.2.2 Optimizing Q_{w-log-v} - Louvain method vs Single Step Greedy Optimization
    5.3 Differentiating between bot and benign communities
        5.3.1 Properties of Structured P2P Botnets vs Properties of the Background
        5.3.2 Properties of the small and homogeneous communities obtained by the greedy optimization of Q_{w-log-v}
        5.3.3 Mean Regular Degree m_{reg}
    5.4 Robust and efficient method to identify nodes that are part of structured P2P Botnet
        5.4.1 Stage 1
        5.4.2 Stage 2
    5.5 Evaluation
        5.5.1 Metrics
        5.5.2 Performance on Abilene Trace Graphs
    5.6 Robustness of the proposed algorithm
        5.6.1 Robustness under conditions of partial visibility
        5.6.2 Performance on the LEET-Chord Topology
        5.6.3 Efficiency and Scalability
    5.7 Conclusion

6 Summary and Conclusions
    6.1 Summary of Contributions
        6.1.1 Efficiency Comparison of Community Detection Algorithms
        6.1.2 Detection of Structured P2P Botnets using the Louvain Method
        6.1.3 Robust and Efficient method to detect Structured P2P Botnets
    6.2 Directions for Future Work

References

List of Tables

4.1 Properties of Graphs extracted from the Network Traces
4.2 Comparison of Louvain Modularity and BotGrep on Abilene Traces
4.3 Performance Summary of Modularity Optimization using the Louvain Method, with reference to BotGrep
4.4 Optimizing Stability at different values of t on Abilene Traces using the Louvain Method
4.5 Performance Summary of Modularity Optimization vs Stability Optimization (t=0.25) using the Louvain Method, with reference to BotGrep

5.1 Community Structure obtained by optimization of Q_{w-log-v} using the Louvain method on CHORD Botnets embedded in Abilene WASH Router Trace Graphs
5.2 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs - Precision and Recall
5.3 Summary of the proposed method with reference to BotGrep on the Abilene traces (FScore)
5.4 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs under conditions of partial visibility - Precision and Recall
5.5 Performance (FScore) Summary of the proposed method with reference to BotGrep on the Abilene traces under conditions of partial visibility
5.6 Performance comparison of the proposed method and BotGrep on LEET-Chord graphs embedded in Abilene Trace Graphs - Precision and Recall
5.7 Performance of the Proposed Method on CAIDA Datasets - Precision, Recall and FScore

List of Figures

2.1 Botnet Life Cycle
2.2 Botnet Command and Control Topologies
2.3 A CHORD Graph with 16 nodes (Image from [68])
2.4 A DeBruijn Graph of 3 length string of alphabet 01 (Image from [43])
2.5 A network partition for node 6 in a Kademlia network of 8 nodes ([52])

3.1 Communities in a Graph (Image from [31])
3.2 Comparison of CDA on LFR Benchmarks
3.3 Comparison of CDA on LFR Benchmarks - Louvain vs CNM

4.1 The Louvain Method (Image from [3])
4.2 An example of a graph with a superimposed botnet (Image from [44])
4.3 Performance of the Louvain Method on Abilene Trace Graphs
4.4 Performance of Stability Optimization on Abilene Trace Graphs
4.5 Comparison of Stability Optimization (t=0.25) and BotGrep on Abilene Trace Graphs

5.1 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs - FScore
5.2 Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs under conditions of partial visibility - FScore
5.3 Performance comparison of the proposed method and BotGrep on LEET-Chord graphs embedded in Abilene Trace Graphs - FScore
5.4 Runtime comparison of proposed method and BotGrep: Abilene 1 Day Traces and 1000 Node CHORD Botnet

Chapter 1

Introduction

1.1 Botnets and Botnet Detection

Computers today are subject to infections from a variety of malicious software or malware

such as viruses, trojans, worms and keyloggers. Such infections are a serious threat to the security

and privacy of the user. However, cyber-criminals went a step further and created networks of

these malware infected or compromised hosts, that operate in coordination, ready to do their

bidding and engage in activities that potentially threaten the security of the entire Internet.

These networks of compromised computers (called zombies or bots) are called botnets. The

controller of these hosts – the botmaster or botherder can control the entire botnet remotely,

and thus has an illegal distributed cloud of computers in his possession, which he can exploit

for carrying out malicious activities for economic or political gain. Botnets are typically used

to send spam e-mail and are responsible for about 80% of the spam e-mail [67]. They are used

to execute Distributed Denial of Service (DDoS) attacks, perform click-fraud, host phishing

sites, and harvest sensitive and private information such as credit card numbers and passwords

[27]. As of 2012, an estimated 3-7% of enterprise hosts and 10% of home computers were

found to be bot-infected, according to Damballa Inc. [12].

Botnets are controlled and coordinated by a command and control channel that may be cen-

tralized or decentralized. Centralized mechanisms have one or more command and control

servers which the bots connect to in order to receive orders. Decentralized or peer-to-peer


(P2P) mechanisms are gaining preference among botmasters owing to their resilience and re-

sistance against targeted attacks, which would dismantle botnets having central servers. The

first step in mitigating the botnet threat is detection. Host-based methods operate similarly to

anti-virus systems and detect activities of the bot in the host system. Network-based methods

rely on features obtained by passive monitoring of network traffic. Network-based approaches

are the most popular owing to the relative ease of deployment. A variety of techniques from

different areas have been applied to network-based detection of botnets. Traffic mining,

clustering, correlation, entropy analysis, stochastic modelling, time series analysis and other

machine learning based techniques have been proposed in the literature and are surveyed by Silva et

al. [63].

1.2 Complex Networks and Community Detection

Most real world systems can be modelled as complex networks or graphs where vertices rep-

resent the entities and the presence of an edge represents an interaction of any kind between

them. The field of complex networks is aimed at studying the topological properties of these

networks and understanding their dependence on the function of the real world systems. The

field is interdisciplinary, and has seen contributions from biologists, computer scientists, physi-

cists and statisticians.

In most complex networks, there exists community structure, where certain groups or communities

of nodes are more tightly knit as compared to the rest of the graph. It is of interest

to study these communities, as they can reveal information about the structure and dynamics

of the system, and reveal entities that are similar. Community Detection Algorithms aim to

detect the community structure in a graph by partitioning it into densely connected subgraphs,

and have been a hot topic for research in the complex networks community.
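
As a rough illustration of what "more tightly knit" means, the following minimal Python sketch compares the edge density inside a candidate community with the density of the whole graph. The toy graph (two triangles joined by a single edge) and the helper names `build_adjacency` and `edge_density` are assumptions made purely for illustration; they do not come from the thesis.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def edge_density(adjacency, nodes):
    """Fraction of possible node pairs within `nodes` that are joined by an edge."""
    members = set(nodes)
    n = len(members)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal = sum(1 for u in members for v in adjacency[u] if v in members) // 2
    return internal / (n * (n - 1) / 2)


# Hypothetical toy graph: two triangles connected by a single bridge edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_graph = build_adjacency(toy_edges)

inside = edge_density(toy_graph, [0, 1, 2])  # density within one triangle
overall = edge_density(toy_graph, range(6))  # density of the whole graph
```

In this toy case each triangle is fully connected internally while the graph as a whole is much sparser, which is the kind of contrast a community detection algorithm tries to find automatically.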

With network traffic data modelled as graphs, techniques such as community detection algorithms can be brought from the field of complex networks to tackle problems in computer network security such as botnet detection.


1.3 Motivation and Objective

Peer-to-Peer topologies can be structured or unstructured. Structured P2P topologies were shown to be ideal topologies for botnets by Davis et al. [13] on the basis of their resilience to dismantling and destabilization. An important observation is that botnets are a global threat. Effective mitigation of this threat is in the interest of all nations and corporations, and international cooperation is needed. Botnet detection and mitigation is in the interest of Internet Service Providers (ISPs) as well, owing to the bandwidth wasted on the malicious applications and spam e-mails typically associated with botnets.

Assuming the co-operation of the most important, or Tier-1, ISPs, passive monitors can be deployed to collect traffic at the backbone routers of these large ISPs. This will result in global visibility of network traffic that can be exploited to detect botnets.

The large volume of traffic renders most current detection methods impractical. In such a setting, only a reduced or abstracted form of the data can be handled effectively. A simple abstraction is a graph whose nodes are hosts, with an edge between two hosts if they exchange a packet. Even after this abstraction, handling these large graphs (millions of nodes, hundreds of millions of edges) is still a challenge.
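
The abstraction described above can be sketched in a few lines of Python. The sketch assumes that traffic has already been reduced to (source, destination) address pairs; the function names `build_host_graph` and `edge_count` and the sample addresses are hypothetical, and a real deployment at backbone scale would need a far more compact representation.

```python
from collections import defaultdict


def build_host_graph(flows):
    """Collapse (source, destination) pairs into an undirected host graph.

    Only the endpoints are used; payloads, ports and packet counts are ignored,
    and repeated communication between the same pair yields a single edge.
    """
    adjacency = defaultdict(set)
    for src, dst in flows:
        if src == dst:
            continue  # ignore self-communication
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return dict(adjacency)


def edge_count(adjacency):
    """Number of undirected edges in the host graph."""
    return sum(len(neighbours) for neighbours in adjacency.values()) // 2


# Hypothetical flow records (addresses are illustrative only).
sample_flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.1"),  # reverse direction, same edge
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.4", "10.0.0.4"),  # self-loop, dropped
]
host_graph = build_host_graph(sample_flows)
```

Storing only the set of neighbours per host is what keeps the abstraction cheap to construct: very little of each packet needs to be inspected.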

Nagaraja et al. proposed BotGrep [44], which works on such a graph constructed from network traffic and uses the topological properties of botnet command and control (C2C) communication graphs to separate them from benign traffic.

BotGrep is tested on synthetic botnet topologies superimposed on a graph constructed from

real world backbone traces, and is found to give high accuracy on the datasets tested.

A structured P2P botnet should have a high internal connectivity among its members so as

to achieve robustness against targeted and random failures. This provides the motivation for

using community detection algorithms in order to detect them. Nagaraja et al. [44] have also

compared the performance of BotGrep with several Community Detection Algorithms on a

scaled down sampled graph, and conclude that


“While these traditional techniques were not intended to scale to the large data sets we consider

here, they may be appropriate for localizing smaller botnets in contained environments (e.g.,

within a single Honeynet, or the part of a botnet contained within an enterprise network)”[44]

Community Detection Algorithms have received a great deal of attention over the last few

years, and several scalable algorithms have been proposed that are capable of handling graphs

of millions of vertices and billions of edges. They are very general and can be easily adapted

for large scale detection of structured P2P botnets.

The aim of this thesis is to study in depth the applicability of Community Detection Algorithms to this problem, to develop a scalable, efficient and robust algorithm for the detection of structured P2P subgraphs that draws on advances in the state of the art in community detection, and to compare its performance with that of BotGrep.

1.4 Approach

This thesis first surveys the community detection algorithms used in complex networks. Based on

the theoretical analysis of time complexity available in the literature, only

those algorithms whose time complexity is linear in the number of edges, such as Label Propagation

[54], Infomap [56] and the Louvain Method [3], have been considered for implementation. This

exercise points to the conclusion that the Louvain method is most suitable for the identification

of structured P2P botnets in large networks.
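
To give a flavour of why such algorithms are cheap, the sketch below shows a simplified label propagation loop in Python: each sweep touches every edge a constant number of times. This is not the implementation evaluated in the thesis. It is a deterministic variant assumed here for reproducibility (nodes are visited in sorted order, and ties are broken by keeping the current label if possible, otherwise taking the largest label), whereas the published algorithm [54] randomises both the visiting order and the tie-breaking. The function name and the toy graph are hypothetical.

```python
from collections import Counter


def label_propagation(adjacency, max_sweeps=100):
    """Simplified, deterministic label propagation on an undirected graph."""
    labels = {node: node for node in adjacency}  # every node starts alone
    for _ in range(max_sweeps):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nbr] for nbr in adjacency[node])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # Assumed tie-break: keep the current label if it is among the
            # most frequent ones, otherwise adopt the largest candidate.
            new_label = labels[node] if labels[node] in candidates else max(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # no label changed during a full sweep
    return labels


# Hypothetical toy graph: two 4-cliques joined by the single edge 3-4.
two_cliques = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
communities = label_propagation(two_cliques)
```

On this toy input the two cliques end up with two distinct labels. The outcome of label propagation is known to depend on the visiting order and tie-breaking rule, so a different rule may merge the cliques.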

The application of the Louvain method as proposed in [3] resulted in performance comparable

to BotGrep for sparsely connected background networks, while for dense background graphs it

is found to be inferior. A deeper analysis brought out the need to improve the original Louvain

method particularly in situations where the background is dense.

The Louvain method from [3] uses modularity as the objective function. Stability Optimization,

proposed by Lambiotte et al. [36], modifies modularity to incorporate a parameter t,

which is the weight on the internal density of each community. The optimization of Stability


has also been implemented according to [36] and analysed for its applicability to botnet

detection. This technique appears to suffer from the disadvantage of increased computational

effort to search for an optimal value of t.
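
The role of t can be illustrated with a small Python sketch. It assumes one commonly quoted linearised form of stability for an unweighted, undirected graph, Q(t) = (1 - t) + sum over communities c of [ t * L_c / m - (d_c / 2m)^2 ], where m is the number of edges, L_c the number of edges inside c and d_c the total degree of c; at t = 1 this reduces to modularity. Whether this matches the exact formulation used later in the thesis is an assumption, and the function name and toy graph are hypothetical.

```python
def linearised_stability(adjacency, partition, t=1.0):
    """Score a partition; t weights the internal-edge term (t = 1 gives modularity)."""
    m = sum(len(neighbours) for neighbours in adjacency.values()) / 2
    score = 1.0 - t
    for community in partition:
        members = set(community)
        # Internal edges are counted from both endpoints, hence the division by 2.
        internal = sum(1 for u in members for v in adjacency[u] if v in members) / 2
        degree_sum = sum(len(adjacency[u]) for u in members)
        score += t * internal / m - (degree_sum / (2 * m)) ** 2
    return score


# Hypothetical toy graph: two triangles joined by the edge 2-3.
triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
two_groups = [{0, 1, 2}, {3, 4, 5}]
singletons = [{node} for node in triangles]

modularity = linearised_stability(triangles, two_groups, t=1.0)
coarse_at_low_t = linearised_stability(triangles, two_groups, t=0.25)
fine_at_low_t = linearised_stability(triangles, singletons, t=0.25)
```

Under this assumed form, the two-triangle partition scores best at t = 1, while at t = 0.25 the partition into singletons scores higher. This is the sense in which t acts as a resolution control, and it is also why a search over t adds to the computational cost.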

This formed the motivation for exploring another objective function, w-log-v, which was proposed and successfully applied to protein-protein interaction networks by Van Laarhoven and Marchiori [72]. The optimization of w-log-v using the Louvain Method resulted in the identification of small fragments of either botnets or benign communities with high precision.

The thesis then proposes a method to distinguish between the benign and botnet communities and aggregate the latter. This is achieved using a novel scoring function, which is based on the mean degree and degree homogeneity of the communities. Results from this combined technique are presented and compare favourably with those of BotGrep.

The overall process results in a novel technique for detecting structured P2P botnets, with the advantage of being almost 300 times faster than BotGrep for graphs of a million edges or more.

1.5 Organization of the Thesis

The rest of this thesis is organized into five chapters.

Chapter 2

This chapter describes botnets and the life cycle of a typical bot. Botnet command and control is described in detail, and various botnet detection techniques are surveyed. Prior art in the detection of botnets in large networks is discussed, and the BotGrep [44] method is reviewed in depth.

Chapter 3

This chapter contains a survey of Community Detection Algorithms (CDA). Algorithms found

to be scalable in terms of theoretical time complexity have been identified and implemented.

The running times of these algorithms are then compared on standard community detection

benchmark graphs.


Chapter 4

This chapter describes the Louvain Method in detail and applies it to detect structured

P2P botnets. For the purpose of the evaluation, the method to generate datasets is described.

The optimization of Modularity as well as Stability in order to detect structured P2P botnets is

studied.

Chapter 5

This chapter considers Q_{w-log-v} as an alternative objective function for optimization by the Louvain Method. The behaviour of the objective function on the datasets is studied. The novel measure m_{reg} is proposed to differentiate between benign and bot communities. The proposed algorithm, which combines the greedy optimization of Q_{w-log-v} with community filtering built around m_{reg}, is described. A comprehensive evaluation of the method is carried out.

Chapter 6

This chapter summarises the contributions, makes concluding remarks and lays down direc-

tions for future work.

Chapter 2

Botnets and Botnet Detection

2.1 Introduction

Botnets are networks of compromised hosts (or bots) that can be remotely controlled by an

attacker (or botmaster). A bot is typically connected to the Internet, and has been infected

by some malware so as to be able to communicate with the botmaster. The botmaster can

control each bot remotely over the network, and can instruct it to update the malware installed

on the bot system, send spam e-mails, execute denial-of-service attacks and harvest private

information. The life cycle of a typical bot is described in Section 2.2.

Bots are coordinated and controlled using a command and control (C2C) channel, which can be

centralized or peer-to-peer. The topology of the C2C determines the efficiency and resilience

of the botnet. A description of the C2C mechanisms of botnets is provided in Section 2.3.

Section 2.4 contains a survey of Botnet Detection techniques. Section 2.5 describes literature

on detection of botnets in large networks, with special emphasis on BotGrep[44].

2.2 Operation of a Typical Bot

A typical bot host has a life cycle as shown in Figure 2.1. The stages of this life cycle are: Infection, Rallying, Binary/Egg Download, Wait for Orders, Execution of Orders, and Termination.


Figure 2.1: Botnet Life Cycle

Infection

The bot can enter the host through a variety of infection vectors – user-mediated or facilitated

actions such as opening malicious e-mail attachments, downloading of malicious software

from a phishing-affected site. It can also enter through a drive-by download, where it enters

the user's system without the user's knowledge by exploiting common web browser vulnerabilities.

Network-based exploitation of vulnerabilities is another route of infection.

Rallying

After a successful infection, a bot has to communicate with its botmaster to inform him of

its infection. This is normally done through hard-coded domain names or IP addresses. More

recently, botnets have begun employing Domain Generation Algorithms [51] in order to

avoid hard-coding and so better hide the botmaster. Domain Generation Algorithms

create pseudo-random domain names using a seed known to the botmaster. The botmaster can then

register a small fraction of the domains from the generated sequence. As bots try

and connect to several of these domains per day, there is a good chance they will hit a domain


name registered by the botmaster.
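
From a defender's point of view, the key property of such a scheme is that anyone who knows the seed can reproduce the same candidate list. The Python sketch below is a toy illustration of that property only; it does not reproduce the algorithm of any real botnet. The hashing scheme, the function name `candidate_domains`, the seed string and the reserved `.example` suffix are all assumptions chosen for illustration.

```python
import hashlib
from datetime import date


def candidate_domains(seed, day, count=5, tld=".example"):
    """Derive a reproducible list of pseudo-random names from a seed and a date."""
    domains = []
    for index in range(count):
        material = f"{seed}:{day.isoformat()}:{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits (0-15) onto the letters 'a'..'p'.
        name = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(name + tld)
    return domains


# Hypothetical seed; an analyst who recovers it can enumerate the same names.
todays_candidates = candidate_domains("demo-seed", date(2013, 7, 1))
```

Because the list is a deterministic function of the seed and the date, an analyst who extracts the seed from a captured binary can precompute the names a bot will try on a given day, which is one reason such schemes are of interest for detection.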

Binary Download/Update

In many cases, the initial binary only contains the code to enable the botmaster to rally the bots.

A back-door is typically installed to allow remote access of the system. Bots are designed to

be modular, with pluggable components to execute different actions. After rallying, the bots

may be instructed by the botmaster to download the appropriate modules[1]. For instance a

botnet designed to send spam e-mail will be instructed to download spam mail templates and

relevant code needed to send mails[50].

Wait for Orders

After the bot has downloaded all the necessary components, it will go into a wait state. The bot

code will continue to run on the system, periodically polling for commands from the botmaster.

If it has a spyware component, it will continue to log keystrokes and try to harvest other

sensitive data from the system.

Execute Orders

On receiving orders from the botmaster, the bot may participate in a DDoS attack, send spam,

download a list of URLs to perform click-fraud or be instructed to crack passwords using brute

force techniques[27]. It may also be instructed to update itself with a newer version. Another

important action is to execute TCP scans of other Internet hosts in order to discover open and

vulnerable ports to exploit and recruit further bots.

Termination

In some cases, if the botmaster wants to dismantle the botnet, the bot may be instructed

to delete itself from the host, clearing all records of its existence.


(a) Centralized ([76]) (b) Unstructured P2P ([64]) (c) Structured P2P ([68])

Figure 2.2: Botnet Command and Control Topologies

2.3 Botnet Command and Control

What distinguishes a bot from other malware is its coordination with other bots that are part of the same botnet. This coordination is enabled through a command and control (C2C) channel between the botmaster and the bots. Command and control can be classified as centralized or decentralized, depending on the communication topology among the botmaster and the bots. A hybrid topology combining a centralized component and a peer-to-peer component has also been proposed in the literature[76].

2.3.1 Centralized Command and Control

In this topology, one or more command and control servers controlled by the botmaster are used to issue orders to the bots, and all the bots are aware of, and able to contact, these servers. Early botnets used the Internet Relay Chat (IRC) protocol, with C2C servers hosting channels that individual bots could join. The botmaster could use the channel to push commands to the bots. The popularity of the IRC protocol waned owing to the rise of Instant Messaging. In order to blend in with the predominant Hypertext Transfer Protocol (HTTP) based web traffic, botnets began to use HTTP, with individual bots pulling instructions from a web server.

2.3.2 Decentralized or Peer-to-Peer (P2P) Command and Control

The main weakness of centralized command and control is that it is prone to targeted attacks: if the C2C servers are taken down, the botnet is completely paralysed. To overcome this limitation, botmasters began to migrate to P2P command and control. A detailed survey of peer-to-peer networking can be found in [61]. In this mechanism, there are no fixed command and control servers, and the botmaster can control the botnet from any node. A bot polls the network for orders by looking for a certain file of commands, which the botmaster may upload to any node. The botnet thus functions by issuing lookups for the command files. The efficiency of a lookup is governed by the routing of the lookup message, which in turn depends on the geometry or organization of the hosts. Based on this organization, peer-to-peer control can be further classified as unstructured or structured.

Unstructured P2P

In this mode of P2P communication, a peer randomly selects other peers to connect to, and lookup is carried out by flooding or random walks. Gnutella[55], a popular file-sharing service, is an example of an unstructured P2P network.

Structured P2P

In this mode of P2P communication, the peers are part of a distributed hash table (DHT) that stores key-value pairs. Each peer and each data item is identified by a unique ID, and a hash function maps the keys to the nodes. Each node maintains a routing table, and the connections among the peers are structured so as to guarantee an upper bound on the number of hops required to locate a data item. Churn (the joining and leaving of hosts) is handled through replication and redundancy. Popular DHTs include CHORD, Koorde and Kademlia.

CHORD

CHORD [65] is a simple DHT in which keys are mapped to peers with the same IDs, based on consistent hashing using the SHA-1 algorithm. The peers are organized into a ring, with each node storing a total of $\log_2 N$ peers, where $N$ is the number of nodes; this includes its predecessor and successor in the ring. The graph is constructed using the rule that each node $i$ connects to nodes $i - 1$ and $i + 1$ to complete the ring, and has long-range links to nodes $(i + 2^k) \bmod N$ for $k = 1, \ldots, (\log_2 N) - 1$. These links form the routing table, or finger table, of each node, as illustrated in Figure 2.3. For key lookup, the node in the finger table with the closest ID is asked to search for the key. This happens recursively, and the lookup is done in $O(\log N)$ time in a manner similar to binary search, since the distance between the node requesting the key and the node that holds the key is halved at every hop.

Figure 2.3: A CHORD Graph with 16 nodes (Image from [68])
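The construction rule and the greedy lookup described above can be illustrated with a minimal sketch. It assumes that N is a power of two and that node IDs are simply the integers 0 to N - 1; the function names are hypothetical, and the routing is a simplification of the real CHORD protocol, which also handles joins, failures and key ownership.

```python
def chord_neighbours(i, n):
    """Neighbours of node i on an n-node ring (n assumed to be a power of two)."""
    log_n = n.bit_length() - 1            # log2(n) for a power of two
    ring = {(i - 1) % n, (i + 1) % n}     # predecessor and successor
    fingers = {(i + 2 ** k) % n for k in range(1, log_n)}  # long-range links
    return ring | fingers


def lookup_hops(src, key, n):
    """Count hops when every node forwards to the neighbour that gets
    closest to the key in clockwise distance without overshooting it."""
    cur, hops = src, 0
    while cur != key:
        remaining = (key - cur) % n
        candidates = [nb for nb in chord_neighbours(cur, n)
                      if 0 < (nb - cur) % n <= remaining]
        cur = max(candidates, key=lambda nb: (nb - cur) % n)
        hops += 1
    return hops
```

For the 16-node ring of Figure 2.3, `chord_neighbours(0, 16)` gives nodes 15, 1, 2, 4 and 8, and no lookup in this sketch needs more than log2(16) = 4 hops.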

KOORDE

KOORDE [33] is a DHT similar to CHORD, except that the peers are organised according to a de Bruijn graph of constant degree $k$. A de Bruijn graph represents relationships between strings. The node set $V$ comprises every possible string of length $n$ over an alphabet of $m$ letters. The edge set $E$ contains a directed edge from a string $u$ to a string $v$ if the former can be transformed into the latter by removing its first letter and appending a letter. An example de Bruijn graph based on all strings of length 3 over the alphabet {0, 1} is depicted in Figure 2.4. Consistent hashing is used to map keys to nodes based on IDs. Lookup takes place by shifting $k$ bits of the key and checking whether such a node exists in the finger table; if not, the key is shifted by $k$ bits again until an appropriate node is found, which recursively applies the same technique to find the file. Lookup is thus done in $O(\log_k N)$ hops.

Figure 2.4: A de Bruijn graph of strings of length 3 over the alphabet {0, 1} (Image from [43])

KADEMLIA

The Storm botnet[50] was based on the Overnet/Kad network, which used a KADEMLIA-based DHT; the TDL-4 botnet currently uses the Kad network[23]. In the KADEMLIA DHT[42], peers are organized as the leaves of a binary tree based on their IDs. Each node views the entire tree as a partition into subtrees (called buckets) of logarithmically decreasing sizes (Figure 2.5) and has an entry for a node in each bucket. In the example in Figure 2.5, a graph of 8 nodes with IDs 0-7, the node with ID $(110)_2$ has edges to $(111)_2$, $(100)_2$ and $(001)_2$. Lookup proceeds by finding the nearest node according to an XOR distance between the key and the node IDs in the routing table. This process is then carried out recursively, halving the distance to the actual location at each step and allowing lookup in $O(\log N)$ hops.

Figure 2.5: A network partition for node 6 in a Kademlia network of 8 nodes ([52])
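The bucket structure and the XOR metric of the KADEMLIA example can be made concrete with a short sketch. It assumes plain integer IDs and a bucket numbering in which bucket 0 is the nearest subtree; the function names are hypothetical and do not come from any Kademlia implementation.

```python
def bucket_index(node_id, other_id):
    """Bucket of node_id that contains other_id (0 is the nearest bucket).

    The index is the position of the highest bit in which the IDs differ.
    """
    distance = node_id ^ other_id          # XOR distance
    if distance == 0:
        raise ValueError("a node does not appear in its own buckets")
    return distance.bit_length() - 1


def buckets(node_id, all_ids):
    """Group every other ID into the bucket that node_id would place it in."""
    grouped = {}
    for other in all_ids:
        if other != node_id:
            grouped.setdefault(bucket_index(node_id, other), []).append(other)
    return grouped


def closest(key, routing_table):
    """Routing-table entry with the smallest XOR distance to the key."""
    return min(routing_table, key=lambda entry: entry ^ key)
```

For node 6, i.e. (110)_2, in a network with IDs 0-7, `buckets(6, range(8))` yields the subtrees {7}, {4, 5} and {0, 1, 2, 3}, which agrees with the contacts (111)_2, (100)_2 and (001)_2 quoted in the text: one entry from each bucket.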


2.4 Botnet Detection

The first step in containing the botnet threat is detection. Several methods have been proposed in the literature. Comprehensive surveys of these methods have been carried out in [63] and [15], and a taxonomy of detection techniques has been proposed in [80].

2.4.1 Preliminaries

Like Intrusion Detection Systems, Botnet Detection Systems can be classified as signature-

based or anomaly detection.

Signature based: Signature based methods depend on a database of known patterns or ’signa-

tures’ that are compiled from known instances using a certain process. Detection is performed

by repeating the same process that generated the signature to yield a test signature that is then

compared to the database to report an instance if any.

Anomaly Detection: Anomaly detection algorithms model normal behaviour of the system

based on a set of selected features, and flag instances that express abnormal values of one or

more of the selected features. Anomaly based detectors can detect unknown or ’zero-day’ in-

stances, which signature based methods fail to detect.

The above classification is based on the approach to detection. On the basis of the monitoring point, detection systems can also be classified as host-based or network-based.

Host-based Detection: The detection system is deployed on the individual end hosts, and

leverages features based on system logs, system state changes, and/or system call traces to

carry out detection.

Network-based Detection: The detection system is deployed at the periphery of a network, where the communication between external and internal hosts can be passively monitored, and extracts packet and flow based features to discriminate between benign and botnet traffic. Network-based methods are easier to deploy than host-based systems, and can observe the important coordinated network behaviour of botnets. Most of the methods surveyed in the next section are network-based, and a method can be assumed to be network-based unless explicitly specified to be host-based.

2.4.2 Methods of Botnet Detection

Honeypots and Honeynets

A honeypot is a set of hosts that are made intentionally vulnerable, so as to attract attackers.

Honeypots get infected with various malware, and since they are under the control of the se-

curity community, they can be used to infiltrate a botnet, and study it. Honeynets are networks

of honeypots that are used to observe the network behaviour of malware, and can be used to detect botnets on a small scale.

Correlation of various behaviours

A botnet will behave according to the lifecycle in Figure 2.1, exhibiting two or more of its stages. Detection methods can detect the presence of each stage, and these detection events can then be correlated to identify botnet activity. Binkley et al.[2] have proposed an algorithm to detect suspicious IRC botnet channels by identifying member hosts that exhibit TCP scan-like activity. Scan-like activity is detected using a metric called the TCP work weight, the fraction of TCP packets that have the SYN, RESET or FIN flag set out of the total number of TCP packets, which is computed for every IRC channel detected by parsing the payload of IRC packets. This method has three disadvantages: it requires bots to use the IRC protocol for C2C, it relies on deep packet inspection and can thus be easily defeated by encryption, and it cannot detect botnets that do not rely on scans to proliferate.
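As a rough illustration of the work weight as it is described above, the following sketch computes the fraction of control-flagged packets in a list of TCP packets. The packet representation (one set of flag names per packet) and the function name are assumptions made for this example; the exact definition used by Binkley et al. may differ in detail.

```python
CONTROL_FLAGS = {"SYN", "RST", "FIN"}


def tcp_work_weight(packets):
    """Fraction of TCP packets carrying a SYN, RST or FIN flag.

    `packets` is an iterable in which every item is the collection of
    flag names set on one TCP packet, e.g. {"SYN"} or {"ACK", "PSH"}.
    """
    total = 0
    flagged = 0
    for flags in packets:
        total += 1
        if CONTROL_FLAGS & set(flags):
            flagged += 1
    return flagged / total if total else 0.0
```

A host that mostly exchanges data yields a value near 0, whereas a scanning host, whose traffic consists largely of connection attempts, yields a value near 1.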

A host-based approach was devised by Masud et al. [41]. In this method, multiple log files are correlated. Features are extracted from a combination of exedump, which logs application traces, and tcpdump, which logs the network activity of the host. Botnet command flows are categorised into three classes, leveraging both the host-level and the network-level traces to extract a set of flow based features pertaining to the botnet command flows. Several machine learning based classifiers are trained with tagged training data, and the resulting model is used to detect bot activity in untagged test traces.

BotHunter [25] is a detection system that models the botnet lifecycle stages: inbound scanning, inbound infection, egg download, C2C communication and outbound scans. It comprises a payload based anomaly detection engine built on 1-gram character distributions and a TCP scan detection engine, both implemented as extensions to the SNORT Intrusion Detection System, along with some custom rules to detect exploits, egg download and C2C traffic. A correlation engine combines the SNORT alerts and produces a final aggregated report of botnet activity.

A more sophisticated extension to BotHunter called BotMiner was proposed in [24], where the correlation engine in BotHunter is replaced by a clustering algorithm that clusters similar malicious flows. An additional clustering stage groups flows using flow based features computed over an hourly window, such as flows per hour (fph), packets per flow (ppf), bytes per packet (bpp) and bytes per second (bps). The two clusterings are then correlated by checking cluster intersections.

A graph-theoretic framework to isolate botnets was proposed by Jaikumar et al. [29]. This work relies on botnet activity detectors such as SLADE, used in [25] and [24]. A weighted graph is constructed with hosts as nodes, and the existence of an edge and its weight are determined by the common expression of botnet activity. The edge weights are updated over time according to a probabilistic model of the joint activity distribution of the nodes. The weighted graph is then partitioned, using a recursive spectral bisection technique, into clusters of nodes belonging to the same botnet.

Periodicity

Bots will periodically connect to the C2C server to pull commands (for centralized botnets), or periodically ping nodes in their peer list (for distributed botnets).

Botnets need to leverage the Domain Name Service (DNS) to obtain the IP addresses of the command and control servers. Dagon [10] has proposed a method of detecting abnormally high or temporally correlated DNS query rates by employing an outlier detection algorithm based on the Mahalanobis distance. A method devised by Schonewille and Van Helmond [60] detects bots based on recurring Domain Not Found responses, as these could indicate domain names that have been taken down or that were generated by a Domain Generation Algorithm.

Giroire et al. [21] have developed a method to identify command and control flows by exploiting temporal persistence. They operate on the destination end points of packets, extracting 'atoms', each of which is a tuple of service, port and protocol. They monitor persistence using a sliding-window scheme, counting the presence of each atom within each window. An alarm is raised if the persistence is above a threshold.
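The notion of persistence can be sketched as the fraction of observation windows in which an atom is seen at least once. The sketch below is a simplification that assumes consecutive, non-overlapping windows of equal length rather than the sliding scheme of the original work; the event format and all names are hypothetical.

```python
from collections import defaultdict


def atom_persistence(events, start, window, n_windows):
    """Fraction of windows in which each atom appears at least once.

    `events` is an iterable of (timestamp, atom) pairs, where an atom is
    a hashable tuple such as (service, port, protocol). Windows are
    consecutive intervals of length `window` beginning at `start`.
    """
    windows_seen = defaultdict(set)
    for timestamp, atom in events:
        index = int((timestamp - start) // window)
        if 0 <= index < n_windows:
            windows_seen[atom].add(index)
    return {atom: len(seen) / n_windows for atom, seen in windows_seen.items()}


def persistent_atoms(events, start, window, n_windows, threshold):
    """Atoms whose persistence exceeds the alarm threshold."""
    scores = atom_persistence(events, start, window, n_windows)
    return {atom for atom, score in scores.items() if score > threshold}
```

An atom contacted once per hour over four hourly windows has persistence 1.0, while an atom contacted many times within a single hour has persistence 0.25, which is why persistence separates regular C2C polling from bursty user traffic.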

Group activity or temporal correlation

Bots that are part of the same botnet will behave similarly. For instance, in a network that has multiple members of the same botnet, when a command is received, the infected hosts will respond similarly and within a very small time difference of one another.

Strayer et al. [66] proposed an approach to detect IRC C2C channels. The first stage separates chat traffic from other traffic by filtering out unlikely flows. The second stage employs machine learning based classifiers, trained on IRC traffic, to classify flows as IRC or non-IRC using flow characteristics such as flow duration, congestion window size, the average and variance of bytes per packet (bpp), bits per second (bps), packets per second (pps) and the variance in packet inter-arrival times. The final stage identifies temporally correlated flows, as bots of the same botnet will exhibit similar response times. This method is tied to IRC based botnets; furthermore, flow randomization strategies can easily defeat it.

Lu and Ghorbani [39] have presented a two-stage method that relies on a payload based classification followed by a novel cross association algorithm. The algorithm clusters flows with the k-means algorithm using the frequencies of the 256 possible characters in each flow, and returns the cluster with a low standard deviation of the character frequencies, under the intuition that human chat activity is very diverse as compared to bot activity. This approach is tied to IRC based botnets, and the need for deep packet inspection makes it difficult to apply to high speed traffic. Encryption can also easily defeat the features it exploits.

Gu et al. have proposed BotSniffer [26] which performs spatio-temporal correlation of net-

work traffic to detect botnet command and control servers and infected nodes. It focusses on

IRC and HTTP traffic, and identifies if there is communication that exhibits group activity –

where a number of hosts send traffic within a given time using a score computed by a thresh-

old random walk based algorithm. The content similarity between the flows is measured via

n-gram analysis.

Choi et al. have presented BotGAD [4], which exploits the group activity exhibited by bots in making DNS queries. In this work the similarity, periodicity and intensity of botnet DNS query behaviour are exploited. The similarity between querying patterns is computed by standard measures such as the Kulczynski or Jaccard coefficient, and the periodicity by the Euclidean distance; bot hosts are then identified by appropriate thresholding of the three metrics. The method relies only on DNS traffic, and is agnostic to the C2C protocol used by the botnet. However, P2P based botnets need not rely on DNS and may escape detection.

P2P Botnet Detection

Detection of P2P traffic poses several challenges: P2P networks normally employ strong encryption, use random port numbers (including ports reserved for well known services), and try to blend in with normal traffic in order to avoid detection [63]. The methods described in this section detect C2C flows, i.e. the flows of the P2P overlay network that the bots are a part of.

Yen and Reiter have proposed a method to differentiate between file-sharing hosts and bots [78]. This work exploits differences between the two, such as the large data volumes and rapid churn associated with file-sharing hosts and the temporal similarity of bots. To detect temporal similarity, they construct a histogram for each host, and cluster the histograms on the basis of the Earth Mover's Distance.

Zhang et al. have described a method to detect stealthy P2P botnets [81]. Their multi-stage approach first reduces the set of flows by retaining only the flows of nodes that exhibit many failed outgoing connections, then clusters them using flow based features, then filters on the basis of temporal persistence, and finally relies on the overlap of peers among the nodes in the clusters and on traffic similarities to identify P2P bots.

Jiang et al. have proposed a method to detect P2P botnets by discovering flow dependencies in

C2C traffic [32]. Dependencies of flows are extracted by identifying pairs of flows that occur

together many times in a given observation time, and the extracted two-level dependencies are

used to obtain higher level dependencies by combining flows. Flows are then clustered based

on the Jaccard similarity of the extracted flow dependencies.

Coskun et al. have described a graph-based method to identify other members of an unstruc-

tured P2P botnet within a network, when given a known bot[9]. A mutual contacts graph is

constructed, and a dye diffusion from the source node is simulated. Other members of the

same botnet are identified by thresholding on the final dye concentrations on the nodes.

2.5 Detection of Structured P2P Botnets in Large Scale Networks

Dagon et al. [11] carried out a graph-theoretic analysis of the effectiveness, efficiency and robustness of the above topologies, modelling them as random graphs. They concluded that topologies based on structured P2P systems offer good resilience. Davis et al. [13] analysed the performance of unstructured and structured P2P topologies, especially their behaviour under random, tree-like and global information based disinfection strategies, and concluded that structured P2P topologies are ideal mechanisms for botnet command and control, as they provide a good trade-off between efficiency and resilience. Most of the existing botnet detection approaches were designed for and deployed in small networks, such as a campus or an enterprise. Botnets are a global threat to the security of the Internet, and Internet Service Providers (ISPs) have good reason to be concerned, as their bandwidth is being misused; it is thus in their interest to rid the Internet of botnets.

As shown by Davis et al.[13], and as examples like the Storm botnet and the indestructible TDL-4 botnet illustrate, structured P2P based botnets can be a serious threat, and efforts must be made for their detection and removal.

Jelasity et al. showed the limitations of local approaches in the detection of structured P2P botnets[30]. They show that the visibility of P2P botnet traffic can be made very small if botnets adopt certain strategies. They have proposed a new overlay topology based on the existing CHORD topology, but with clusters arranged in such a way that the links touch the smallest possible number of routers. They conclude that automated detection of P2P botnets can be achieved only with cooperation among the major ISPs, and that future research should target the development of large scale P2P detection algorithms.

The primary challenge in deploying a botnet detector at the infrastructure level is the high velocity of data, for which the methods discussed earlier in this chapter will not be effective owing to their lack of scalability.

To handle such large volumes of data, an effective detection method must rely primarily on a reduced or abstracted form of it, such as a graph of hosts in which an edge is present between two hosts if there is any data communication between them. The edges can be unweighted, and independent of the protocol or size of the communication. Such an abstraction is very easy to construct, as very little of each packet needs to be examined, and storage requirements are reduced, as even header information is not stored.
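A minimal sketch of such an abstraction is given below. It assumes that the traffic has already been reduced to (source, destination) pairs, and it keeps only an undirected, unweighted adjacency structure; the function name and the input format are illustrative choices.

```python
def build_host_graph(flows):
    """Undirected, unweighted communication graph from (src, dst) pairs.

    Repeated conversations, direction, protocol and volume are all
    discarded: an edge records only that two hosts communicated.
    """
    adjacency = {}
    for src, dst in flows:
        if src == dst:
            continue                      # ignore self-communication
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency
```

For example, `build_host_graph([("a", "b"), ("b", "a"), ("a", "c"), ("a", "b")])` contains just the two edges a-b and a-c, however many packets were exchanged.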

The important question is whether this data retains enough features to enable the detection of

structured P2P botnets.

BotTrack, proposed by Francois et al. [20] is a method that works on a directed traffic graph

constructed from NetFlow traces, and computes the hub and authority centrality of each host,

and clusters hosts based on these values using the DBSCAN algorithm.

2.5.1 BotGrep

Nagaraja et al. have proposed BotGrep [44]. In this work, structured P2P botnets are differentiated from background traffic using only connectivity features. It exploits the observation that structured P2P botnet subgraphs are fast-mixing, while the subgraph of normal or background traffic is not. The state probability vector $q^t$ associated with random walks on the graph, whose $i$-th component represents the probability of being at vertex $i$ after $t$ steps, converges to the stationary distribution of the graph in a very small number of steps owing to the expansion properties of the botnet subgraph, which are absent in the topology associated with regular client-server traffic. Detection is achieved by a two-step algorithm that comprises a fast prefiltering step and a relatively slower refinement step aimed at removing false positives. In this work, it is assumed that honeynet nodes are available to distinguish between file-sharing traffic and botnets.

• Prefiltering: The first stage of the algorithm runs short random walks of $\log_2(N)$ steps, where $N$ is the number of nodes in the graph. The random walks are computed according to the standard transition probability matrix $P_{ij} = \frac{1}{d_i}$ if there is an edge between $i$ and $j$ in the graph (and 0 otherwise), where $d_i$ is the degree, or number of connections, of node $i$. The state probability vector $q^t$ is computed as $q^t = q^{t-1} P$. The resulting vector is proportional to the degree of each node, so the quantity $s_i = \left(\frac{q^t_i}{d_i}\right)^{1/r}$ ($r$ is an input parameter, assumed to be 100 in the paper), which penalizes the state probabilities of the high degree nodes, is used as a feature vector for the X-means clustering algorithm, which can automatically determine the correct number of clusters. The X-means algorithm searches for the appropriate number of clusters between 2 and $k_{max}$ ($k_{max}$ is an input parameter, assumed to be 20 in the paper). The prefiltering step determines the detection rate of the algorithm.

• Refinement: The cluster from the prefiltering step containing the honeynet nodes is then refined by a recursive application of a bisection algorithm. The bisection algorithm is based on a probabilistic model defined on a set of traces $T$ of random walks of $\log_2(N)$ steps. These traces are the start and end vertices obtained by performing random walks with a special transition probability matrix $P_{ij} = \min\left(\frac{1}{d_i}, \frac{1}{d_j}\right)$ when there is an edge between nodes $i$ and $j$. The probabilistic model assigns a probability of generating the current set of traces from a given set (cut) of nodes. Using Bayes' theorem, the probability that the given set of nodes is a botnet is computed by drawing samples from the probabilistic model using Metropolis-Hastings sampling.
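The random-walk quantities used in the two steps above can be sketched as follows. This is an illustrative reading of the description in the text and not the BotGrep implementation: the uniform starting distribution, the adjacency-set input and the function names are assumptions, and the clustering and sampling stages are omitted.

```python
import math


def prefilter_scores(adjacency, r=100, steps=None):
    """State probabilities q^t and scores s_i = (q^t_i / d_i)^(1/r).

    `adjacency` maps each node to the set of its neighbours; every node
    is assumed to have at least one neighbour. The walk starts from a
    uniform distribution (an assumption) and uses P_ij = 1 / d_i.
    """
    nodes = list(adjacency)
    n = len(nodes)
    if steps is None:
        steps = max(1, math.ceil(math.log2(n)))
    q = {v: 1.0 / n for v in nodes}
    for _ in range(steps):
        nxt = {v: 0.0 for v in nodes}
        for i in nodes:
            share = q[i] / len(adjacency[i])   # mass sent along each edge
            for j in adjacency[i]:
                nxt[j] += share
        q = nxt                                # q^t = q^(t-1) P
    s = {v: (q[v] / len(adjacency[v])) ** (1.0 / r) for v in nodes}
    return q, s


def refinement_weight(adjacency, i, j):
    """Entry P_ij = min(1/d_i, 1/d_j) of the refinement walk, 0 if no edge."""
    if j not in adjacency[i]:
        return 0.0
    return min(1.0 / len(adjacency[i]), 1.0 / len(adjacency[j]))
```

On a connected, non-bipartite graph the ratio q_i / d_i approaches the same value, 1 / (2|E|), for every node once the walk has mixed, so nodes in a fast-mixing subgraph end up with nearly identical scores after only a few steps.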


As discussed in Chapter 1, Nagaraja et al. compared BotGrep[44] to several Community Detection Algorithms (CDAs) and concluded that CDAs would not scale well enough to handle this data. Community Detection Algorithms have received a lot of recent attention, and several algorithms have been proposed that can theoretically handle large graphs. In the next chapter, the popular Community Detection Algorithms are surveyed with special emphasis on scalability, and experiments are carried out to identify a candidate algorithm that can be used to detect structured P2P botnets.

Chapter 3

Community Detection Algorithms

3.1 Introduction

There has been growing interest in network science, and various real world systems have been modelled as graphs or networks. Apart from power law degree distributions and small-world properties, real world networks have also been found to contain tightly knit clusters of nodes, or communities [22]. Community detection (or graph clustering) algorithms aim to detect these communities or clusters of nodes, given the graph.

A structured P2P botnet should have a high internal connectivity among its members so as

to achieve robustness against targeted and random failures. This provides the motivation for

using community detection algorithms in order to detect them.

Over the years there has been tremendous activity in the field of community detection, and a number of methods have been proposed. The aim of this chapter is to survey some of the related work in the literature and to identify a candidate method for structured P2P botnet detection. Graphs and the related terminology used in the rest of the thesis are introduced in Section 3.2. Community structure is introduced and some popular partition quality functions are defined in Section 3.3. Some of the popular classes of Community Detection Algorithms are described in Section 3.4. Section 3.5 aims at identifying a candidate algorithm that can be applied to detect structured P2P botnets.


3.2 Graphs

A graph (or network) $G(V, E)$ consists of a set of vertices (or nodes) $V$ and a set of edges (or links) $E$.

The number of nodes (or order) of the graph is the number of elements of the set $V$ and is denoted by $|V|$.

The number of edges (or size) of the graph is the number of elements of the set $E$ and is denoted by $|E|$.

3.2.1 Edges, Directionality and Weights

An edge is a tuple $(u, v)$ with $u, v \in V$, indicating that there is a connection between node $u$ and node $v$. The edge set satisfies $E \subseteq V \times V$.

In general, edges in a graph have directionality, i.e. an edge from $u$ to $v$ need not imply that $v$ is connected to $u$. A graph is said to be undirected if no edge has directionality, so that there is no difference between $(u, v)$ and $(v, u)$.

A real valued number or weight can be associated with every edge. An unweighted graph has

no number associated with an edge, simply a boolean value of 0 or 1 indicating the presence

or absence of an edge.

Unless mentioned otherwise, all graphs in this thesis can be assumed to be undirected and

unweighted.

3.2.2 Adjacency Matrix, Degree and Transition Probability Matrix

An undirected and unweighted graph can be represented by a symmetric binary valued matrix called the adjacency matrix,

$$A_{ij} = \begin{cases} 1 & \text{if there is an edge between } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$

The number of connections, or degree, of a vertex is given by $d_i = \sum_{j \in V} A_{ij}$.

The transition probability matrix $P$ associated with a graph is given by $P = D^{-1}A$, where $D$ is the diagonal matrix with elements $D_{ii} = d_i$. Each element $P_{ij} = A_{ij}/d_i$ represents the probability that a random walker jumps from vertex $i$ to vertex $j$.
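These three objects can be built directly from an edge list, as in the following sketch. It assumes vertices numbered 0 to n - 1 with no isolated vertices, and it adopts the row-normalised convention in which entry (i, j) of the transition matrix equals A_ij / d_i, so that every row is a probability distribution over the next vertex; the function names are illustrative.

```python
def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on n vertices."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1
    return a


def degrees(a):
    """Degree of every vertex: the row sums of the adjacency matrix."""
    return [sum(row) for row in a]


def transition_matrix(a):
    """Row-normalised matrix with entries A_ij / d_i (no isolated vertices)."""
    d = degrees(a)
    n = len(a)
    return [[a[i][j] / d[i] for j in range(n)] for i in range(n)]
```

For the path 0-1-2, the degrees are [1, 2, 1], and a walker at the middle vertex moves to either end with probability 0.5.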


3.2.3 Degree Distributions and Random Graph Models and Assortativity

The degree distribution of a graph, $P(d_i = k)$, is the probability that a vertex $i$ chosen at random has degree $k$.

3.2.4 Power Laws or Scale Free Networks

Many real world networks follow a power-law, or Pareto, degree distribution, in which the probability that a node has degree $k$ is given by

$$P(d_i = k) = C k^{-\gamma}$$

where $C$ is a constant and $\gamma$ is an exponent that controls the 'skewness' of the distribution. The skewness of the degree distribution results in the presence of hubs, which account for a large share of the edges of the graph. Such networks are also called scale-free.

3.2.5 Random Graph Models

There have been models proposed which aim to generate graphs via a random process. The

popular models are described here.

Erdos-Renyi(ER) Graph

The Erdos-Renyi (ER) model attaches a uniform probability $p$ to every edge, so the probability that an edge exists between any two nodes is $P_{ij} = p$.

The degree distribution of an ER graph is binomial:

$$P(d_i = k) = \binom{|V| - 1}{k} p^k (1 - p)^{|V| - k - 1}$$
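As a quick numerical check of the binomial form, the probabilities can be tabulated for a small graph. The function name is illustrative; only the standard library is used.

```python
from math import comb


def er_degree_pmf(n_nodes, p):
    """P(d_i = k) for k = 0 .. n_nodes - 1 in an ER graph with edge probability p."""
    m = n_nodes - 1                        # possible neighbours of one vertex
    return [comb(m, k) * p ** k * (1 - p) ** (m - k) for k in range(n_nodes)]
```

For |V| = 5 and p = 0.5 the probabilities are 1/16, 4/16, 6/16, 4/16 and 1/16, which sum to one and have mean (|V| - 1)p = 2.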

Configuration Model

The configuration model generates a graph from a given degree sequence, i.e. the degrees of each node. It is a process by which an equivalent random graph for a given graph can be created by rewiring the edges. The rewiring is done by treating each edge as two end stubs; each free stub is then randomly connected to another free stub. The probability of an edge is given by $P_{ij} = \frac{d_i d_j}{2|E|}$.
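The stub-matching process can be sketched in a few lines. In this simplified version the stubs are shuffled and paired off in order, so self-loops and repeated edges may occur and are not removed; the function name and the optional seed argument are illustrative.

```python
import random


def configuration_model(degree_sequence, seed=None):
    """Random edge list in which node i has degree degree_sequence[i].

    Each node contributes as many stubs as its degree; the shuffled stubs
    are then paired off. Self-loops and multi-edges are possible.
    """
    if sum(degree_sequence) % 2 != 0:
        raise ValueError("the degrees must sum to an even number")
    rng = random.Random(seed)
    stubs = [node for node, degree in enumerate(degree_sequence)
             for _ in range(degree)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))
```

Whatever the random pairing, every node keeps exactly the degree it was given, provided a self-loop is counted twice.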

3.2.6 Assortativity

The degree assortativity coefficient $r$ of a graph was proposed by Newman[45] to study the degree-degree mixing patterns in complex networks. These patterns indicate whether an average node in the graph connects to other nodes of similar degree (assortative mixing) or to nodes of dissimilar degree (disassortative mixing). The coefficient is the Pearson correlation coefficient between the degrees of the endpoints of each edge in the graph, and is given by

$$r = \frac{1}{\sigma_q^2} \sum_{jk} jk \left(e_{jk} - q_j q_k\right) \qquad (3.1)$$

where $j$ and $k$ are the degrees of the vertices at either end of an edge, $q_k$ is the distribution of excess degrees, given as $q_k = (k + 1) p_{k+1} / \sum_j j p_j$, $p_k$ is the probability that a randomly chosen vertex has degree $k$, $\sigma_q^2$ is the variance of the distribution $q_k$, and $e_{jk}$ is the joint probability distribution of the remaining degrees of the two vertices at either end of a randomly chosen edge.
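Because the coefficient is a Pearson correlation over edge endpoints, it can be computed directly from an edge list. The sketch below assumes a simple undirected graph given as a list of edges, counts every edge in both directions, and works with full degrees instead of remaining degrees, which leaves the correlation unchanged because the two differ by a constant. The function name is illustrative.

```python
def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of an edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:                     # each edge seen from both ends
        xs.extend([degree[u], degree[v]])
        ys.extend([degree[v], degree[u]])
    m = len(xs)
    mean = sum(xs) / m                     # same mean for xs and ys
    variance = sum((x - mean) ** 2 for x in xs) / m
    if variance == 0:
        raise ValueError("undefined when all endpoint degrees are equal")
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / m
    return covariance / variance
```

A star graph, in which the hub connects only to leaves, is perfectly disassortative (r = -1), while the path on four vertices gives r = -0.5.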

3.2.7 Paths, Connected Components, and Betweenness Centrality

A path is a sequence of edges. A shortest path, or geodesic path, between two vertices $s$ and $t$ is a path that traverses the smallest number of edges needed to reach $t$ from $s$.

A set of nodes $C \subseteq V$ of a graph $G(V, E)$ is a connected component if there is a path from every node in the set to every other node in the set.

The betweenness centrality of a node $i$ is given by

$$B(i) = \sum_{u, v \in V} \frac{\sigma_{u,i,v}}{\sigma_{u,v}}$$

where $\sigma_{u,v}$ represents the number of geodesic paths from node $u$ to node $v$, and $\sigma_{u,i,v}$ represents the number of geodesic paths from node $u$ to node $v$ that pass through node $i$.
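For small graphs the definition can be evaluated directly, as in the sketch below. It assumes an unweighted graph stored as adjacency sets and sums over ordered pairs (u, v) with u, v and i all distinct, so values for an undirected graph are twice those of the unordered convention. It is a straightforward and slow reading of the formula, not an optimised algorithm, and the function names are illustrative.

```python
from collections import deque


def shortest_path_counts(adjacency, source):
    """BFS from source: distance to, and number of geodesics to, each node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a geodesic
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adjacency):
    """B(i) = sum over ordered pairs (u, v) of sigma_{u,i,v} / sigma_{u,v}."""
    info = {s: shortest_path_counts(adjacency, s) for s in adjacency}
    b = {i: 0.0 for i in adjacency}
    for u in adjacency:
        dist_u, sigma_u = info[u]
        for v in dist_u:
            if v == u:
                continue
            for i in adjacency:
                if i == u or i == v or i not in dist_u:
                    continue
                dist_i, sigma_i = info[i]
                # i lies on a u-v geodesic iff the distances add up
                if v in dist_i and dist_u[i] + dist_i[v] == dist_u[v]:
                    b[i] += sigma_u[i] * sigma_i[v] / sigma_u[v]
    return b
```

On the path 0-1-2 only the middle vertex lies between other vertices, and it scores 2 because the pairs (0, 2) and (2, 0) are counted separately.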


3.2.8 Subgraphs, Covers and Partitions

A subgraph $G_s(C, E_s)$ corresponding to a set of nodes $C \subseteq V$ of a graph $G(V, E)$ is the graph consisting of the edges $E_s = \{(u, v) : u, v \in C, (u, v) \in E\}$.

A cover $P$ is a set of subsets $\{C_i \subseteq V\}_{i=1 \cdots k}$ of $V$ such that $\bigcup_{i=1 \cdots k} C_i = V$.

A partition $P$ is a set of subsets $\{C_i \subseteq V\}_{i=1 \cdots k}$ of $V$ such that $C_i \cap C_j = \emptyset \; \forall i \neq j$ and $\bigcup_{i=1 \cdots k} C_i = V$. It is thus a set of mutually disjoint subsets of $V$ whose union gives the entire set $V$.

3.3 Community and Community Structure

3.3.1 Definitions

Community: A community (or cluster or group) can be defined in several ways, and there is

no universally accepted definition. However it can be intuitively understood as a set of nodes

that are densely connected to each other, and relatively sparsely connected to the rest of the

graph.

Community Structure: The community structure of a graph is the set of communities in the graph. It can be represented as a partition, where each node belongs to exactly one community. A discussion of overlapping communities, where a cover of the graph is desired, is out of the scope of this thesis.

A community detection algorithm thus looks to identify a partition P = {C1 · · ·Ck} of

the graph such that the nodes of each community are densely connected to each other, and

relatively sparsely connected to the rest of the graph.

3.3.2 Partition Quality Functions

A partition of the graph P has to be scored so as to quantify its quality. As there is no uni-

versally accepted definition of a community, there have been several measures to quantify the

quality of partitions. Some popular quantifications include Cut, Ratio Cut, Normalized Cut and Modularity.

Figure 3.1: Communities in a Graph (Image from [31])

Cut: The number of inter-community edges in a partition of a graph.

$$ Cut(P) = \sum_{C \in P} \sum_{i \in C, j \notin C} A_{ij} \qquad (3.2) $$

The partition of a graph into two communities that minimizes the cut (the min-cut problem) can be found in polynomial time by computing the max-flow [17]. The problem with this measure is that it takes no account of the internal density of the clusters, which leads to trivial, imbalanced partitions with one node in one cluster and all the other nodes in the other cluster.

Ratio Cut: The ratio cut addresses the tendency of Cut to favour unbalanced partitions by dividing the number of inter-community edges of each community by the product of the size of the cluster and the size of the rest of the graph.


$$ RatioCut(P) = \sum_{C \in P} \frac{\sum_{i \in C, j \notin C} A_{ij}}{|C| \, |V - C|} \qquad (3.3) $$

The disadvantage of this is it does not consider the internal density of the community.

Normalized Cut [62]: Normalized Cut was proposed to account for the density of the cluster

as well by normalizing the number of inter-community edges of each community by the total

degree or volume of the community.

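$$ NormalizedCut(P) = \sum_{C \in P} \frac{\sum_{i \in C, j \notin C} A_{ij}}{Vol(C)} \qquad (3.4) $$

where $Vol(C) = \sum_{i \in C} d_i$ is the total degree of the nodes in C.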

The disadvantage of the above quality functions Cut, Ratio Cut and Normalized Cut is that they all attain their best possible value, 0, when the partition consists of the whole graph as a single community.
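The three cut-based scores can be computed in a single pass over the edges. The following Python sketch assumes an undirected simple graph given as an edge list and a partition given as a list of node sets; the function name `cut_measures` is illustrative. As in Equation 3.2, each inter-community edge is counted once for each of the two communities it touches, and terms with a zero denominator (the trivial one-community partition) are treated as 0.

```python
def cut_measures(edges, partition):
    """Return (Cut, RatioCut, NormalizedCut) for a partition of the nodes."""
    community = {v: idx for idx, c in enumerate(partition) for v in c}
    num_nodes = sum(len(c) for c in partition)
    degree = {}
    boundary = [0] * len(partition)   # inter-community edges per community
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] != community[v]:
            boundary[community[u]] += 1
            boundary[community[v]] += 1
    cut = sum(boundary)
    ratio_cut = normalized_cut = 0.0
    for idx, c in enumerate(partition):
        rest = num_nodes - len(c)
        volume = sum(degree.get(v, 0) for v in c)
        if rest:
            ratio_cut += boundary[idx] / (len(c) * rest)
        if volume:
            normalized_cut += boundary[idx] / volume
    return cut, ratio_cut, normalized_cut
```

On two triangles joined by a single bridge, the natural two-community partition scores (2, 2/9, 2/7), while the trivial partition scores (0, 0, 0), which illustrates the drawback noted above.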

Modularity [48]: This quality function compares the community structure of a graph to a

random graph that is not expected to have any community structure. It compares the number

of internal edges in the graph to the expected number of edges in an equivalent null model. In

its most general form it can be written for a partition P as

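$$ Modularity(P) = \sum_{C \in P} \frac{1}{2|E|} \sum_{i,j \in C} (A_{ij} - P_{ij}) \qquad (3.5) $$

where $P_{ij}$ is the expected number of edges between nodes i and j under a null model. The null model typically used is the configuration model, which is a random graph with the same degree sequence. This leads to the standard definition of modularity

$$ Modularity(P) = \sum_{C \in P} \frac{1}{2|E|} \sum_{i,j \in C} \left( A_{ij} - \frac{d_i d_j}{2|E|} \right) \qquad (3.6) $$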


The advantage of modularity is that it considers both the internal density as well as the external

sparsity of the community.
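For the configuration null model, the double sum over node pairs collapses to two per-community quantities: the number of internal edges and the total degree. A minimal Python sketch under that observation is given below, assuming an undirected simple graph as an edge list and a partition as a list of node sets; the function name `modularity` is illustrative.

```python
def modularity(edges, partition):
    """Standard modularity of a partition of an undirected simple graph."""
    m = len(edges)
    community = {v: idx for idx, c in enumerate(partition) for v in c}
    internal = [0] * len(partition)   # edges with both ends in the community
    volume = [0] * len(partition)     # sum of degrees in the community
    for u, v in edges:
        volume[community[u]] += 1
        volume[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # (1/2m) * sum_{i,j in C} (A_ij - d_i d_j / 2m) = int/m - (vol/2m)^2
    return sum(internal[c] / m - (volume[c] / (2 * m)) ** 2
               for c in range(len(partition)))
```

For two triangles joined by a bridge, the two-community partition has modularity 5/14, whereas placing all nodes in one community gives 0.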

3.4 Community Detection Algorithms

A large number of community detection algorithms (CDA) exist, and several of them have been surveyed in [58, 18, 8].

The important classes of CDA will be surveyed in this section, with an emphasis on the effi-

ciency of the algorithm for large networks. The treatment of algorithms that deal with overlap-

ping communities (i.e when a cover of the graph is desired – a node can belong to more than

one cluster) is beyond the scope of this thesis.

Graph Partitioning Algorithms

This class of algorithms primarily focuses on min-cuts, i.e., partitioning the graph into two or more groups with the smallest number of edges between clusters. Another property of graph partitioning algorithms is that they are generally designed to yield balanced clusters.

The Kernighan-Lin algorithm [35] is one such algorithm. It takes a random bisection of the graph into two equal-sized sets, and swaps vertices between these sets so as to minimize the number of edges between them. The procedure takes O(|V|^2 log |V|) time to yield a bisection of the graph. To break the graph into more than two clusters, the procedure has to be applied recursively. The FM heuristic proposed by Fiduccia et al. [16] reduces this to O(|E|) by considering the movement of a single node to a neighbouring community instead of node swaps.

METIS[34] is a popular graph partitioning algorithm that builds on the FM heuristic[16], and

integrates it into a multi-level algorithm. A multi-level algorithm works on creating a sequence

of coarse graphs obtained by grouping nodes and edges. The METIS algorithm coarsens the

graph by performing edge matching, followed by the FM-heuristic at the coarsest graph recur-

sively to obtain the required number of clusters, followed by expansion of the edges at each

level in the uncoarsening phase, to obtain the final cut.


The main drawback of graph partitioning algorithms is the need to set the number of clusters and a parameter that controls the balance or the sizes of the clusters. This is difficult to do in practice, as multiple runs with different values of the number of clusters will potentially increase the total running time. Thus METIS and other such algorithms are not suitable candidates for automatic community detection in large graphs.

Divisive Algorithms

Divisive algorithms begin with all the nodes of the graph in one community and iteratively divide the communities by edge removal. Girvan et al. proposed the GN algorithm [22], which is a seminal work in community detection. The algorithm involves iterative removal of edges with high edge betweenness. Edge betweenness is defined analogously to the betweenness centrality of a node: it scores an edge high if a large number of shortest paths pass through it. Such edges with high edge betweenness are likely to be bridges or inter-community edges. A hierarchical clustering is obtained that can be pictorially represented as a dendrogram. The appropriate partition can be obtained by cutting the dendrogram at some level, based on a partition scoring function such as modularity [48]. This method runs in O(|E|^2 |V|) time for unweighted graphs, because the edge betweenness is recomputed O(|E|) times and each recomputation involves an evaluation of shortest paths from all sources, which takes O(|E| |V|) time. This is much too slow for even medium-sized graphs.

Radicchi et al. [53] proposed a divisive algorithm that removes edges with a low edge clustering coefficient, which quantifies the number of triangles an edge is a part of. The intuition is that inter-community edges will participate in fewer triangles than intra-community edges. The edge clustering coefficient is a local measure and, unlike all-pairs shortest paths, can be computed efficiently. The algorithm again produces a hierarchical clustering.

Owing to their high time complexity, divisive algorithms are not good candidates to detect

communities in large graphs.


Spectral Algorithms

Spectral algorithms rely on the projection of the graph into an eigenspace, and on the optimization of a relaxed version of the combinatorial optimization problem. A detailed introduction to spectral clustering can be found in [74].

Optimizing the Ratio Cut exactly is NP-hard. An approximate solution can be obtained using spectral methods. The eigenvector corresponding to the second smallest eigenvalue (also known as the Fiedler vector) of the graph Laplacian L = D − A, where D is the diagonal matrix of node degrees, can be used to partition the graph into two groups such that the ratio cut is approximately minimized. The eigenvector can be computed in O(|V|^2 log |V|) time using the Lanczos method [38]. The bipartition is obtained by assigning all the nodes corresponding to positive components of the eigenvector to one group, and those corresponding to negative components to the other. For more than two clusters, the procedure can be repeated recursively on each cluster.

The problem of optimizing the Normalized Cut exactly in order to obtain communities is NP-

complete [62]. A spectral method based on the first k eigenvectors of the transition probability

matrix P can be used to obtain k communities that have minimum normalized cut. The k eigen-

vectors once obtained are used as feature vectors for the k-means clustering algorithm[40],

which provides the final communities.

The main drawback of the above spectral methods is their time complexity, as the computation of the eigenvectors is slow. Another important drawback is the need to set the number of clusters, though heuristics that consider the largest differences between successive eigenvalues can be used [74].

Dhillon et al. [14] proved the equivalence of spectral clustering and weighted kernel k-means, and proposed a multi-level algorithm called GraClus to produce a clustering that optimizes the normalized cut. This brings down the time complexity to O(|E|). However, a key limitation of this and the other spectral algorithms is the need to specify the number of clusters.

Modularity can be optimized by a spectral relaxation as well. This was proposed by Newman[47].

Similar to the existing spectral methods, the leading eigenvector of the modularity matrix,

given by

$$ B_{ij} = A_{ij} - \frac{d_i d_j}{2|E|} $$

can be used to partition the graph into two communities based on the signs of its components. The communities are further divided recursively by the same method, stopping when a division would decrease the overall modularity of the partition. The original work also proposed refining each bisection with a modified version of the Kernighan-Lin method [35] that optimizes modularity. The spectral method has been demonstrated to produce partitions of very high modularity. It scales as O(|V| |E|), and can be used only for medium-sized graphs.

Random Walk Based Algorithms

Intuitively, owing to the large number of paths through internal vertices, a random walker will spend more time within a community than outside it. This fact can be used to partition a graph into communities.

Early random walk based algorithms were modifications of the divisive edge removal algorithms. An edge betweenness measure based on random walks was defined by Newman et al. [46]; it takes O(|V|^3) time to compute, which is prohibitive for large graphs.

Pons and Latapy proposed Walktrap [49], an agglomerative hierarchical clustering algorithm based on a distance measure that is computed by performing short random walks. The distance between a pair of nodes is the Euclidean distance between the rows of the transition probability matrix corresponding to the two nodes. The agglomeration is done using Ward's method, and the dendrogram is cut at the level that has maximum modularity. The method scales as O(|V|^2 log |V|).

Van Dongen proposed the Markov Clustering Algorithm (MCL) [71], which operates on the transition probability matrix P = AD^{-1} of a graph. It relies on the iterative application of two operators on P: expansion and inflation. The expansion operator raises P to a specified power t, so that the matrix gives the probabilities of a random walker reaching a vertex j from i in t steps. The intuition is that for small values of t, the random walker is likely to still be within its community. The inflation operator squares each element of the matrix and renormalizes the result to obtain a stochastic matrix again. The inflation step is meant to enhance the probabilities to intra-community vertices, allowing the random walker to spend a longer time within the community. Intuitively, the whole process iteratively spreads flow on the graph, enhancing the probability for flow to accumulate within communities and diminishing the probability of flow travelling along inter-community edges. On convergence, the connected components of the graph corresponding to the final matrix are the communities. A naive implementation of the algorithm takes O(|V|^3) time; an optimization that retains only the top k non-zero elements at each step reduces this to O(|V| k^2) time. A multilevel version of this algorithm, MLR-MCL, was proposed by Satuluri et al. [57]. It further speeds up the computation by performing MCL only on the coarsest graph, allowing MCL to be used for larger graphs.

Rosvall and Bergstrom proposed Infomap [56], which partitions nodes by finding an optimal compression of random walk paths on the network. This is done by minimising the average description length of random walks. The authors originally used a simulated annealing based algorithm, which is too slow to handle large graphs, but a recursive greedy method was also proposed to optimize this objective in O(|E|) time.

Label Propagation

Raghavan et al. proposed a simple algorithm that detects community structure by propagating labels iteratively [54]. Initially all nodes have different labels. Then, in each iteration, each node adopts the label that the majority of its neighbours have. The order in which the nodes are processed is randomized in each iteration. The algorithm scales as O(|E|). Label Propagation finds a partition that is a local optimum of $\sum_{C \in P} \frac{1}{2|E|} \sum_{i,j \in C} A_{ij}$, which is modularity with a null model of $P_{ij} = 0 \; \forall i, j$, whose global optimum corresponds to all vertices in one community [69].
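The label update rule can be sketched in a few lines of Python. The version below is a minimal illustration and not the reference implementation of [54]: the function name, the `seed` and `max_sweeps` parameters, and the tie-breaking rule (a node keeps its current label whenever that label is among the most frequent ones, otherwise it picks a most frequent label at random) are assumptions made for this example.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """adj: dict mapping each node to a list of neighbours. Returns labels."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts with its own label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                # random processing order per sweep
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[w] for w in adj[v])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[v] in candidates:
                continue                  # current label is already a majority label
            labels[v] = rng.choice(candidates)
            changed = True
        if not changed:                   # every node agrees with its neighbours
            break
    return labels
```

Each sweep touches every edge a constant number of times, which is consistent with the O(|E|) scaling stated above when the number of sweeps is small.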

Greedy Modularity Optimization Algorithms

Modularity and other objective functions can also be optimized using methods such as Simu-

lated Annealing, Genetic Algorithms, Extremal Optimization[18], and achieve solutions very

close to the global optimum. However these methods are very slow, and unsuitable for large

graphs. For handling large graphs, greedy algorithms that can quickly provide approximate


solutions have been proposed.

The CNM algorithm proposed by Clauset et al. [6] does a hierarchical optimization of modularity. All nodes start in isolated communities, and in each stage the pair of neighbouring communities whose merger gives the best increase in modularity is merged. The algorithm scales as O(|E| log^2 |V|).

A faster multi-level algorithm, popularly known as the Louvain method, was proposed by Blondel et al. [3]. It begins with each node in its own community and, instead of merging communities, iteratively moves each node to the community that gives the best increase in modularity. On convergence, each community is collapsed to form a node, leading to a smaller graph. This smaller graph is then subjected to the same method to obtain communities for the next stage. The algorithm runs in O(|E|) time and scales to very large graphs.

3.5 Efficiency Comparison of Community Detection Algorithms

In order to handle the large graphs typically associated with traffic, the algorithm must scale as O(|E|) or similar. Among the algorithms discussed in Section 3.4, the Louvain method, Label Propagation, Infomap and MLR-MCL are suitable candidates.

In order to test and compare the running times of the algorithms, benchmark graphs proposed by Lancichinetti et al. [37] are generated and used to test the CDA.

3.5.1 Dataset - LFR Benchmarks

The LFR benchmark graphs involve generating a network whose degrees follow a scale-free

or a Pareto distribution like most real world networks[7]. Edges are added according to the

configuration model. In the next step, nodes are assigned to communities randomly such that

the community sizes also follow a Pareto/Power Law distribution. To create the final bench-

mark each inter-community edge is rewired to form an intra-community edge with probability

1 − µ, where µ is the mixing parameter. We generate LFR graphs of increasing sizes and compare the running times of the selected algorithms in Figure 3.2.

Figure 3.2: Comparison of CDA on LFR Benchmarks

Figure 3.3: Comparison of CDA on LFR Benchmarks - Louvain vs CNM

3.5.2 Discussion and Algorithm Selection

It can be seen in Figure 3.2 that although Label Propagation (LP) scales as O(|E|), its constants are not as good as those of the Louvain method, which is about 50 times faster. The MLR-MCL algorithm is faster than Label Propagation, but is still slower than the Louvain method by a factor of about 15. Though the Infomap method is loosely based on a method similar to Louvain, it has an additional step where it recursively applies the technique to the clusters obtained, making it slower than the Louvain method by a factor of about 10.

The CNM algorithm (discussed in Section 3.4) was the fastest CDA tested by BotGrep [44]. Figure 3.3 compares the running time of the Louvain method with that of the CNM method. The Louvain method is found to be over 200 times faster than the CNM method. This further reaffirms the motivation of finding scalable CDA for Structured P2P Botnet Detection.

Thus the results indicate that the Louvain method is the most efficient among the candidate

algorithms, and it is potentially a good candidate for application to detect structured P2P bot-

nets.

3.5.3 Conclusion

In this chapter various terminologies related to graphs have been defined. Community Struc-

ture was described, and several partition quality functions were reviewed. A brief survey of

the various classes of Community Detection Algorithms was provided. A candidate set of scalable Community Detection Algorithms for the detection of Structured P2P Botnets was selected, and their efficiency was compared on the standard community detection benchmarks. The Louvain method was found to be the fastest among this set, and was identified as a potentially viable algorithm to detect Structured P2P botnets in terms of efficiency. The next chapter further explores the suitability of the Louvain method by applying it to datasets for Botnet Detection generated according to [44].

Chapter 4

Identifying Structured P2P Botnets using

the Louvain Method

4.1 Introduction

Community Detection Algorithms were introduced in Chapter 3 which surveyed various pop-

ular Community Detection Algorithms (CDA). A set of candidate algorithms that could the-

oretically scale well enough to be able to handle the large traffic graphs constructed from

backbone level traces were identified. These candidate algorithms were then tested on the

standard benchmarks for Community Detection Algorithms and the Louvain method emerged

as the fastest algorithm among them. The aim of this chapter is to apply the Louvain method

in order to identify structured P2P botnets. This will be done using only the topological infor-

mation contained in an undirected, unweighted graph constructed from packet traces.

The rest of this chapter is organized as follows. Section 4.2 describes the Louvain method in detail. Section 4.3 describes the procedure used to generate graphs with embedded structured P2P botnets for the purpose of experimentation. Section 4.4 describes an application of the Louvain method to the generated data and discusses its applicability to the detection of structured P2P botnets. Section 4.5 describes the resolution limit of modularity optimization and the objective function Stability. The results of the application of Stability optimization by the Louvain method on the generated datasets are discussed in Section 4.6.

4.2 The Louvain Method

The Louvain method is a greedy modularity optimization algorithm proposed by Blondel et al. [3]. Modularity, as already discussed in Section 3.3, is given by

$$ Q_{Modularity}(P) = \frac{1}{2|E|} \sum_{C \in P} \sum_{i,j \in C} \left( A_{ij} - \frac{d_i d_j}{2|E|} \right) $$

The method is a multi-level (multi-stage) iterative process. It consists of two main steps: a modularity optimization step and a community aggregation step. The algorithm is defined in Algorithm 1 and illustrated in Figure 4.1. In each level or stage, the objective function is greedily optimized by iteratively moving single nodes to the communities that yield the highest gain. The greedy optimization results in a local maximum of the objective function. The communities obtained are then collapsed into a graph on which the same greedy optimization is run to merge communities. This allows the algorithm to climb out of the local maximum.

Figure 4.1: The Louvain Method (Image from [3])


Algorithm 1: The Louvain Method
Input : The graph G
Output: The partition P of G into communities
begin
    OldGraphSize ← |V|
    Gw ← G
    P ← φ
    while OldGraphSize ≠ |V_Gw| or |V_Gw| = |V| do
        OldGraphSize ← |V_Gw|
        Ptemp ← GreedyOptimizationQModularity(Gw)
        P ← PartitionExpansion(G, Gw, Ptemp)
        Gw ← CommunityAggregation(G, P)
    end
end

4.2.1 Greedy Modularity Optimization

Modularity (discussed in Section 3.3) is optimized by a greedy procedure in each stage of the Louvain method. In this step, each node of the graph is initially assigned to its own community. Then, iteratively, each node is removed from its community and moved to the community that provides the maximum gain in modularity. This is outlined in Algorithm 2.

The time complexity of the algorithm is determined by the gain computation step. If the

computation of gain can be done in constant time, the time complexity of this step is O(|E|).

It was shown in [3] that the computation of gain in modularity for the movement of one node

i from its community to the community of a neighbouring node j is given by

$$ \Delta Q_{Modularity}(i, j) = \frac{1}{|E|} \left( IntEdgesNode(C_j, i) - IntEdgesNode(C_i, i) + \frac{d_i}{2|E|} \left( Vol(C_i) - Vol(C_j) - d_i \right) \right) \qquad (4.1) $$


Algorithm 2: GreedyOptimizationQModularity
Input : The graph G
Output: The partition P of G into communities
begin
    initialize P such that each node is in its own community
    Qprev ← −∞
    Qcurrent ← Modularity(P)
    while Qcurrent > Qprev do
        Qprev ← Qcurrent
        for node i ∈ V do
            MaxGain ← 0
            for node j adjacent to i do
                MaxGain ← max(MaxGain, ∆QModularity(i, j))
            end
            if MaxGain > 0 then
                move the node i to the community that results in MaxGain
            end
        end
        Qcurrent ← Modularity(P)
    end
end

In Equation 4.1, $Vol(C) = \sum_{i \in C} d_i$ is the sum of the degrees of the nodes in the community C and $IntEdgesNode(C, i) = \sum_{k \in C} A_{ik}$ is the number of edges node i has to other nodes in community C. The Vol(C) term can be maintained for each community in O(|P|) space and updated in constant time, while IntEdgesNode(C, i) can be computed during the pass over the neighbours of node i, thus enabling the gain to be computed in O(1) time. This implies that the greedy step scales as O(|E|) when used to optimize modularity.
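The gain of a single move can be sketched as follows. This Python example derives the gain directly from the modularity definition of Section 3.3, treating the move as the removal of node i from its community followed by its insertion into the target community. It assumes an undirected simple graph without self-loops; the function name `move_gain` and the dictionary-based bookkeeping are illustrative choices.

```python
def move_gain(adj, community, volume, num_edges, i, target):
    """Change in modularity when node i moves to the community `target`.

    adj:       dict node -> list of neighbours
    community: dict node -> community id
    volume:    dict community id -> sum of degrees of its members
    """
    source = community[i]
    if source == target:
        return 0.0
    d_i = len(adj[i])
    # Edges from i into each of the two communities, found in one pass
    # over the neighbours of i.
    to_target = sum(1 for k in adj[i] if community[k] == target)
    to_source = sum(1 for k in adj[i] if community[k] == source)
    # volume[source] still includes d_i, hence the extra "- d_i".
    null_term = d_i * (volume[source] - volume[target] - d_i) / (2 * num_edges)
    return (to_target - to_source + null_term) / num_edges
```

For two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3), moving node 2 from the community {2, 3, 4, 5} into {0, 1} gives a gain of 23/98, which equals the difference between the modularity values of the two partitions.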

4.2.2 Community Aggregation

The application of just the objective optimization step results in solutions of a large number of

small communities as a local optimum is reached. In order to climb out of the local optimum,

the small communities must be merged.


In this step, each community in the partition is collapsed into a node. A weighted graph is then

constructed with the collapsed communities, with self loops added to account for the internal

edges of each community. Edges connecting several nodes of the same community to the same

neighbouring community are collapsed by setting the edge weight as the number of such edges.

This step can be implemented efficiently in O(|E|) time, as all the edges have to be traversed only once. Thus the overall time complexity of the Louvain method is O(|E|), as the number of iterations is usually small.
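The aggregation step amounts to one pass over the edges. The sketch below is a minimal Python illustration that assumes integer community identifiers and an edge-weight dictionary; the function name `aggregate` is hypothetical. As a convention for this example, the weight of a self-loop is the total weight of the edges internal to that community.

```python
from collections import defaultdict


def aggregate(weighted_edges, community):
    """Collapse each community into a single node.

    weighted_edges: dict {(u, v): weight} of an undirected graph
    community:      dict node -> integer community id
    Returns a dict {(c1, c2): weight} with c1 <= c2; (c, c) is a self-loop.
    """
    collapsed = defaultdict(int)
    for (u, v), weight in weighted_edges.items():
        cu, cv = community[u], community[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        collapsed[key] += weight          # parallel edges add up their weights
    return dict(collapsed)
```

Because the output has the same form as the input, the result can be fed to the next stage with every collapsed node placed in its own community.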

4.3 Dataset Generation

In order to evaluate the applicability of the Louvain method for structured P2P botnet detection,

it is tested on graphs constructed from real world traffic traces of different durations with

synthetic botnet topologies embedded in them. The construction of the background and botnet graphs and the embedding follow the procedure used by Nagaraja et al. [44] in their evaluation of BotGrep. This dataset generation process is described in this section.

4.3.1 Network Traces

Network traces either include dumps or logs of every packet sent, or are in the NetFlow format. NetFlow [5] is a format for exporting network data. A router capable of exporting data in the NetFlow format aggregates all the packets with the same 5-tuple of source IP address, destination IP address, source port, destination port and transport layer protocol (TCP or UDP), i.e., a flow, and reports this along with the timestamps of the duration of the flow, the number of packets and the number of bytes sent. NetFlow traces captured at the core routers of the Abilene ISP, and packet traces captured at an Internet Point of Presence (PoP) provided by CAIDA, are used for evaluation.


Abilene NetFlow Traces

These are NetFlow traces captured at three core routers of the Abilene ISP (now the Internet2 Network) located at Salt Lake City UT, Washington D.C. and Chicago IL. The data was captured over 1 day on the 1st of December 2008 and has no payload information. The three traces will be referred to as SALT, WASH and CHIC. Apart from the entire 1 day trace, a 1 hour slice is also extracted. The data is aggregated at the /24 level, i.e., the last 8 bits of each Internet Protocol (IP) address have been zeroed out for anonymization purposes.

CAIDA Traces

This is a larger set of packet traces captured at an OC192 (10 Gbps) Internet Point of Presence (PoP). The data was captured for 1 hour between 13:00 and 14:00 on the 17th of February 2011 on a backbone link between Chicago and San Jose as a part of the 2011 Anonymized Internet Traces Dataset [75], and contains no payload information.

4.3.2 Background Graph Construction and Properties

The graph is constructed from only the information extracted from the Internet Protocol(IP)

layer of the network traces. Only the source address and destination address fields of each

packet of the trace are parsed. The nodes are IP addresses, and there is an edge added between

nodes A and B if there is a packet sent from an IP address A to IP Address B or vice-versa,

i.e the directionality and the number of packets, number of bytes, the port numbers etc. are

ignored to create an undirected and unweighted graph with no self loops and no multiple edges.
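The construction described above can be sketched in Python as follows. The input is assumed to be an iterable of (source address, destination address) pairs already parsed from the trace; the parsing itself and the function name `build_background_graph` are outside the original text and are used here only for illustration.

```python
def build_background_graph(packets):
    """Build an undirected, unweighted simple graph from (src, dst) pairs."""
    edges = set()
    for src, dst in packets:
        if src == dst:
            continue                      # no self loops
        # Store each edge in a canonical order so that direction and
        # repeated packets are ignored.
        edges.add((src, dst) if src < dst else (dst, src))
    return edges
```

Using a set of canonically ordered pairs guarantees that multiple packets between the same two hosts, in either direction, produce a single edge.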

Table 4.1 contains the details of the graphs extracted from the network traces described in Section 4.3.1. These graphs will be used as the background graphs for the rest of the chapter and the thesis. The mean degree of the graph, $\mu(G) = \frac{1}{|V|} \sum_{i \in V} d_i$, is also computed.


Background Graph        |V|          |E|    Mean Degree
WA-1H                119203       967002          16.22
CH-1H                206390      1639800          15.89
WA-1D                217504      5188343          47.71
CH-1D                297732     12156830          81.66
CA-CH-1H            2716915      8867080           6.53
CA-SJ-1H            8427335     28109401           6.67

Table 4.1: Properties of Graphs extracted from the Network Traces
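In an undirected graph every edge contributes to the degree of two nodes, so the mean degree equals 2|E| / |V|. The short Python check below, with the values copied from Table 4.1 and an illustrative function name, reproduces the last column of the table from the first two.

```python
# (|V|, |E|, mean degree) as listed in Table 4.1
TABLE_4_1 = {
    "WA-1H": (119203, 967002, 16.22),
    "CH-1H": (206390, 1639800, 15.89),
    "WA-1D": (217504, 5188343, 47.71),
    "CH-1D": (297732, 12156830, 81.66),
    "CA-CH-1H": (2716915, 8867080, 6.53),
    "CA-SJ-1H": (8427335, 28109401, 6.67),
}


def mean_degree(num_nodes, num_edges):
    """Mean degree of an undirected graph: the degrees sum to 2|E|."""
    return 2 * num_edges / num_nodes
```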

Densities of the Background Graphs

The Abilene graphs (mean degree > 15) are much denser than the CAIDA graphs (mean degree ≈ 6.5). This is due to the /24 aggregation, which results in a smaller number of hosts. This density difference is important, as Community Detection Algorithms roughly depend on the ratio of the internal edges of a community to its external edges. The above datasets thus provide background graphs with a wide range of densities for evaluating the applicability of the Louvain method. Among the Abilene graphs, the 1 hour and the 1 day traces also differ in density: the 1 hour traces have mean degree < 17, while the 1 day traces have mean degree > 40.

4.3.3 Structured P2P Graph Generation

Since the focus is on modelling and exploiting the connectivity features of only the Command and Control (C2C) communication of a structured P2P botnet, we generate graphs of the popular Distributed Hash Tables (DHT) KADEMLIA, CHORD and KOORDE (discussed in Section 2.3.2) based on the routing table stored at each peer, i.e., an edge is added between two hosts A and B if A has an entry for B in its routing table, as done in [44]. Graphs of the three topologies with 1000, 10000 and 100000 nodes are generated for the analysis.

CHORD: For a graph of N nodes, links are added using the rule that each node i connects to nodes i − 1 and i + 1 to complete the ring, and has long-range links to nodes $(i + 2^k) \bmod N$ for $k = 1 \ldots (\log_2 N) - 1$. Thus each node has a degree of the order of $\log_2 N$ peers.
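The CHORD rule can be sketched in Python as follows. This example assumes that N is a power of two so that log2 N is an integer, and it stores the links as undirected edges; both choices, and the function name `chord_graph`, are assumptions made for the illustration.

```python
def chord_graph(n):
    """Undirected edge set of a CHORD-like topology on nodes 0..n-1."""
    bits = n.bit_length() - 1
    if n < 4 or (1 << bits) != n:
        raise ValueError("this sketch assumes n is a power of two, n >= 4")
    edges = set()
    for i in range(n):
        # Ring links to the predecessor and the successor.
        targets = [(i - 1) % n, (i + 1) % n]
        # Long-range links for k = 1 .. log2(n) - 1.
        targets += [(i + 2 ** k) % n for k in range(1, bits)]
        for t in targets:
            edges.add((min(i, t), max(i, t)))
    return edges
```

For n = 16 the offsets are 1, 2, 4 and 8, giving 16 + 16 + 16 + 8 = 56 distinct undirected edges.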

KOORDE: A de Bruijn graph of constant degree 10 is generated to represent the KOORDE topology. A graph of $10^L$ nodes, consisting of all strings of length L over an alphabet of 10 characters, is generated, and an edge is added from one string to another if they differ by exactly one character in any single location.

KADEMLIA: In order to construct the topology graph, for every node, the partition of the space into buckets is generated according to Section 2.3.2, and an edge is added between the node and a randomly chosen node in each bucket.

4.3.4 Embedding the Botnet

In this analysis, a single botnet graph is embedded in one of the background graphs to create

a single dataset. The botnet graphs are embedded in the background graph by mapping each

node to a node in the background uniformly at random. Then the edges in the botnet graph

are added, in addition to the edges of the background graph for all the nodes in order to create

the final superimposed graph. Thus each bot node will also have edges to other benign nodes

in addition to other bots. The mapping of bot nodes to the background serves as ground truth

for validation. An example superimposed graph is depicted in Figure 4.2. For our experiments

botnet graphs of different sizes described in Section 4.3.3 are embedded in the different back-

ground graphs described in Section 4.3.2 to create the final datasets for analysis. The number

of distinct datasets (One botnet graph per one background graph) is equal to 6 (Number of

Background Graphs)∗6 (Number of Botnet Topologies) = 36.
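The embedding procedure can be sketched in Python as follows. The sketch assumes that distinct bots are mapped to distinct background hosts (sampling without replacement), which the text does not state explicitly; the function name `embed_botnet` and the `seed` parameter are also illustrative.

```python
import random


def embed_botnet(background_edges, botnet_edges, seed=0):
    """Superimpose a botnet graph on a background graph.

    Returns (combined_edges, bot_hosts); bot_hosts is the ground truth.
    """
    rng = random.Random(seed)
    background_nodes = sorted({u for e in background_edges for u in e})
    bot_nodes = sorted({u for e in botnet_edges for u in e})
    # Assumption: each bot is assigned a distinct host, chosen uniformly.
    hosts = rng.sample(background_nodes, len(bot_nodes))
    mapping = dict(zip(bot_nodes, hosts))
    combined = {tuple(sorted(e)) for e in background_edges}
    for u, v in botnet_edges:
        combined.add(tuple(sorted((mapping[u], mapping[v]))))
    return combined, set(hosts)
```

Botnet edges that coincide with existing background edges are merged, so the combined graph remains simple and unweighted.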

Dataset Naming Convention

A single dataset or instance consists of a background graph described in Table 4.1, and a single

botnet topology from Section 4.3.3, with the embedding done as described in the previous

section. The naming convention followed for the rest of the thesis is

<Background Graph>-<Botnet Topology (CHO|KOO|KAD)>-<Botnet Size>


Figure 4.2: An example of a graph with a superimposed botnet (Image from [44])

Thus a dataset named as CH-1H-CHO-1000 implies that the dataset is a 1000 node CHORD

graph embedded in a background graph extracted from the 1 hour trace captured at the CHIC

router from the Abilene ISP.

4.4 Application of the Louvain method to identify structured P2P botnets

In this section the Louvain method is applied to detect communities in the graphs generated

according to Section 4.3. BotGrep [44] (described in Section 2.5.1) is also run on the same

datasets with the input parameters set according to the reference.

4.4.1 Datasets and Evaluation

Datasets

The Abilene 1 Hour and 1 Day Traces are considered for this experiment. Thus the WASH-1H,

CHIC-1H, WASH-1D, CHIC-1D graphs are used as the background (Section 4.3.2). Botnets

of each of the three topologies of sizes 1000 and 10000 (Section 4.3.3) are embedded, which corresponds to the percentage of botnet nodes being within 0.1%-8%. Thus a total of 24 datasets (4 background graphs × 6 botnet graphs) are generated and are named according to the

convention in Section 4.3.4.

Metrics

The Louvain method is an unsupervised method and outputs a set of communities. In order to identify the botnet community, as in [44] it is assumed that honeypot information, i.e. a set of known bots, is available, and that the honeypot nodes belong to the cluster with the largest number of bots.

Thus the purity of the cluster C with the largest number of bot nodes is evaluated. Since there is only one botnet graph embedded in the background graph, it becomes a binary classification problem. Let B be the set of actual bots (obtained from ground truth) and C be a detected cluster; then the accuracy is quantified by Precision, Recall and FScore.

$$\text{Precision}\; P = \frac{|B \cap C|}{|C|} \tag{4.2}$$

Precision represents the purity of each community by considering the fraction of botnet nodes

to the total nodes in the community.

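$$\text{Recall}\; R = \frac{|B \cap C|}{|B|} \tag{4.3}$$

Recall quantifies the detection rate, i.e. the fraction of all bots that are identified.

$$\text{FScore}\; F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.4}$$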

FScore is the harmonic mean of Precision and Recall and summarises both values.
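A minimal sketch of how these three measures could be computed from the ground truth and the detected communities is given below. The function names are hypothetical; Precision and Recall are computed from the overlap between the bot set B and the cluster C, and the cluster is selected as the community containing the largest number of bots.

```python
def best_bot_cluster(bots, communities):
    """Return the community that contains the largest number of bots."""
    bots = set(bots)
    return max(communities, key=lambda c: len(bots & set(c)))


def precision_recall_fscore(bots, cluster):
    """Precision, Recall and FScore of a cluster against the bot set."""
    bots, cluster = set(bots), set(cluster)
    hits = len(bots & cluster)  # bots that fall inside the cluster
    precision = hits / len(cluster) if cluster else 0.0
    recall = hits / len(bots) if bots else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore
```

For example, with bots {1, 2, 3, 4} and communities {1, 2, 3, 9} and {4, 5, 6}, the first community is selected and all three measures equal 0.75.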

4.4.2 Results and Discussion

The Louvain method is run on each of the 24 graphs considered, and the Precision, Recall and FScore for the cluster with the largest number of bots are computed. BotGrep is also implemented and run on the same datasets. The input parameters are set according to the associated publication [44]. The cases of the 1 hour and 1 day background graphs are discussed separately.

Abilene 1 Hour Trace Graphs

Tables 4.2a and 4.2b show the values of Precision and Recall of the Louvain Method and

BotGrep for the different botnet graphs embedded in the background graphs WA-1H and CH-

1H extracted from the 1 Hour Abilene Traces. Figures 4.3a and 4.3b plot the FScore of the

Louvain Method and BotGrep. The measures Precision, Recall and FScore are all computed

according to Section 4.4.1.

In the case of the WASH-1H datasets, Figure 4.3a indicates that the Louvain method performs only slightly worse than BotGrep in terms of FScore: BotGrep achieves an FScore of 0.95, while the Louvain method achieves about 0.9 in most cases. Table 4.2a indicates that the Recall of the Louvain method (0.92-0.96) is comparable to that of BotGrep (Recall between 0.94-0.98). The lower FScore achieved by the Louvain Method is because of its lower Precision (0.9-0.95) as compared to BotGrep, which has Precision 1.0 in the majority of the datasets.

In the CHIC-1H datasets, Figure 4.3b indicates that BotGrep achieves poor FScores of less than 0.1 in some cases. This, in comparison to its performance on WASH-1H (Figure 4.3a), indicates that the algorithm is very sensitive to the input parameters. Figure 4.3b also indicates that the Louvain method again achieves FScores of about 0.9 in most cases, as in WASH-1H. It can be observed from Table 4.2b that BotGrep performs very poorly owing to small values of Recall (< 0.5 in most cases). In the case of the Louvain method, the Precision as well as the Recall is > 0.9 in most cases.

Thus in the WASH-1H datasets, the performance of the Louvain method is comparable to that

of BotGrep and in the CHIC-1H datasets, the performance of the Louvain method is found to

be better than that of BotGrep. Thus in the case of the Abilene 1 Hour background graphs, the Louvain method is able to identify the embedded structured P2P botnets and its performance is found to be comparable to or better than that of BotGrep.

Figure 4.3: Performance of the Louvain Method on Abilene Trace Graphs. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.

Abilene 1 Day Trace Graphs

In the case of the WASH-1D datasets, Figure 4.3c indicates that BotGrep outperforms the Louvain Method. BotGrep achieves an FScore > 0.85 in most cases, while the Louvain method performs very poorly, achieving only about 0.2 in most cases. Table 4.2c indicates that the poor performance of the Louvain method is due to its Precision values (< 0.2 in all cases), as opposed to BotGrep, which has Precision 1.0 in most cases. According to Table 4.2c, the Recall of the Louvain method (0.85-0.89) is higher than that of BotGrep (0.61-0.80).

| Botnet | %Bots | BotGrep P | BotGrep R | Louvain P | Louvain R |
|---|---|---|---|---|---|
| CHO-1000 | 0.84 | 1.00 | 0.95 | 0.90 | 0.96 |
| CHO-10000 | 8.39 | 1.00 | 0.95 | 0.95 | 0.95 |
| KOO-1000 | 0.84 | 1.00 | 0.97 | 0.90 | 0.95 |
| KOO-10000 | 8.39 | 0.99 | 0.94 | 0.96 | 0.93 |
| KAD-1000 | 0.84 | 1.00 | 0.96 | 0.90 | 0.93 |
| KAD-10000 | 8.39 | 0.99 | 0.95 | 0.84 | 0.92 |

(a) WA-1H

| Botnet | %Bots | BotGrep P | BotGrep R | Louvain P | Louvain R |
|---|---|---|---|---|---|
| CHO-1000 | 0.48 | 1.00 | 0.34 | 0.99 | 0.94 |
| CHO-10000 | 4.85 | 1.00 | 0.03 | 0.98 | 0.95 |
| KOO-1000 | 0.48 | 0.00 | 0.00 | 0.99 | 0.94 |
| KOO-10000 | 4.85 | 1.00 | 0.08 | 0.80 | 0.94 |
| KAD-1000 | 0.48 | 1.00 | 0.54 | 0.99 | 0.94 |
| KAD-10000 | 4.85 | 1.00 | 0.34 | 0.93 | 0.92 |

(b) CH-1H

| Botnet | %Bots | BotGrep P | BotGrep R | Louvain P | Louvain R |
|---|---|---|---|---|---|
| CHO-1000 | 0.46 | 1.00 | 0.77 | 0.01 | 0.87 |
| CHO-10000 | 4.60 | 0.98 | 0.79 | 0.14 | 0.88 |
| KOO-1000 | 0.46 | 1.00 | 0.61 | 0.01 | 0.88 |
| KOO-10000 | 4.60 | 1.00 | 0.76 | 0.13 | 0.85 |
| KAD-1000 | 0.46 | 1.00 | 0.77 | 0.01 | 0.87 |
| KAD-10000 | 4.60 | 0.98 | 0.80 | 0.14 | 0.89 |

(c) WA-1D

| Botnet | %Bots | BotGrep P | BotGrep R | Louvain P | Louvain R |
|---|---|---|---|---|---|
| CHO-1000 | 0.34 | 1.00 | 0.96 | 0.01 | 0.82 |
| CHO-10000 | 3.36 | 1.00 | 0.35 | 0.09 | 0.80 |
| KOO-1000 | 0.34 | 1.00 | 0.97 | 0.01 | 0.82 |
| KOO-10000 | 3.36 | 1.00 | 0.40 | 0.08 | 0.76 |
| KAD-1000 | 0.34 | 1.00 | 0.54 | 0.01 | 0.82 |
| KAD-10000 | 3.36 | 1.00 | 0.43 | 0.07 | 0.70 |

(d) CH-1D

Table 4.2: Comparison of Louvain Modularity and BotGrep on Abilene Traces

In the CHIC-1D datasets, Figure 4.3d indicates that BotGrep outperforms the Louvain method again, but still performs poorly in this dataset. BotGrep achieves FScores between 0.4-0.7 in most cases. The Louvain method achieves FScores < 0.2 in all cases. It can be observed from Table 4.2d that BotGrep performs poorly again owing to the low values of Recall (< 0.5 in most cases), indicating parameter sensitivity, but its Precision is 1 in all cases. In the case of the Louvain method, the Recall is between 0.7-0.85, but the FScores are poor owing to the lower Precision values (< 0.2 in all cases).

Thus in both the WASH-1D and the CHIC-1D datasets, the Louvain method detects a large

number of benign nodes apart from the bots, yielding larger communities than desired. The

performance is found to be much poorer than BotGrep.

Summary

The results of the application of the Louvain method on the Abilene background graphs are summarised in Table 4.3. From the contrasting results of the application of the Louvain Method on the Abilene 1 Hour and the denser Abilene 1 Day graphs, it can be concluded that the Louvain Method is strongly affected by the density of the Background Graph. This is owing to the resolution limit present in Modularity Optimization methods in general. The resolution limit suffered by Modularity optimization algorithms, and a countermeasure proposed in the literature, are discussed in the next section.

| Background Density | Background Graph | Modularity vs BotGrep |
|---|---|---|
| Sparse | WA-1H | Comparable |
| Sparse | CH-1H | Better |
| Dense | WA-1D | Worse |
| Dense | CH-1D | Worse |

Table 4.3: Performance Summary of Modularity Optimization using the Louvain Method, with reference to BotGrep

4.5 Community Detection at different resolutions and Multiresolution Modularity

4.5.1 Resolution Limit and Multiresolution Modularity

Modularity Optimization has been shown to suffer from a resolution limit by Fortunato and Barthelemy [19]. Fortunato and Barthelemy show that the communities detected by modularity optimization methods such as the Louvain Method may be large and made up of loosely connected smaller but denser communities. These smaller communities thus will not be identified by the Modularity optimization method. It was also shown that a subgraph is considered a good community according to modularity if

$$\mathrm{IntEdges}(C) > \frac{\mathrm{Vol}(C)^2}{2|E|} \tag{4.5}$$

where $\mathrm{Vol}(C) = \sum_{i \in C} d_i$ is the sum of degrees of the nodes in the subgraph C and $\mathrm{IntEdges}(C) = \sum_{i,j \in C} A_{ij}$ is the number of internal edges in subgraph C.

The above condition is therefore a function of the Internal number of edges of the subgraph,

as well as the number of connections the subgraph has to the rest of the nodes in the graph.

Thus the size can be controlled by modifying the definition of Modularity to incorporate a pa-

rameter that controls the internal density of the community, allowing it to detect communities

at multiple resolutions or scales.
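The condition in Equation 4.5 can be checked directly on a small graph. The sketch below is illustrative only: the adjacency-list representation and the function names are assumptions, and IntEdges is evaluated as the double sum over ordered pairs (i, j), so each internal edge contributes twice.

```python
def int_edges(adj, community):
    """Sum of A_ij over ordered pairs (i, j) inside the community."""
    members = set(community)
    return sum(1 for i in members for j in adj[i] if j in members)


def vol(adj, community):
    """Sum of the degrees of the nodes in the community."""
    return sum(len(adj[i]) for i in community)


def is_community(adj, community, num_edges):
    """Evaluate IntEdges(C) > Vol(C)^2 / (2|E|) for a subgraph C."""
    return int_edges(adj, community) > vol(adj, community) ** 2 / (2 * num_edges)


# Two triangles joined by the single edge 2-3, so |E| = 7.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```

For the triangle {0, 1, 2} the left-hand side is 6 and the right-hand side is 49/14 = 3.5, so the condition holds. For the whole graph both sides equal 14, so the strict inequality fails.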

4.5.2 Stability and Stability Optimization

Several equivalent definitions of multiresolution modifications to modularity have been intro-

duced in order to explore communities at various scales or sizes and have been surveyed by

Traag et al. [70]. This chapter will focus on the Stability proposed by Lambiotte et al. [36],

which incorporated the resolution parameter t into the definition of Modularity to obtain

$$Q_{\mathrm{Stability}}(P, t) = \frac{1}{2|E|} \sum_{C \in P} \sum_{i,j \in C} \left[ t A_{ij} - \frac{d_i d_j}{2|E|} \right] \tag{4.6}$$

With the introduction of the weighting parameter t in Equation 4.6, the condition in Equation 4.5 becomes

$$\mathrm{IntEdges}(C) > \frac{\mathrm{Vol}(C)^2}{2|E|\,t} \tag{4.7}$$

which is the condition that a subgraph C is a community according to Stability at a given value of t. Thus as the value of t is increased, larger communities are given preference by Stability, as the number of internal edges IntEdges(C) required by the condition becomes smaller. Conversely, as t is decreased, smaller communities are preferred, as the number of internal edges as compared to the total degree of the community must be higher. At t = 1, the original definition of Modularity is obtained. Stability can be optimized using the Louvain method [59] with trivial modifications to Equation 4.1, keeping the time complexity unaltered.
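The effect of the resolution parameter can be illustrated by evaluating Stability for different partitions of a toy graph. The sketch below is an assumption-laden illustration (the function name and the adjacency-list input are hypothetical); it uses the per-community form of Equation 4.6, in which the double sum over $d_i d_j$ reduces to $\mathrm{Vol}(C)^2$.

```python
def stability(adj, partition, t=1.0):
    """Stability of a partition at resolution t (Modularity when t = 1)."""
    two_m = sum(len(neighbours) for neighbours in adj.values())  # 2|E|
    q = 0.0
    for community in partition:
        members = set(community)
        # Internal edges counted over ordered pairs, as in the double sum.
        internal = sum(1 for i in members for j in adj[i] if j in members)
        volume = sum(len(adj[i]) for i in members)
        q += t * internal / two_m - (volume / two_m) ** 2
    return q


# Two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
triangles = [{0, 1, 2}, {3, 4, 5}]
singletons = [{n} for n in adj]
```

On this graph the two-triangle partition scores higher than the singleton partition at t = 1, while at t = 0.25 the singleton partition scores higher, which is consistent with smaller communities being preferred as t is decreased.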

4.6 Optimization of Stability using the Louvain Method to

identify Structured P2P Botnets

It is concluded in Section 4.4.2 that the density of the background affected the performance of Modularity Optimization using the Louvain Method to detect structured P2P botnets. In Equation 4.5, the total degree or Vol(C) of a subgraph C is a function of the density of the background graph; thus as the background graph gets denser, the number of internal edges needed by a subgraph to be a valid community according to modularity increases. Thus if the botnet graph is kept constant and the density of the background increases, the botnet graph will tend to be detected inside a much larger community containing many benign nodes as well. This provides an explanation for the failure of the Louvain method in detecting the embedded structured P2P botnet subgraphs in the case of the denser Abilene 1 Day Traces.

Therefore Stability optimization by the Louvain Method could be used to detect structured P2P botnets in the case of the denser 1 Day Abilene Graphs by setting the value of t < 1, which will force Stability to prefer smaller, denser communities (Equation 4.7). Thus the experiments in Section 4.4 can be repeated by replacing modularity as the objective function with Stability.

In this section the Louvain method shall be used to optimize Stability at different values of t in order to identify structured P2P botnets.

4.6.1 Datasets and Evaluation

As in Section 4.4.1, the same datasets used for the experiments in Section 4.4 are considered. The Abilene 1 Hour and 1 Day Traces are considered for this experiment; thus the WA-1H, CH-1H, WA-1D and CH-1D graphs are used as the background (Section 4.3.2). Botnets of each of the three topologies of sizes 1000 and 10000 (Section 4.3.3) are embedded. The accuracy is quantified by Precision, Recall and FScore as per Section 4.4.1.

4.6.2 Results and Discussion

The Louvain method is run to optimize Stability at different values of t on each of the 24 graphs considered, and the Precision, Recall and FScore for the cluster with the largest number of bots are computed. The values of t chosen range from $2^{-6} = 0.015625$ to $2^{-2} = 0.25$, increasing by a factor of 4 at each step. Figure 4.4 shows the FScore achieved for the different values of t, and Table 4.4 shows the Precision and Recall values.
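The resolution values used in this experiment form a short geometric schedule, which could be generated as follows. The helper name and its default arguments are hypothetical and simply reproduce the three values stated above.

```python
def resolution_schedule(lo_exp=-6, hi_exp=-2, step=2):
    """Powers of two from 2**lo_exp to 2**hi_exp, each 2**step times the previous."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


t_values = resolution_schedule()  # [0.015625, 0.0625, 0.25]
```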

Precision

From Table 4.4 a global pattern can be observed: across all background graphs (WA-1H, CH-1H, WA-1D and CH-1D), all topologies (CHORD, KOORDE and KADEMLIA), all sizes of botnets (1000 and 10000) and all values of t (0.25, 0.0625, 0.015625), the Precision is very high (0.85 or above in all but two cases). This indicates that Stability Optimization produces pure communities at values of t ≤ 0.25.

Recall and FScore

From Table 4.4 and Figure 4.4 it can be observed that there are large differences in the Recall

achieved by Stability Optimization. The discussion is broken up according to the value of t

and further broken up according to the size of the Botnet.

• t = 0.015625: For this value of t, it can be observed that the Recall is low (< 0.5 in most cases) across backgrounds, botnet topologies and sizes. Consequently the FScore for this value of t is also low (< 0.5 in most cases).

• t = 0.0625: For this value of t, it can be observed that there are differences in Recall depending on the size of the botnet.

– Small Botnets: It can be observed that the Recall is high (> 0.9 in most cases) for the 1000 node botnet graphs of all topologies (CHORD, KOORDE and KADEMLIA); correspondingly the FScores are high for the 1000 node botnet graphs (> 0.9).

– Large Botnets: It can be observed that the Recall is very low for the 10000 node botnet graphs (< 0.15). Correspondingly the FScores are low for the 10000 node botnet graphs (< 0.2).

• t = 0.25: For this value of t, it can be observed that the trend is very similar to that observed with t = 0.0625, with low Recall and FScores for the 10000 node botnets, and high Recall and FScores for the 1000 node botnets. However, Stability Optimization at t = 0.25 outperforms the case of t = 0.0625 for the 10000 node botnets (Recall > 0.3 and FScore > 0.4) and performs only slightly worse than the case of t = 0.0625 for the 1000 node botnets, and is thus the most consistent among the values experimented with.

Figure 4.4: Performance of Stability Optimization on Abilene Trace Graphs. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.

| Botnet | %Bots | P ($t = 2^{-6}$) | R ($t = 2^{-6}$) | P ($t = 2^{-4}$) | R ($t = 2^{-4}$) | P ($t = 2^{-2}$) | R ($t = 2^{-2}$) |
|---|---|---|---|---|---|---|---|
| CHO-1000 | 0.84 | 0.98 | 0.16 | 0.93 | 0.54 | 0.93 | 0.98 |
| CHO-10000 | 8.39 | 0.98 | 0.01 | 0.98 | 0.04 | 0.89 | 0.24 |
| KOO-1000 | 0.84 | 0.98 | 0.18 | 0.97 | 0.96 | 0.93 | 0.98 |
| KOO-10000 | 8.39 | 0.87 | 0.01 | 0.85 | 0.02 | 0.98 | 0.41 |
| KAD-1000 | 0.84 | 0.97 | 0.20 | 0.93 | 0.57 | 0.85 | 0.97 |
| KAD-10000 | 8.39 | 0.98 | 0.01 | 0.92 | 0.04 | 0.86 | 0.17 |

(a) WA-1H

| Botnet | %Bots | P ($t = 2^{-6}$) | R ($t = 2^{-6}$) | P ($t = 2^{-4}$) | R ($t = 2^{-4}$) | P ($t = 2^{-2}$) | R ($t = 2^{-2}$) |
|---|---|---|---|---|---|---|---|
| CHO-1000 | 0.48 | 0.99 | 0.27 | 0.97 | 0.97 | 0.94 | 0.97 |
| CHO-10000 | 4.85 | 0.97 | 0.02 | 0.96 | 0.07 | 0.97 | 0.28 |
| KOO-1000 | 0.48 | 0.99 | 0.35 | 0.98 | 0.98 | 0.93 | 0.97 |
| KOO-10000 | 4.85 | 0.99 | 0.02 | 0.96 | 0.03 | 0.96 | 0.66 |
| KAD-1000 | 0.48 | 1.00 | 0.22 | 0.90 | 0.98 | 0.73 | 0.97 |
| KAD-10000 | 4.85 | 0.99 | 0.02 | 0.95 | 0.08 | 0.86 | 0.32 |

(b) CH-1H

| Botnet | %Bots | P ($t = 2^{-6}$) | R ($t = 2^{-6}$) | P ($t = 2^{-4}$) | R ($t = 2^{-4}$) | P ($t = 2^{-2}$) | R ($t = 2^{-2}$) |
|---|---|---|---|---|---|---|---|
| CHO-1000 | 0.46 | 0.97 | 0.37 | 0.97 | 0.97 | 0.98 | 0.93 |
| CHO-10000 | 4.60 | 0.98 | 0.03 | 0.99 | 0.08 | 0.98 | 0.39 |
| KOO-1000 | 0.46 | 0.99 | 0.68 | 0.99 | 0.98 | 0.99 | 0.95 |
| KOO-10000 | 4.60 | 0.98 | 0.02 | 0.93 | 0.04 | 0.98 | 0.89 |
| KAD-1000 | 0.46 | 0.98 | 0.22 | 0.86 | 0.98 | 0.97 | 0.93 |
| KAD-10000 | 4.60 | 0.99 | 0.02 | 0.96 | 0.09 | 0.97 | 0.31 |

(c) WA-1D

| Botnet | %Bots | P ($t = 2^{-6}$) | R ($t = 2^{-6}$) | P ($t = 2^{-4}$) | R ($t = 2^{-4}$) | P ($t = 2^{-2}$) | R ($t = 2^{-2}$) |
|---|---|---|---|---|---|---|---|
| CHO-1000 | 0.34 | 0.98 | 0.35 | 0.94 | 0.95 | 0.90 | 0.85 |
| CHO-10000 | 3.36 | 1.00 | 0.05 | 1.00 | 0.13 | 0.99 | 0.91 |
| KOO-1000 | 0.34 | 0.99 | 0.84 | 0.96 | 0.95 | 0.99 | 0.93 |
| KOO-10000 | 3.36 | 0.96 | 0.02 | 0.98 | 0.07 | 0.38 | 0.87 |
| KAD-1000 | 0.34 | 0.99 | 0.39 | 0.97 | 0.95 | 0.96 | 0.86 |
| KAD-10000 | 3.36 | 0.99 | 0.04 | 0.94 | 0.09 | 0.96 | 0.58 |

(d) CH-1D

Table 4.4: Optimizing Stability at different values of t on Abilene Traces using the Louvain Method

Comparison with BotGrep

It was observed that Stability Optimization at t = 0.25 using the Louvain Method provided the most consistent results among the values of t tested. In Figure 4.5, the FScore achieved by Stability Optimization (t = 0.25) is compared to the FScore achieved by BotGrep on the Abilene Trace Graphs. From Figure 4.5 it can be observed that Stability Optimization (t = 0.25) is comparable to BotGrep in terms of FScore in the CH-1H and CH-1D datasets. However, in the case of WA-1H (Figure 4.5a), Stability Optimization (t = 0.25) performs poorly for the 10000 node botnets and comparably for the 1000 node botnets. In the case of WA-1D (Figure 4.5c), Stability Optimization (t = 0.25) outperforms BotGrep on the 1000 node botnet graphs, and performs worse than BotGrep on the 10000 node botnet graphs.

Conclusion

Thus from the results discussed above, it can be concluded that Stability Optimization is more sensitive to the size of the botnet graph, and less sensitive to the density of the background graph. As it is sensitive to the size of the botnet graph, it will be difficult to identify a single value of t that is able to detect all sizes of botnets. Thus the Louvain method has to be run to optimize Stability at several values of t, increasing the computational cost of the process. Further, some method is needed to terminate the parameter search for t by identifying that a detected community consists maximally of botnet nodes.

4.7 Summary

In this Chapter, the Louvain Method is discussed in detail. The dataset generation procedure is described, which includes the embedding of a botnet graph onto a background graph. Undirected unweighted background graphs were constructed from real world Network Traces with nodes as IP Addresses, and the presence of an edge between two nodes indicating that a packet has been sent. The background graphs were of different densities based on the duration of capture of the Network Trace. The botnet graphs were generated to model structured P2P based Command and Control Flow. Edges were added according to the routing table at each node. The Louvain method is used to optimize Modularity in order to detect the embedded structured P2P botnet graphs from the created dataset. It was then found that Modularity optimization was sensitive to the density of the background. In order to overcome this dependence, the Louvain method was then used to optimize Stability at various values of the resolution parameter t < 1 on the created datasets. It was found that the value of t = 0.25 was the most consistent across the datasets. It is found that Stability Optimization is sensitive to the size of the Botnet. It is concluded that the method has to be run several times at different values of t in order to detect botnets of all sizes, increasing the computational complexity. Both the Modularity Optimization and the Stability Optimization methods were compared with BotGrep on the same datasets; the results are summarised in Table 4.5.

Figure 4.5: Comparison of Stability Optimization (t=0.25) and BotGrep on Abilene Trace Graphs. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.

| Botnet Size | Background Density | Background Graph | Modularity vs BotGrep | Stability (t=0.25) vs BotGrep |
|---|---|---|---|---|
| Small | Sparse | WA-1H | Comparable | Comparable |
| Small | Sparse | CH-1H | Better | Better |
| Small | Dense | WA-1D | Worse | Better |
| Small | Dense | CH-1D | Worse | Comparable |
| Large | Sparse | WA-1H | Comparable | Worse |
| Large | Sparse | CH-1H | Better | Worse |
| Large | Dense | WA-1D | Worse | Worse |
| Large | Dense | CH-1D | Worse | Worse |

Table 4.5: Performance Summary of Modularity Optimization vs Stability Optimization (t=0.25) using the Louvain Method, with reference to BotGrep

It is found that neither Modularity Optimization nor Stability Optimization was robust and general. However, the optimization of Stability at values of t < 1 is found to result in small but homogeneous communities. This property is further explored in the next chapter in order to arrive at an alternative approach that is robust and efficient in the detection of structured P2P botnets.

Chapter 5

A robust algorithm for identification of

Structured P2P Botnets

5.1 Introduction

In Chapter 4 the Louvain method is used to optimize Modularity and Stability in order to

detect structured P2P Botnets. It is observed that both methods had shortcomings in terms

of sensitivity to the density of the background or the size of the botnet. It is also observed

that Stability Optimization at values of the resolution parameter t < 1 resulted in small and

homogeneous communities of either only botnet nodes or only benign nodes. This property

can be exploited if a technique is developed in order to filter out the benign communities. In

this chapter a robust algorithm to detect structured P2P botnets based on the filtering of the

benign communities is proposed.

The method of obtaining the homogeneous communities robustly and efficiently is discussed

in Section 5.2. In order to identify the bot communities, they must be differentiated from

the communities of benign nodes. Section 5.3 defines a novel measure, mean regular degree

or mreg which is able to exploit the properties of structured P2P botnet graphs in order to

distinguish them from benign communities. Section ?? describes the proposed algorithm that combines the use of greedy community detection and community filtering using mreg in order to identify nodes that are a part of a structured P2P botnet. The performance and robustness of the proposed algorithm are comprehensively validated in Section 5.5; it is found to be comparable in accuracy to BotGrep [44] in most cases, and significantly faster in runtime.

5.2 Obtaining Small and Homogeneous Communities

As discussed in Chapter 4, it is observed that Stability Optimization at values of the resolution parameter t < 1 resulted in small and homogeneous communities; however, there were inconsistencies in the performance of Stability Optimization across different values of t for different sizes of botnets. The search for the correct value of t would increase the computational time, as the Louvain method would have to be run several times at different values of t. Thus a non-parametric objective function which behaves similarly to Stability at values of t < 1 is desired.

5.2.1 An alternative Objective Function - Qw−log−v

In a recent paper, Van Laarhoven and Marchiori [73] compare the performance of various objective functions optimized by the Louvain method in order to study their resolution biases. In this work they also introduce a new objective function Qw−log−v. The Louvain method is a simple, general algorithm that can be used to optimize objective functions other than Modularity and Stability. The authors observed that optimizing the Qw−log−v objective function using the Louvain method produces a partition with communities smaller than those obtained by the optimization of modularity. This objective function is defined as

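$$Q_{w\text{-}\log\text{-}v}(P) = -\sum_{C \in P} \left( \sum_{i,j \in C} \frac{A_{ij}}{2|E|} \right) \log \left( \sum_{i \in C} \frac{d_i}{2|E|} \right) \tag{5.1}$$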

Van Laarhoven and Marchiori [72] have used the Louvain method to optimize Qw−log−v suc-

cessfully in the area of Protein Interaction Networks and found that it outperforms Modularity

Optimization.

This objective function is named so by the authors owing to its dependence on $w = \sum_{i,j \in C} \frac{A_{ij}}{2|E|}$ and $\log v = \log\left(\sum_{i \in C} \frac{d_i}{2|E|}\right)$. The w term takes values between zero and one, and the log v term takes negative values as it is the logarithm of v, which varies between zero and one. As the negative value of the function is maximised, the objective function will score high for communities that have a small total degree, and have most of the total degree of the community accounted for by the internal edges. When all the nodes of the graph are considered as one community, Qw−log−v takes the value 0.
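The behaviour of this objective function can be checked on a toy graph. The following sketch is illustrative only: the function name and the adjacency-list input are assumptions, and w is evaluated as the double sum over ordered pairs, as written in Equation 5.1.

```python
import math


def q_w_log_v(adj, partition):
    """Evaluate the w-log-v objective for a partition of an undirected graph."""
    two_m = sum(len(neighbours) for neighbours in adj.values())  # 2|E|
    q = 0.0
    for community in partition:
        members = set(community)
        w = sum(1 for i in members for j in adj[i] if j in members) / two_m
        v = sum(len(adj[i]) for i in members) / two_m
        if w > 0:  # a community without internal edges contributes nothing
            q -= w * math.log(v)
    return q


# Two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```

Placing all nodes in one community gives w = v = 1 and hence a value of 0, while splitting the graph into its two triangles gives a positive value.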

The objective function can be optimized by the Louvain method with appropriate changes to

the Greedy Optimization Step described in Algorithm 2 in Chapter 4. In this step, after initial-

ization of each node in its own community, each node is iteratively moved to the community

that causes the greatest gain in Qw−log−v. This yields Algorithm 3.

Algorithm 3: GreedyOptimizationQw−log−v
Input : The graph G = (V, E)
Output: The partition P of G into communities
begin
    initialize P such that each node is in its own community
    Qprev ← −∞
    Qcurrent ← Qw−log−v(P)
    while Qcurrent > Qprev do
        Qprev ← Qcurrent
        for node i ∈ V do
            MaxGain ← 0
            for node j adjacent to i do
                MaxGain ← max(MaxGain, ∆Qw−log−v(i, j))
            end
            if MaxGain > 0 then move the node i to the community that results in MaxGain
        end
        Qcurrent ← Qw−log−v(P)
    end
end

The gain obtained in moving a node i from its community to that of node j in the case of Qw−log−v is written as

$$
\begin{aligned}
\Delta Q_{w\text{-}\log\text{-}v}(i, j) = -\Bigg( & \frac{\mathrm{IntEdges}(C_j) + \mathrm{IntEdgesNode}(C_j, i)}{2|E|} \log\left(\frac{\mathrm{Vol}(C_j) + d_i}{2|E|}\right) \\
& + \frac{\mathrm{IntEdges}(C_i) - \mathrm{IntEdgesNode}(C_i, i)}{2|E|} \log\left(\frac{\mathrm{Vol}(C_i) - d_i}{2|E|}\right) \\
& - \frac{\mathrm{IntEdges}(C_j)}{2|E|} \log\left(\frac{\mathrm{Vol}(C_j)}{2|E|}\right) - \frac{\mathrm{IntEdges}(C_i)}{2|E|} \log\left(\frac{\mathrm{Vol}(C_i)}{2|E|}\right) \Bigg)
\end{aligned} \tag{5.2}
$$

where $\mathrm{Vol}(C) = \sum_{i \in C} d_i$ is the sum of degrees of the nodes in the community C, $\mathrm{IntEdgesNode}(C, i) = \sum_{k \in C} A_{ik}$ is the number of edges node i has to other nodes in community C, and $\mathrm{IntEdges}(C) = \sum_{i,j \in C} A_{ij}$ is the number of internal edges in community C.

This gain can be computed in constant time by additionally pre-computing and maintaining

the IntEdges(Ci) and V ol(C) for each community and updating this when a node is moved.

The IntEdgesNode(C, i) is computed during the iteration over the neighbours of a node, thus

the time complexity of GreedyOptimizationQw−log−v is O(|E|). As the Louvain method

is a multi-level method whose time complexity depends on the greedy optimization step, its

complexity when used to optimize Qw−log−v is also O(|E|).
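A compact sketch of the single-level greedy optimization is given below. It is not the thesis implementation: the function name, the adjacency-list input and the `max_sweeps` cap are hypothetical. Two simplifications are assumed: the loop stops when a full sweep moves no node (instead of comparing successive values of the objective), and internal edges are counted over ordered pairs, so moving a node with k links into a community changes its internal count by 2k. The graph is assumed to be simple (no self-loops or parallel edges).

```python
import math


def greedy_q_w_log_v(adj, max_sweeps=100):
    """Single-level greedy optimization of the w-log-v objective."""
    two_m = sum(len(neighbours) for neighbours in adj.values())
    comm = {i: i for i in adj}                # node -> community label
    internal = {i: 0 for i in adj}            # ordered-pair internal edge count
    volume = {i: len(adj[i]) for i in adj}    # sum of degrees per community

    def term(int_c, vol_c):
        # Contribution of one community to the objective.
        return -(int_c / two_m) * math.log(vol_c / two_m) if int_c > 0 else 0.0

    for _ in range(max_sweeps):
        moved = False
        for i in adj:
            ci, di = comm[i], len(adj[i])
            links = {}                        # community -> number of edges from i
            for j in adj[i]:
                if j != i:
                    links[comm[j]] = links.get(comm[j], 0) + 1
            k_own = links.get(ci, 0)
            # Change in the objective caused by removing i from its community.
            leave = (term(internal[ci] - 2 * k_own, volume[ci] - di)
                     - term(internal[ci], volume[ci]))
            best_gain, best_c = 0.0, ci
            for cj, k in links.items():
                if cj == ci:
                    continue
                gain = (leave + term(internal[cj] + 2 * k, volume[cj] + di)
                        - term(internal[cj], volume[cj]))
                if gain > best_gain + 1e-12:
                    best_gain, best_c = gain, cj
            if best_c != ci:
                # Update the maintained totals in constant time per move.
                internal[ci] -= 2 * k_own
                volume[ci] -= di
                internal[best_c] += 2 * links[best_c]
                volume[best_c] += di
                comm[i] = best_c
                moved = True
        if not moved:
            break
    return comm
```

The per-community totals are updated incrementally on every move, which is what allows each gain to be evaluated in constant time as described above.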

5.2.2 Optimizing Qw−log−v - Louvain method vs Single Step Greedy Optimization

Qw−log−v can be optimized using the greedy procedure outlined in Algorithm 3 to obtain a

partition of the graph into communities. As discussed in Chapter 4, the Louvain method is

a multi-level optimization procedure that uses the greedy optimization at every level. The

partition obtained at a level is used to generate a weighted graph for the next level by collapsing

the communities into nodes. This effectively results in the merging of small communities that

are formed at each level.

It is of interest to obtain small and homogeneous communities, thus the use of the multi-level

Louvain method may reduce the homogeneity of the communities by the merging of the small

communities at higher levels. An experimental validation of this is performed here.

Dataset and Evaluation

The procedure for generating the Datasets is described in Section 4.3. The Abilene WASH 1 Hour and 1 Day Trace Graphs are superimposed separately with a 1000 or 10000 node CHORD Botnet to create 4 Datasets: WA-1H-CHO-1000, WA-1H-CHO-10000, WA-1D-CHO-1000 and WA-1D-CHO-10000.

The objective is to study the homogeneity of the communities produced by the optimization of Qw−log−v by the Louvain and the Greedy Optimization methods. In order to capture the homogeneity, the Precision is averaged over the communities containing at least one bot. Let B be this set of communities. Then

$$\mu\text{Precision}_B(B) = \frac{1}{|B|} \sum_{C \in B} P(C) \tag{5.3}$$

where

$$\text{Precision}\; P(C) = \frac{\#\text{ of bots in community } C}{|C|}$$
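The mean precision of Equation 5.3 could be computed as in the following sketch. The function name is hypothetical, and returning 0.0 when no community contains a bot is an assumption made only to keep the function total.

```python
def mean_bot_precision(communities, bots):
    """Average Precision over the communities that contain at least one bot."""
    bots = set(bots)
    precisions = [
        len(bots & set(c)) / len(c)
        for c in communities
        if bots & set(c)  # keep only communities with at least one bot
    ]
    return sum(precisions) / len(precisions) if precisions else 0.0
```

For communities {1, 2}, {3, 9} and {7, 8} with bots {1, 2, 3}, the two bot-containing communities have Precision 1.0 and 0.5, giving a mean of 0.75.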

Results and Discussion

The Louvain method (optimizing Qw−log−v) and the GreedyOptimizationQw−log−v method are applied to the WA-1H-CHO-1000, WA-1H-CHO-10000, WA-1D-CHO-1000 and WA-1D-CHO-10000 datasets, and the results are tabulated in Table 5.1. P represents the set of communities, whose size is denoted by |P|, and B ⊂ P represents the communities with at least 1 bot, whose size is denoted by |B|. The mean precision is computed for the set B according to Equation 5.3.

It can be observed from Table 5.1a that the average precision µPrecisionB of the botnet com-

munities is less than 0.6 in all cases when Qw−log−v is optimized using the Louvain method.

From Table 5.1b it can be seen that µPrecisionB is greater than 0.6 in all cases, confirming the

hypothesis that the merging procedure reduces the homogeneity of the botnet communities.

Thus when homogeneous communities are desired, GreedyOptimizationQw−log−v (Algorithm 3) is the better candidate. This also offers an additional improvement in computational speed when compared to the original Louvain method.

| Dataset | \|P\| | \|B\| | µPrecisionB |
|---|---|---|---|
| WA-1H-CHO-1000 | 1720 | 24 | 0.37 |
| WA-1H-CHO-10000 | 1429 | 185 | 0.59 |
| WA-1D-CHO-1000 | 109 | 13 | 0.08 |
| WA-1D-CHO-10000 | 84 | 31 | 0.45 |

(a) Optimization using Louvain

| Dataset | \|P\| | \|B\| | µPrecisionB |
|---|---|---|---|
| WA-1H-CHO-1000 | 8753 | 74 | 0.67 |
| WA-1H-CHO-10000 | 8165 | 735 | 0.74 |
| WA-1D-CHO-1000 | 2809 | 65 | 0.74 |
| WA-1D-CHO-10000 | 3150 | 559 | 0.85 |

(b) Single Step Greedy Optimization

Table 5.1: Community Structure obtained by optimization of Qw−log−v using the Louvain method on CHORD Botnets embedded in Abilene WASH Router Trace Graphs

5.3 Differentiating between bot and benign communities

5.3.1 Properties of Structured P2P Botnets vs Properties of the Back-

ground

The subgraph corresponding to Structured P2P nodes is near regular, i.e. the nodes have similar degrees, also implying that it is assortative (Section 3.2.6), i.e. nodes connect to other nodes with similar degrees. It has been shown by Yen and Reiter [79] that assortativity makes the botnet robust and resistant to take down. This is in contrast with the background graph, which is mostly dominated by nodes participating in client-server traffic. The background graphs exhibit power law degree distributions (Section 3.2.4), indicating the presence of hubs that attract a lot of connections. The hubs or high degree nodes are popular servers that attract a lot of connections from Internet hosts, and they rarely connect to each other. This results in the graph being disassortative, as the low degree nodes connect to dissimilar higher degree nodes.

The subgraph corresponding to Structured P2P nodes should have a relatively higher value of

density, when compared to the rest of the graph as they have to be robust against node failures

by adding redundant paths.

5.3.2 Properties of the small and homogeneous communities obtained by

the greedy optimization of Qw−log−v

As discussed in Section 5.2.2, the single step greedy optimization of Qw−log−v results in a large number of small communities. This can be viewed as the result of an incomplete or partial optimization of the function. Owing to this incomplete optimization, the structured P2P botnet will fragment into smaller pieces and the high degree nodes or hubs tend to form small communities, dragging in their immediate neighbourhood consisting of the adjacent nodes.

The nodes adjacent to hubs may be benign or bots as computers infected with bots also partic-

ipate in client server traffic that is not malicious. However the bot nodes, which are a part of

a structured P2P subgraph will behave differently from the benign nodes. The benign nodes

will attach to the hubs forming star-like communities. The subgraph corresponding to such

communities will be disassortative as well as relatively sparse.

The bot nodes will be pulled in to a community with other bot nodes owing to the high internal

connectivity and resist being moved into the community of the hub. At the same time the hub

will resist joining the community of the bot nodes as it will tend to increase the total degree of

the community owing to its high degree. Due to the symmetry associated with the near regular

nature of the whole structured P2P graph, the pieces of the structured P2P botnet will tend to

break into fragments that are also near regular. This makes the subgraph corresponding to the

pieces or fragments of the botnet assortative, and relatively denser than to star-like communi-

ties owing to the presence of redundant paths between the nodes.

Thus in order to differentiate between communities of bots and communities of benign nodes,

the differences between the density and the assortativity of the corresponding subgraphs can

be exploited.

5.3.3 Mean Regular Degree mreg

A novel measure called mean regular degree or mreg is defined in this thesis, which measures

the mean number of connections of the node to other nodes with similar degrees.

$$ m_{reg}(G) = \frac{1}{|V|} \sum_{(i,j) \in E} \frac{1}{1 + |d_i - d_j|} \tag{5.4} $$

The above measure mreg accounts for the degree correlation or assortativity (Section 3.2.6) of a graph with the denominator term that depends on the difference of the degrees. This measure is also proportional to the total degree of the community owing to the sum over all the edges, accounting for the density. The time taken to evaluate this metric is O(|E|) as each edge is considered only once during the computation, allowing it to be computed efficiently.

Properties of mean regular degree

For a k-regular graph (the degree of each node is k), every term of the sum equals 1 and each undirected edge is counted once in each direction, so that $m_{reg} = \frac{1}{|V|} \sum_{(i,j) \in E} 1 = \frac{2|E|}{|V|} = \frac{k|V|}{|V|} = k$. Thus a ring graph (k = 2) of any size will have mreg = 2, and a clique or complete graph of size |V|, which is (|V| − 1)-regular, will have mreg = |V| − 1. Thus the value of mreg increases as the degree of each node increases.

For a star graph of size |V|, $m_{reg} = \frac{1}{|V|} \sum_{(i,j) \in E} \frac{1}{1 + |V| - 2} = \frac{2|E|}{|V|(|V| - 1)} = \frac{2(|V| - 1)}{|V|(|V| - 1)} = \frac{2}{|V|}$. Thus for a star graph, the value of mreg decreases as the size of the graph increases.

A dyad (two nodes with a single edge), which is both a 1-regular graph and a star graph of 2 nodes, will have mreg = 1. A star-like graph with more than 2 nodes will have small values of mreg < 1.
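As an illustration of Equation 5.4 and the properties listed above, a minimal Python sketch of mreg is given here. It is not the implementation used in this thesis: the function name mean_regular_degree and the edge-list input are assumptions made for the example, and |V| is taken to be the number of nodes that appear in at least one edge.

```python
from collections import defaultdict


def mean_regular_degree(edges):
    """Sketch of m_reg (Equation 5.4) for an undirected simple graph.

    `edges` is an iterable of (u, v) pairs; duplicates, orientation and
    self-loops are ignored (an assumption of this example).
    """
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = defaultdict(int)
    for e in undirected:
        for node in e:
            degree[node] += 1
    if not degree:
        return 0.0
    total = 0.0
    for e in undirected:
        u, v = tuple(e)
        # Each undirected edge is counted once per direction, hence the 2.
        total += 2.0 / (1 + abs(degree[u] - degree[v]))
    return total / len(degree)


# Small hypothetical test graphs
ring = [(i, (i + 1) % 6) for i in range(6)]                    # 2-regular
star = [(0, i) for i in range(1, 6)]                           # hub plus 5 leaves
clique = [(i, j) for i in range(4) for j in range(i + 1, 4)]   # complete graph on 4 nodes
dyad = [(0, 1)]
```

Under these assumptions the ring evaluates to 2, the 6-node star to 2/6, the 4-node clique to 3 and the dyad to 1, in line with the properties stated above.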

5.4 Robust and efficient method to identify nodes that are

part of structured P2P Botnet

A recent paper by Iliofotou et al. [28] proposed a method to classify different types of traffic flows based only on the topological characteristics of the IP-IP traffic graph. The homophily (like-attracts-like) among the same classes of traffic is exploited in this work. Given a set of flows which have been labelled, they classify traffic by developing a method based on community detection and a homophily based link classifier. They use the Louvain method for community detection, applying it recursively to obtain smaller and homogeneous clusters containing predominantly a single type of seed. Their algorithm is also able to classify P2P traffic given seed flows labelled as P2P. Their aim is to perform traffic flow classification.

In this thesis the aim is to specialize their approach to detect structured P2P botnets by exploiting the topological characteristics of structured P2P subgraphs instead of relying on labelled flows. Further, the recursive application of the Louvain method is avoided by using the greedy optimization of Qw−log−v. This motivates a method that relies on the homogeneous communities thus obtained and uses mreg to discard unlikely communities and arrive at a set consisting of nodes that are a part of a structured P2P botnet.

The proposed algorithm consists of two stages. In the first stage, communities are obtained

by greedy optimization of Qw−log−v followed by the discarding of benign communities using

mreg. In the second stage, the communities retained from the first stage are collapsed to form

a weighted graph, and greedy modularity optimization is carried out followed by another fil-

tering of communities in order to obtain a final set of nodes that are a part of a structured P2P

Botnet.

5.4.1 Stage 1

Stage 1 of the proposed algorithm runs GreedyOptimizationQw−log−v, which results in homogeneous communities (Section 5.3.2). The mreg of each community is obtained by extracting its corresponding subgraph and computing it according to Equation 5.4. Directly retaining only the communities with mreg > 1 (based on the properties in Section 5.3.3) would result in poor recall: owing to the incomplete optimization (Section 5.3.2), some botnet communities may be too small, and with values of mreg < 1 they are indistinguishable from benign communities. Therefore these communities have to be given another chance to merge with other bot communities and increase their mreg value. Communities with mreg greater than the median value of mreg are selected for the next stage. This median value will be less than 1 as the number of structured P2P bots is less than 50% of the total nodes of the graph. The median value of mreg is preferred over the mean value over all community subgraphs because the sizes of the communities are skewed, so the mean mreg would be strongly affected by the larger and more regular communities. Stage 1 is outlined in Algorithm 4.

Algorithm 4: Proposed Algorithm - Stage 1
Input : The graph G
Output: A set of selected candidate bot communities Pselected
begin
    P ←− GreedyOptimizationQw−log−v(G)
    for community Ci ∈ P do
        GCi = SubGraph(G, Ci)
        mCi = mreg(GCi)
    end
    mmed = Median(mC1 ... mC|P|)
    Pselected = {Ci : mCi > mmed}
end
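The selection step of Stage 1 can be sketched in Python as follows. The sketch assumes that the mreg value of every community has already been computed; the function name select_candidate_communities, the dictionary input and the numeric values are hypothetical and serve only to illustrate the median rule.

```python
from statistics import median


def select_candidate_communities(mreg_by_community):
    """Keep the communities whose m_reg exceeds the median m_reg.

    `mreg_by_community` maps a community id to its m_reg value
    (a hypothetical input format chosen for this example).
    """
    if not mreg_by_community:
        return set()
    m_med = median(mreg_by_community.values())
    return {c for c, m in mreg_by_community.items() if m > m_med}


# Hypothetical m_reg values: three star-like and two near regular communities
example_mreg = {"c0": 0.2, "c1": 0.4, "c2": 0.5, "c3": 1.8, "c4": 2.6}
selected = select_candidate_communities(example_mreg)
```

The median rather than a fixed threshold is used so that small bot fragments with mreg slightly below 1 can still be carried into Stage 2.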

5.4.2 Stage 2

In order to allow some of the small botnet communities to merge, GreedyOptimizationQModularity is run on the weighted graph obtained by collapsing the communities retained from Stage 1. Here QModularity is used as the objective function as it allows for larger communities than Qw−log−v. The aggressive merging caused by the optimization of modularity is reduced by the removal of many hubs in Stage 1, since hubs connect several communities together and play a role in the merging of communities. The communities thus obtained are communities of the weighted graph, and they are converted back to communities of the original graph by expanding each node of the weighted graph.
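The collapsing and expansion steps can be sketched as below. This is only an illustration in the style of the aggregation used by the Louvain method, not the routines of Section 4.2: the function names, the dictionary based graph representation and the choice of edge counts as weights are assumptions of this example.

```python
from collections import defaultdict


def aggregate_communities(edges, community_of):
    """Collapse each retained community into one super-node.

    `community_of` maps a node to the id of its retained community; nodes
    missing from it are treated as discarded in Stage 1. The weight of a
    super-edge is the number of original edges between the two communities
    (internal edges become self-loops).
    """
    weights = defaultdict(int)
    for u, v in edges:
        if u not in community_of or v not in community_of:
            continue  # at least one endpoint was discarded
        key = tuple(sorted((community_of[u], community_of[v])))
        weights[key] += 1
    return dict(weights)


def expand_partition(super_partition, community_of):
    """Map every original node to the merged community of its super-node."""
    return {node: super_partition[c] for node, c in community_of.items()}


# Hypothetical example: nodes 5 and 6 were discarded in Stage 1
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
community_of = {1: 0, 2: 0, 3: 1, 4: 1}
weighted = aggregate_communities(edges, community_of)
merged = expand_partition({0: "A", 1: "A"}, community_of)
```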

Finally the filtering step needs to be applied to separate out communities that are not very regular. As discussed in Section 5.3.3, communities of bots, being assortative and denser, will have values of mreg > 1, whereas the star-like communities of benign nodes will have values of mreg < 1, making this a good rule for filtering out benign communities. The communities with mreg ≤ 1 can be discarded at this stage.


Algorithm 5: Proposed Algorithm - Stage 2
Input : The graph G, the set of selected communities from Stage 1 Pselected
Output: A set Cfinal consisting of structured P2P botnet nodes
begin
    Gw ←− CommunityAggregation(G, Pselected)
    Pw ←− GreedyOptimizationQModularity(Gw)
    P ←− PartitionExpansion(G, Gw, Pw)
    for community Ci ∈ P do
        GCi = SubGraph(G, Ci)
        if mreg(GCi) > 1 and σdeg(GCi) < µdeg(GCi) then
            Add all nodes in Ci to Cfinal
        end
    end
end

Certain non-regular communities may still have a value of mreg ≥ 1, because mreg is also proportional to the internal degree of the community (Section 4.4.1). As structured P2P botnets are near regular, the mean internal degree $\mu_{deg}(G_C) = \frac{1}{|V_C|} \sum_{i=1}^{|V_C|} d_i$ and the standard deviation $\sigma_{deg}(G_C) = \sqrt{\left( \frac{1}{|V_C|} \sum_{i=1}^{|V_C|} d_i^2 \right) - \mu_{deg}(G_C)^2}$ of the degrees in the subgraph corresponding to each community are computed, and communities which have σdeg(GC) > µdeg(GC) are discarded. Stage 2 is outlined in Algorithm 5. The Community Aggregation and the Partition Expansion steps are discussed in Section 4.2.
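The final filter of Stage 2 can be sketched as follows. The function names degree_stats and keep_community and the list-of-degrees input are assumptions of this example, and the degrees shown are hypothetical; the test itself follows the condition in Algorithm 5.

```python
from math import sqrt


def degree_stats(degrees):
    """Mean and standard deviation of the internal degrees of a community."""
    n = len(degrees)
    mu = sum(degrees) / n
    variance = sum(d * d for d in degrees) / n - mu * mu
    # Guard against tiny negative values caused by floating point rounding.
    return mu, sqrt(max(variance, 0.0))


def keep_community(mreg, degrees):
    """Retain a community only if m_reg > 1 and sigma_deg < mu_deg."""
    mu, sigma = degree_stats(degrees)
    return mreg > 1 and sigma < mu


# Hypothetical communities: a near regular one and a hub dominated one
regular_degrees = [4, 4, 4, 4]
hub_degrees = [9, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

For the hub dominated example the mean degree is 1.8 and the standard deviation is 2.4, so the community is discarded even if its mreg happens to exceed 1.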

5.5 Evaluation

In this section a comprehensive evaluation of the proposed method will be carried out and its

performance will be compared to that of BotGrep[44]. The dataset generation procedure is

described in Section 4.3.


5.5.1 Metrics

The metrics used for evaluation consider the set Cfinal obtained from the proposed algorithm and are given by

$$ \mathrm{Precision} = \frac{\text{\# of bots in } C_{final}}{|C_{final}|} \tag{5.5} $$

$$ \mathrm{Recall} = \frac{\text{\# of bots in } C_{final}}{\text{Total \# of bots}} \tag{5.6} $$

$$ \mathrm{FScore} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5.7} $$

Precision measures the purity of the set Cfinal, Recall measures the detection rate, and the FScore, being the harmonic mean of Precision and Recall, provides a single summary statistic to evaluate performance.
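Equations 5.5 to 5.7 translate directly into a few lines of Python. The sketch below is illustrative only; the function name precision_recall_fscore and the convention of returning 0 when a denominator is zero are assumptions of this example, and the node identifiers are hypothetical.

```python
def precision_recall_fscore(c_final, bots):
    """Precision, Recall and FScore of a detected node set (Eq. 5.5 to 5.7)."""
    c_final, bots = set(c_final), set(bots)
    true_positives = len(c_final & bots)
    precision = true_positives / len(c_final) if c_final else 0.0
    recall = true_positives / len(bots) if bots else 0.0
    denominator = precision + recall
    fscore = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, fscore


# Hypothetical example: 8 bots, 5 detected nodes of which 4 are bots
detected = {1, 2, 3, 4, 9}
true_bots = {1, 2, 3, 4, 5, 6, 7, 8}
p, r, f = precision_recall_fscore(detected, true_bots)
```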

5.5.2 Performance on Abilene Trace Graphs

The proposed method and BotGrep are tested on the Abilene 1 hour and 1 day traces. The CHORD, KOORDE and KADEMLIA topologies of 1000 and 10000 nodes are considered (Section 4.3.3), and each of the 6 botnet graphs generated is embedded in each of the 4 background graphs WA-1H, CH-1H, WA-1D and CH-1D (Section 4.3.2) to create 24 distinct datasets, each with one embedded botnet graph per background. The proposed method and BotGrep are run on these datasets. The FScores (computed using Equation 5.7) achieved by the two algorithms are compared in Figure 5.1. The Precision (Equation 5.5) and Recall (Equation 5.6) values are compared in Table 5.2.

1 Hour Traces

It can be observed from Figure 5.1a that the proposed method achieves FScores > 0.9 in most cases, which is slightly inferior but comparable to BotGrep, which achieves FScores > 0.95. The slightly lower performance of the proposed method is because of its precision values (Table 5.2a), which are between 0.8 and 0.9 in most cases, whereas BotGrep achieves a precision of 1 in most cases. The recall values of both methods are almost the same, at around 0.95 in most cases.

Figure 5.1: Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs - FScore. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.84    1.00        0.95        0.88         0.96
CHO-10000    8.39    1.00        0.95        0.91         0.95
KOO-1000     0.84    1.00        0.97        0.87         0.94
KOO-10000    8.39    0.99        0.94        0.91         0.93
KAD-1000     0.84    1.00        0.96        0.81         0.96
KAD-10000    8.39    0.99        0.95        0.90         0.96
(a) WA-1H

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.48    1.00        0.34        0.97         0.96
CHO-10000    4.85    1.00        0.03        0.95         0.96
KOO-1000     0.48    0.00        0.00        0.96         0.96
KOO-10000    4.85    1.00        0.08        0.97         0.95
KAD-1000     0.48    1.00        0.54        0.83         0.96
KAD-10000    4.85    1.00        0.34        0.97         0.96
(b) CH-1H

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.46    1.00        0.77        0.87         0.94
CHO-10000    4.60    0.98        0.79        0.97         0.92
KOO-1000     0.46    1.00        0.61        0.96         0.94
KOO-10000    4.60    1.00        0.76        0.98         0.90
KAD-1000     0.46    1.00        0.77        0.96         0.94
KAD-10000    4.60    0.98        0.80        0.97         0.93
(c) WA-1D

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.34    1.00        0.96        0.70         0.86
CHO-10000    3.36    1.00        0.35        0.97         0.87
KOO-1000     0.34    1.00        0.97        0.85         0.87
KOO-10000    3.36    1.00        0.40        0.95         0.80
KAD-1000     0.34    1.00        0.54        0.74         0.88
KAD-10000    3.36    1.00        0.43        0.97         0.89
(d) CH-1D

Table 5.2: Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs - Precision (P) and Recall (R)

Figure 5.1b indicates that in the case of the CHIC-1H dataset, the proposed method outperforms BotGrep, with FScores above 0.95 in most cases as compared to BotGrep, which achieves FScores less than 0.6 in most cases. Table 5.2b indicates that the poor performance of BotGrep is due to its poor recall (< 0.5 in most cases).

1 Day Traces

Figure 5.1c indicates that in the case of the WASH-1D dataset, the FScores achieved by the proposed algorithm (about 0.95 in most cases) are comparable to those of BotGrep (about 0.9 in most cases). Table 5.2c indicates that the proposed algorithm outperforms BotGrep in recall (≥ 0.9 in all cases versus ≤ 0.8 in all cases). The precision of both algorithms is comparable, with BotGrep having slightly better precision (0.98-1) as compared to the proposed method (<= 0.98).

It can be observed from Figure 5.1d that the proposed method achieves higher FScores (above 0.8) than BotGrep (below 0.7) in most cases. Table 5.2d indicates that this is owing to the lower recall achieved by BotGrep (below 0.6 in most cases) as compared to the proposed algorithm (about 0.87 in most cases). In the case of precision, BotGrep achieves 1 in most cases, while the proposed method is comparable (0.95 or above) for the 10000 node botnets and lower (0.70-0.85) for the 1000 node botnets.

Summary

The performance on the Abilene datasets is summarised in Table 5.3. In the case of the Abilene Trace graphs the proposed method performs comparably to or better than BotGrep. Thus the limitations of Stability Optimization and Modularity Optimization in detecting structured P2P botnets (Chapter 4) have been overcome.

Botnet Size   Background Density   Background   Proposed Algorithm vs BotGrep
Small         Sparse               WA-1H        Comparable
Small         Sparse               CH-1H        Better
Small         Dense                WA-1D        Comparable
Small         Dense                CH-1D        Better
Large         Sparse               WA-1H        Comparable
Large         Sparse               CH-1H        Better
Large         Dense                WA-1D        Comparable
Large         Dense                CH-1D        Better

Table 5.3: Summary of the proposed method with reference to BotGrep on the Abilene traces (FScore)

In the next section the robustness of the algorithm under conditions of reduced visibility, as well as its performance on a sparse, harder to detect topology, is studied.

5.6 Robustness of the proposed algorithm

5.6.1 Robustness under conditions of partial visibility

In order to test the method’s robustness under conditions of partial visibility, 40% of the edges of the botnet graph are removed before the embedding stage. The number 40% is based on the findings in [44] and [77] that 60% of the botnet flows can be observed if monitors are deployed at each of the Tier 1 Internet Service Providers; these findings rest on the Storm Botnet IP address distributions in [44] and the simulations in [77]. The FScores achieved by the two algorithms are compared in Figure 5.2. The Precision and Recall values are compared in Table 5.4. The datasets adopted are otherwise the same as in Section 5.5.2. The metrics are evaluated as per Section 5.5.1.
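The edge removal step can be sketched as below. The text does not state how the removed edges are chosen, so uniform random sampling with a fixed seed is an assumption of this example, as are the function name drop_edges and the toy edge list.

```python
import random


def drop_edges(edges, fraction=0.4, seed=0):
    """Return the edges that remain visible after removing `fraction` of them.

    Edges are removed uniformly at random (an assumption of this sketch);
    the seed makes the experiment repeatable.
    """
    edges = list(edges)
    rng = random.Random(seed)
    n_keep = len(edges) - round(fraction * len(edges))
    return rng.sample(edges, n_keep)


# Hypothetical botnet edge list of 100 edges; 60 remain visible
botnet_edges = [(i, i + 1) for i in range(100)]
visible_edges = drop_edges(botnet_edges, fraction=0.4, seed=1)
```

The reduced edge list would then be embedded in the background graph in the same way as the complete botnet graph.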

1 Hour Traces

It can be observed from Figure 5.2a that the proposed method achieves FScores around 0.9, which is slightly inferior yet comparable to BotGrep, which achieves FScores > 0.95 as in the case of complete visibility. The slightly lower performance of the proposed method is because of its precision values (Table 5.4a), which are between 0.85 and 0.95 compared to the precision of 0.99 that BotGrep achieves in most cases, as well as its recall values (about 0.9 versus > 0.9 for BotGrep). The partial visibility thus affects the recall in this case.

Figure 5.2b indicates that for the CHIC-1H dataset, as in the case of complete visibility, the proposed method outperforms BotGrep, with FScores above 0.9 in most cases as compared to BotGrep, which achieves FScores less than 0.5 in most cases. Table 5.4b indicates that the poor performance of BotGrep is due to its poor recall (< 0.3 in most cases).

1 Day Traces

Figure 5.2c indicates that for the WASH-1D dataset, the performance is again similar to the case of complete visibility. The FScores achieved by the proposed algorithm (around 0.9 in most cases) are comparable to those of BotGrep (about 0.8 in most cases). Table 5.4c indicates that the proposed algorithm outperforms BotGrep in recall (> 0.8 in most cases versus about 0.7 in all cases), and that the recall of both algorithms has been affected. The precision of both algorithms is found to be comparable.

Figure 5.2: Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs under conditions of partial visibility - FScore. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.

It can be observed from Figure 5.2d that the proposed method achieves lower FScores (between 0.6 and 0.8) as compared to BotGrep (above 0.9 in most cases). Table 5.4d indicates that this is owing to the lower recall achieved by the proposed algorithm (below 0.8 in most cases) as compared to BotGrep (> 0.85 in most cases). In the case of precision, BotGrep achieves values of 0.94 or higher, whereas the proposed method achieves lower precision for the smaller botnets. This is expected, as the method depends on the density differences between the botnet and the background: CHIC-1D is the densest background graph among the datasets considered (Section 4.3.2), and the removal of 40% of the edges of the small botnets sharply reduces the density of the botnet, making it harder to detect. For the larger botnets the proposed method achieves precision > 0.9.

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.84    0.98        0.94        0.87         0.91
CHO-10000    8.39    0.99        0.93        0.93         0.90
KOO-1000     0.84    0.99        0.95        0.85         0.92
KOO-10000    8.39    0.99        0.92        0.95         0.88
KAD-1000     0.84    0.99        0.93        0.89         0.88
KAD-10000    8.39    0.99        0.93        0.93         0.90
(a) WA-1H

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.48    1.00        0.13        0.83         0.94
CHO-10000    4.85    1.00        0.01        0.96         0.93
KOO-1000     0.48    1.00        0.24        0.93         0.95
KOO-10000    4.85    1.00        0.05        0.96         0.93
KAD-1000     0.48    0.00        0.00        0.92         0.94
KAD-10000    4.85    1.00        0.01        0.96         0.93
(b) CH-1H

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.46    1.00        0.69        0.96         0.80
CHO-10000    4.60    0.95        0.71        0.99         0.77
KOO-1000     0.46    1.00        0.69        0.97         0.85
KOO-10000    4.60    0.95        0.68        0.99         0.83
KAD-1000     0.46    0.97        0.69        0.85         0.81
KAD-10000    4.60    0.98        0.72        0.98         0.88
(c) WA-1D

Botnet       %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
CHO-1000     0.34    0.94        0.87        0.76         0.62
CHO-10000    3.36    0.96        0.89        0.98         0.78
KOO-1000     0.34    0.97        0.88        0.70         0.74
KOO-10000    3.36    0.97        0.84        0.94         0.64
KAD-1000     0.34    1.00        0.25        0.62         0.51
KAD-10000    3.36    0.97        0.89        0.94         0.64
(d) CH-1D

Table 5.4: Performance comparison of the proposed method and BotGrep on Abilene Trace Graphs under conditions of partial visibility - Precision (P) and Recall (R)


Summary

Table 5.5 summarises the performance of the algorithm under conditions of partial visibility in the Abilene Trace Graphs. Thus in cases of partial visibility the algorithm is still reasonably robust, but as the background becomes denser the performance suffers.

Botnet Size   Background Density   Background   Proposed Algorithm vs BotGrep
Small         Sparse               WA-1H        Comparable
Small         Sparse               CH-1H        Better
Small         Dense                WA-1D        Comparable
Small         Dense                CH-1D        Worse
Large         Sparse               WA-1H        Comparable
Large         Sparse               CH-1H        Better
Large         Dense                WA-1D        Comparable
Large         Dense                CH-1D        Worse

Table 5.5: Performance (FScore) Summary of the proposed method with reference to BotGrep on the Abilene traces under conditions of partial visibility

5.6.2 Performance on the LEET-Chord Topology

The LEET-Chord structured P2P topology was proposed by Jelasity and Bilicki [30] to demonstrate that there are techniques to make structured P2P traffic harder to detect. The topology is a modification of the CHORD topology, consisting of clusters of CHORD graphs of size log2|V| connected to each other, with the restriction that only one link (the long range link) exists between any two clusters. They also propose a method of clustering so as to minimize the number of different networks touched by the clusters, highlighting the limited effectiveness of local approaches. This topology is significantly sparser than the other P2P topologies considered, and should therefore be harder to detect. The performance of the proposed method is evaluated on the same backgrounds by embedding LEET-Chord graphs of 1000 and 10000 nodes in the Abilene Background Graphs WASH-1H, CHIC-1H, WASH-1D and CHIC-1D. The metrics are evaluated as per Section 5.5.1.

From Figure 5.3 and Table 5.6 it can be observed that in all datasets the proposed method is able to detect the LEET-Chord botnet with an FScore above 0.85, with recall above 0.85 and precision over 0.9 in most cases. This is in sharp contrast to BotGrep. When using the default parameters according to the original BotGrep paper, very low accuracies were obtained. However, in the original paper very high accuracies (>90%) are reported for the LEET-Chord topology. It may be possible to tweak the parameters in order to obtain better results on the datasets considered.

Figure 5.3: Performance comparison of the proposed method and BotGrep on LEET-Chord graphs embedded in Abilene Trace Graphs - FScore. Panels: (a) WASH-1H, (b) CHIC-1H, (c) WASH-1D, (d) CHIC-1D.

Botnet            %Bots   BotGrep P   BotGrep R   Proposed P   Proposed R
WA-1H-LC-1000     0.84    0.29        0.86        0.82         0.91
WA-1H-LC-10000    8.39    0.75        0.49        0.89         0.94
CH-1H-LC-1000     0.48    0.00        0.00        0.91         0.92
CH-1H-LC-10000    4.85    0.77        0.05        0.94         0.95
WA-1D-LC-1000     0.46    0.51        0.13        0.98         0.86
WA-1D-LC-10000    4.60    1.00        0.02        0.98         0.88
CH-1D-LC-1000     0.34    0.15        0.22        0.98         0.79
CH-1D-LC-10000    3.36    0.89        0.36        0.95         0.87

Table 5.6: Performance comparison of the proposed method and BotGrep on LEET-Chord graphs embedded in Abilene Trace Graphs - Precision (P) and Recall (R)

5.6.3 Efficiency and Scalability

In this section the runtimes of the proposed method and BotGrep are compared. The runtime of the proposed method is dominated by the first call to GreedyOptimizationQw−log−v (Algorithm 3), which takes O(|E|) time as the number of iterations t is typically not a function of the size of the graph.

In order to test the running time of the proposed method and compare it with BotGrep, we consider the Abilene background datasets and embed a CHORD graph of size 1000 in all cases to profile the running time of the method. All experiments are done on a machine with a 32-core AMD Opteron 2.4 GHz processor and 128 GB of RAM; only a single core is used. In Figure 5.4, the runtime is plotted against the size of the graph. It can be clearly observed from the figure that the proposed method scales significantly better than BotGrep, being about 350 times faster on graphs with more than 10 million edges.

Figure 5.4: Runtime comparison of proposed method and BotGrep: Abilene 1 Day Traces and 1000 Node CHORD Botnet

Performance on the CAIDA Datasets

In the previous section, it is shown that the proposed method outperforms BotGrep in terms of runtime. In this section the method is tested on the CAIDA-CH and CAIDA-SJ Datasets of 2.7 million and 8.2 million nodes respectively (Section 4.3.2). CHORD, KADEMLIA, KOORDE and LEET-Chord graphs of 10000 and 100000 nodes are considered for the following experiments. The Precision, Recall and FScores (computed as per Section 5.5.1) are shown in Table 5.7. The X-Means algorithm used by BotGrep in the prefiltering step is not able to scale to handle these graphs on the tested machine. In the case of the proposed method, the time taken to process the largest graph (56 million directed edges) is less than 20 minutes on a single core.

Table 5.7 indicates that the proposed method achieves good FScores (> 0.9 in most cases) in the case of the CAIDA-CH trace graph, and FScores of > 0.85 for the 100000 node botnets in the case of the CAIDA-SJ trace graph. The lower accuracy observed in the case of the 10000 node botnets in the latter (FScores as low as 0.58) is due to the very small percentage of bots (about 0.1% of the dataset), which affects the precision of the algorithm (about 0.5 in the worst cases), but the number of false positives is still very small compared to the size of the whole graph (0.1%).

CAIDA CHICAGO Graphs
Dataset                 %Bots   Precision   Recall   FScore
CAIDA-CH-CHO-10000      0.37    0.94        0.93     0.93
CAIDA-CH-CHO-100000     3.7     1           0.92     0.96
CAIDA-CH-KOO-10000      0.37    0.9         0.95     0.92
CAIDA-CH-KOO-100000     3.7     0.97        0.94     0.96
CAIDA-CH-KAD-10000      0.37    0.62        0.94     0.75
CAIDA-CH-KAD-100000     3.7     0.95        0.93     0.94
CAIDA-CH-LC-10000       0.37    1           0.95     0.97
CAIDA-CH-LC-100000      3.7     1           0.93     0.96

CAIDA SANJOSE Graphs
Dataset                 %Bots   Precision   Recall   FScore
CAIDA-SJ-CHO-10000      0.12    0.56        0.95     0.7
CAIDA-SJ-CHO-100000     1.2     0.97        0.95     0.96
CAIDA-SJ-KOO-10000      0.12    0.42        0.94     0.58
CAIDA-SJ-KOO-100000     1.2     0.83        0.93     0.88
CAIDA-SJ-KAD-10000      0.12    0.83        0.93     0.88
CAIDA-SJ-KAD-100000     1.2     1           0.92     0.96
CAIDA-SJ-LC-10000       0.12    0.72        0.96     0.82
CAIDA-SJ-LC-100000      1.2     0.97        0.94     0.95

Table 5.7: Performance of the Proposed Method on CAIDA Datasets - Precision, Recall and FScore

5.7 Conclusion

In this chapter a robust and efficient method to detect nodes that are a part of a structured P2P botnet, given a traffic graph, was proposed. The proposed algorithm exploits the small but homogeneous communities obtained by a greedy optimization of the function Qw−log−v, proposed and successfully applied in Protein Interaction Networks by Van Laarhoven and Marchiori [73]. The greedy optimization is shown to produce more homogeneous communities than the optimization of Qw−log−v using the Louvain method, and is thus chosen for use in the proposed algorithm.

The differences in the topological properties, namely assortativity and density, of structured P2P botnet communities and benign communities were discussed. In order to exploit these differences, a novel measure, the mean regular degree mreg, is proposed, which captures the assortativity and the density of a graph, and the properties of mreg were studied.

The proposed method is a two stage algorithm that uses greedy optimization of Qw−log−v followed by a filtering of communities with low values of mreg in the first stage. The second stage involves greedy optimization of Modularity and a second filtering of communities in order to detect the set of nodes likely to be structured P2P bots.

The proposed method is extensively validated, and found to be comparable in performance with BotGrep. It is found to be reasonably robust to conditions of partial visibility, barring the case of very dense background graphs, and achieves good performance on the harder to detect LEET-Chord topology.

The runtime of the algorithm is found to be about 300 times lower than that of BotGrep on graphs of tens of millions of edges. The algorithm is able to handle an 8 million node and 50 million edge graph in roughly 20 minutes on a single core.

Chapter 6

Summary and Conclusions

6.1 Summary of Contributions

The major contributions and conclusions of this thesis are summarised below.

6.1.1 Efficiency Comparison of Community Detection Algorithms

This thesis surveys the popular community detection algorithms proposed in literature. Al-

gorithms with low theoretical time complexities such as Label Propagation [54], Infomap

[56] and Louvain Method [3] have been implemented and compared on large LFR benchmark

graphs to study their efficiency.

6.1.2 Detection of Structured P2P Botnets using the Louvain Method

From the efficiency comparison of Community Detection Algorithms, the Louvain method is selected to detect structured P2P botnets, as it was found to be the fastest.

The Louvain method allows multiple objective functions to be optimized in a multi-level greedy process. The method originally used Modularity as the objective function. This has been applied in this thesis to detect structured P2P botnets on datasets synthetically generated as per [44]. The dataset generation process involves generating topology graphs of CHORD [65], KOORDE [33] and KADEMLIA [42] and embedding them in a background graph constructed from real-world network traces. The traces were from two different sources. The first set of traces includes NetFlow data captured at core routers of the Abilene ISP. The second set comprises packet traces captured at an Internet Point of Presence (PoP). It is found that the performance of Modularity maximization using the Louvain method is comparable to that of BotGrep in the cases of sparse background graphs. In the cases where the density of the background is high, this method resulted in a large number of benign nodes being detected as bots. This is due to the resolution limit of modularity and its preference for large loosely connected communities.

In order to overcome this limitation the Louvain method is then used to optimize Stability, a multiresolution objective function proposed by Lambiotte et al. [36]. This requires the setting of a resolution parameter t to control the size of the communities. This is then applied to detect structured P2P botnets. Although for certain values of t the performance is comparable to that of BotGrep for small botnets, the method had to be run several times at different values of t in order to detect botnets of all sizes, increasing the computational complexity.

6.1.3 Robust and Efficient method to detect Structured P2P Botnets

It is found that neither Modularity Optimization nor Stability Optimization is robust or general. However the optimization of Stability at values of t < 1 is found to result in small but homogeneous communities. In order to overcome the limitations of setting the parameter t, a third objective function Qw−log−v, proposed by Van Laarhoven and Marchiori [73], is considered. This objective function has previously been used successfully in the case of Protein Interaction Networks, and is used in this thesis to detect structured P2P botnets for the first time. It is also shown that a single-step greedy optimization of Qw−log−v results in more homogeneous communities than the optimization of the same function using a multi-level method such as Louvain.

The differences in the topological properties, namely assortativity and density, of structured P2P botnet communities and benign communities were discussed. In order to exploit these differences, a novel measure, the mean regular degree mreg, is proposed, which captures the assortativity and the density of a graph, and the properties of mreg were studied. The proposed algorithm combines the use of greedy community detection by optimizing Qw−log−v and community filtering using mreg in order to identify nodes that are a part of a structured P2P botnet.

Accuracy

The algorithm is tested extensively on a large number of datasets and found to be comparable

in performance with BotGrep in most cases. In the case of the Abilene datasets, the proposed

method achieves FScores of about 0.9 in the majority of the cases. The proposed method

is reasonably robust under conditions of partial visibility: with sparser background

graphs it achieves FScores around 0.9 and is comparable to BotGrep. Performance

degrades when datasets with denser backgrounds are considered, but it still achieves FScores of

0.7. The proposed method achieves FScores of around 0.9 in the case of the sparser and harder

to detect LEET-Chord botnet topology [30].
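
For reference, the sketch below shows how a score of this kind can be computed from a set of detected nodes and a set of known bot nodes. It assumes that FScore denotes the balanced F1 measure (the harmonic mean of precision and recall) over bot nodes; the function name and the sample sets are hypothetical.

```python
def f_score(detected, actual):
    """Balanced F1 score of a detected node set against the true bot set."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)  # detected nodes that are bots
    recall = true_positives / len(actual)       # bots that were detected
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 of 4 detected nodes are bots, 3 of 5 bots are found.
EXAMPLE_SCORE = f_score({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```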

Efficiency

The runtime of the algorithm is found to be 300 times lower than that of BotGrep on graphs of tens of

millions of edges. The algorithm is able to handle an 8 million node and 50 million edge graph

in roughly 20 minutes on a single core, which is less than the duration of the captured

traces, indicating that the method can be applied in real time.

6.2 Directions for Future Work

The accuracy of the Louvain method may be increased by considering packet-level features

in addition to the graph topology. These features need to be captured only for the hosts not discarded

at the end of Stage 1 of the algorithm, which may be feasible even for backbone level traffic.

These features can then be used to weight the graph edges based on feature vector similarity

functions.
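
One possible realisation of such a weighting is sketched below using cosine similarity. This is only an assumption about how the idea could be implemented: the choice of similarity function, the feature vectors and all names are hypothetical and have not been evaluated in this thesis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weight_edges(edges, features):
    """Attach a similarity weight to every (u, v) edge.

    features: dict mapping host -> feature vector (hypothetical
    per-host packet-level statistics for the retained hosts).
    """
    return [
        (u, v, cosine_similarity(features[u], features[v]))
        for u, v in edges
    ]


# Hypothetical feature vectors for three retained hosts.
FEATURES = {"h1": [1.0, 0.0], "h2": [1.0, 0.0], "h3": [0.0, 1.0]}
WEIGHTED = weight_edges([("h1", "h2"), ("h1", "h3")], FEATURES)
```

Edges between hosts with similar traffic features receive weights close to 1, so a weighted community detection step would tend to keep such hosts together.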

The scalability of the Louvain method can be further improved by implementing it in a

distributed environment, allowing it to handle graphs with very high memory requirements.

The field of botnet detection in general can benefit greatly from the rapid improvements tak-

ing place in the field of Community Detection Algorithms. Dynamic Community Detection

algorithms can incorporate temporal features of the traffic graph to perform incremental detec-

tion/tracking of botnets. Overlapping Community Detection Algorithms can be used to handle

cases where a node is infected with more than one bot. Local Community Identification Algo-

rithms can be used to detect other members of the botnet given a few seed nodes.

References

[1] Paul Barford and Vinod Yegneswaran. An inside look at Botnets. In Mihai Christodor-

escu, Somesh Jha, Douglas Maughan, Dawn Song, and Cliff Wang, editors, Malware

Detection, volume 27 of Advances in Information Security, pages 171–191. Springer US,

Boston, MA, 2007.

[2] James R. Binkley and Suresh Singh. An algorithm for anomaly-based botnet detection.

page 7, July 2006.

[3] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast

unfolding of communities in large networks. Journal of Statistical Mechanics: Theory

and Experiment, 2008(10):6, March 2008.

[4] Hyunsang Choi, Heejo Lee, and Hyogon Kim. BotGAD. In COMSWARE ’09 Proceed-

ings of the Fourth International ICST Conference on COMmunication System softWAre

and middlewaRE, page 1, New York, New York, USA, June 2009. ACM Press.

[5] Benoit Claise. Cisco systems netflow services export version 9. 2004.

[6] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure

in very large networks. Physical review E, 70(6):066111, 2004.

[7] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman. Power-law distributions

in empirical data. SIAM review, 51(4):661–703, 2009.

[8] Michele Coscia, Fosca Giannotti, and Dino Pedreschi. A Classification for Community

Discovery Methods in Complex Networks. 2012.

[9] Baris Coskun, Sven Dietrich, and Nasir Memon. Friends of an enemy. In ACSAC ’10 Pro-

ceedings of the 26th Annual Computer Security Applications Annual Conference, page

131, New York, New York, USA, December 2010. ACM Press.

[10] David Dagon. Botnet detection and response. In OARC workshop, volume 2005, 2005.

[11] David Dagon, Guofei Gu, Christopher P. Lee, and Wenke Lee. A Taxonomy of Botnet

Structures. In ACSAC ’07 Proceedings of the 23rd Annual Computer Security Applica-

tions Annual Conference, pages 325–339. IEEE, December 2007.

[12] Inc. Damballa. Household Botnet Infections, 2012.

[13] Carlton R. Davis, Stephen Neville, Jose M. Fernandez, Jean-Marc Robert, and John

Mchugh. Structured Peer-to-Peer Overlay Networks: Ideal Botnets Command and Con-

trol Infrastructures? In Sushil Jajodia and Javier Lopez, editors, ESORICS ’08 Pro-

ceedings of the 13th European Symposium on Research in Computer Security: Computer

Security, volume 5283 of Lecture Notes in Computer Science, pages 461–480, Berlin,

Heidelberg, October 2008. Springer Berlin Heidelberg.

[14] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigen-

vectors a multilevel approach. Pattern Analysis and Machine Intelligence, IEEE Trans-

actions on, 29(11):1944–1957, 2007.

[15] Maryam Feily, Alireza Shahrestani, and Sureswaran Ramadass. A Survey of Botnet and

Botnet Detection. In ICESIST ’09 Third International Conference on Emerging Security

Information, Systems and Technologies, pages 268–273. IEEE, June 2009.

[16] Charles M Fiduccia and Robert M Mattheyses. A linear-time heuristic for improving

network partitions. In Design Automation, 1982. 19th Conference on, pages 175–181.

IEEE, 1982.

[17] Lester R Ford and Delbert R Fulkerson. Maximal flow through a network. Canadian

Journal of Mathematics, 8(3):399–404, 1956.

[18] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174,

February 2010.

[19] Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Pro-

ceedings of the National Academy of Sciences, 104(1):36–41, 2007.

[20] Jerome Francois, Shaonan Wang, Radu State, and Thomas Engel. BotTrack: tracking

botnets using NetFlow and PageRank. In NETWORKING ’11 Proceedings of the 10th

international IFIP TC 6 conference on Networking, pages 1–14, May 2011.

[21] Frederic Giroire, Jaideep Chandrashekar, Nina Taft, Eve Schooler, and Dina Papagian-

naki. Exploiting Temporal Persistence to Detect Covert Botnet Channels. In RAID ’09

Proceedings of the 12th International Symposium on Recent Advances in Intrusion De-

tection, pages 326 – 345, 2009.

[22] M. Girvan and M. E. J. Newman. Community structure in social and biological net-

works. Proceedings of the National Academy of Sciences of the United States of America,

99(12):7821–6, June 2002.

[23] Sergey Golovanov and Igor Soumenkov. TDL4 – Top Bot, 2011.

[24] Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. BotMiner: clustering analy-

sis of network traffic for protocol- and structure-independent botnet detection. In USENIX

Security ’08 Proceedings of the 17th USENIX Security Symposium, pages 139–154, July

2008.

[25] Guofei Gu, Phillip Porras, Vinod Yegneswaran, Martin Fong, and Wenke Lee. BotH-

unter: detecting malware infection through IDS-driven dialog correlation. In USENIX

Security ’07 Proceedings of the 16th USENIX Security Symposium, page 12, August

2007.

[26] Guofei Gu, Junjie Zhang, and Wenke Lee. BotSniffer: Detecting botnet command and

control channels in network traffic. In NDSS ’08 Proceedings of the 15th Annual Network

and Distributed System Security Symposium, page 18. Citeseer, 2008.

[27] Nicholas Ianelli and Aaron Hackworth. Botnets as a vehicle for online crime. Technical

report, CERT Coordination Center, 2005.

[28] Marios Iliofotou, Brian Gallagher, Tina Eliassi-Rad, Guowu Xie, and Michalis Falout-

sos. Profiling-By-Association. In Co-NEXT ’10 Proceedings of the 6th International

COnference on emerging Networking EXperiments and Technologies, page 1, New York,

New York, USA, November 2010. ACM Press.

[29] Padmini Jaikumar and Avinash C. Kak. A graph-theoretic framework for isolating botnets

in a network. Security and Communication Networks, pages n/a–n/a, February 2012.

[30] Mark Jelasity and Vilmos Bilicki. Towards automated detection of peer-to-peer botnets:

on the limits of local approaches. In LEET ’09 Proceedings of the 2nd USENIX confer-

ence on Large-scale exploits and emergent threats: botnets, spyware, worms, and more,

page 3, April 2009.

[31] Jham3. File:Network Community Structure.svg, 2011.

[32] Hongling Jiang and Xiuli Shao. Detecting P2P botnets by discovering flow dependency

in C&C traffic. Peer-to-Peer Networking and Applications, June 2012.

[33] M Frans Kaashoek and David R Karger. Koorde: A simple degree-optimal distributed

hash table. In Peer-to-Peer Systems II, pages 98–107. Springer, 2003.

[34] George Karypis and Vipin Kumar. Metis-unstructured graph partitioning and sparse ma-

trix ordering system, version 2.0. 1995.

[35] BW Kernighan and S Lin. An efficient heuristic procedure for partitioning graphs. Bell

system technical journal, 1970.

[36] R Lambiotte, J. C. Delvenne, and M Barahona. Laplacian Dynamics and Multiscale

Modular Structure in Networks. arXiv preprint arXiv:0812.1770, pages 1–29, December

2008.

[37] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for

testing community detection algorithms. page 6, May 2008.

[38] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of

linear differential and integral operators. United States Governm. Press Office, 1950.

[39] Wei Lu, Mahbod Tavallaee, and Ali A. Ghorbani. Automatic discovery of botnet com-

munities on large-scale communication networks. In ASIACCS ’09 Proceedings of the

4th International Symposium on Information, Computer, and Communications Security,

page 1, New York, New York, USA, March 2009. ACM Press.

[40] James MacQueen et al. Some methods for classification and analysis of multivariate

observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics

and probability, volume 1, page 14. California, USA, 1967.

[41] Mohammad M. Masud, Tahseen Al-khateeb, Latifur Khan, Bhavani Thuraisingham, and

Kevin W. Hamlen. Flow-based identification of botnet traffic by mining multiple log

files. In 2008 First International Conference on Distributed Framework and Applica-

tions, pages 200–206. IEEE, October 2008.

[42] Petar Maymounkov and David Mazieres. Kademlia: A peer-to-peer information system

based on the xor metric. In Peer-to-Peer Systems, pages 53–65. Springer, 2002.

[43] Hardy Michael. File:De bruijn graph-for binary sequence of order 4.svg, 2006.

[44] Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, and Nikita Borisov.

BotGrep: finding P2P bots with structured graph analysis. In USENIX Security’10 Pro-

ceedings of the 19th USENIX Security Symposium, pages 7–7, August 2010.

[45] Mark EJ Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.

[46] Mark EJ Newman. A measure of betweenness centrality based on random walks. Social

networks, 27(1):39–54, 2005.

[47] Mark EJ Newman. Finding community structure in networks using the eigenvectors of

matrices. Physical review E, 74(3):036104, 2006.

[48] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the

National Academy of Sciences, 103(23):8577–8582, 2006.

[49] Pascal Pons and Matthieu Latapy. Computing communities in large networks using

random walks. In Computer and Information Sciences-ISCIS 2005, pages 284–293.

Springer, 2005.

[50] Phillip Porras, Hassen Saidi, and Vinod Yegneswaran. A Multi-perspective Analysis of

the Storm (Peacomm) Worm, 2007.

[51] Phillip Porras, Hassen Saïdi, and Vinod Yegneswaran. A foray into Conficker’s logic

and rendezvous points. In 2nd Usenix Workshop on Large-Scale Exploits and Emergent

Threats (LEET’09), page 7, April 2009.

[52] Bill Pringlemeir. File:Dht example.png, 2007.

[53] Filippo Radicchi, Claudio Castellano, Federico Cecconi, Vittorio Loreto, and Domenico

Parisi. Defining and identifying communities in networks. Proceedings of the National

Academy of Sciences of the United States of America, 101(9):2658–2663, 2004.

[54] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time al-

gorithm to detect community structures in large-scale networks. Physical Review E,

76(3):036106, 2007.

[55] Matei Ripeanu. Peer-to-peer architecture case study: Gnutella network. In Peer-to-Peer

Computing, 2001. Proceedings. First International Conference on, pages 99–100. IEEE,

2001.

[56] Martin Rosvall, Daniel Axelsson, and Carl T Bergstrom. The map equation. The Euro-

pean Physical Journal Special Topics, 178(1):13–23, 2009.

[57] Venu Satuluri and Srinivasan Parthasarathy. Scalable graph clustering using stochastic

flows: applications to community discovery. In Proceedings of the 15th ACM SIGKDD

international conference on Knowledge discovery and data mining, pages 737–746.

ACM, 2009.

[58] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, August

2007.

[59] Michael T Schaub, Jean-Charles Delvenne, Sophia N Yaliraki, and Mauricio Barahona.

Markov dynamics as a zooming lens for multiscale community detection: non clique-like

communities and the field-of-view limit. PloS one, 7(2):e32210, 2012.

[60] Antoine Schonewille and Dirk-Jan van Helmond. The domain name service as an IDS.

Research Project for the Master System-and Network Engineering at the University of

Amsterdam, 2006.

[61] Xuemin Shen, Heather Yu, John Buford, and Mursalin Akon. Handbook of peer-to-peer

networking, volume 1. Springer Heidelberg, 2010.

[62] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Anal-

ysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.

[63] Sergio S.C. Silva, Rodrigo M.P. Silva, Raquel C.G. Pinto, and Ronaldo M. Salles. Bot-

nets: A survey. Computer Networks, October 2012.

[64] Stefan Ortloff. FAQ: Disabling the new Hlux/Kelihos Botnet, 2013.

[65] Ion Stoica, Robert Morris, David Karger, M Frans Kaashoek, and Hari Balakrishnan.

Chord: A scalable peer-to-peer lookup service for internet applications. In ACM SIG-

COMM Computer Communication Review, volume 31, pages 149–160. ACM, 2001.

[66] W. Strayer, Robert Walsh, Carl Livadas, and David Lapsley. Detecting Botnets with Tight

Command and Control. In Proceedings. 2006 31st IEEE Conference on Local Computer

Networks, pages 195–202. IEEE, November 2006.

[67] Inc. Symantec. Internet Security Threat Report, 2013.

[68] Seth Terashima. File:Chord network.png, 2010.

[69] Gergely Tibely and Janos Kertesz. On the equivalence of the label propagation method

of community detection and a potts model approach. Physica A: Statistical Mechanics

and its Applications, 387(19):4982–4984, 2008.

[70] Vincent A Traag, Paul Van Dooren, and Y Nesterov. Narrow scope for resolution-limit-

free community detection. Physical Review E, 84(1):016114, 2011.

[71] Stijn Marinus van Dongen. Graph clustering by flow simulation. 2000.

[72] Twan van Laarhoven and Elena Marchiori. Robust community detection methods with

resolution parameter for complex detection in protein protein interaction networks. In

Pattern Recognition in Bioinformatics, pages 1–13. Springer, 2012.

[73] Twan van Laarhoven and Elena Marchiori. Graph clustering with local search optimiza-

tion: the resolution bias of the objective function matters most. Physical review. E,

Statistical, nonlinear, and soft matter physics, 87(1):012812, January 2013.

[74] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing,

17(4):395–416, 2007.

[75] C Walsworth, E Aben, K Claffy, and D Andersen. The ucsd caida anonymized 2011

internet traces, 2011.

[76] Ping Wang, Sherri Sparks, and Cliff C Zou. An advanced hybrid peer-to-peer botnet.

Dependable and Secure Computing, IEEE Transactions on, 7(2):113–127, 2010.

[77] Guanhua Yan, Stephan Eidenbenz, Sunil Thulasidasan, Pallab Datta, and Venkatesh

Ramaswamy. Criticality analysis of Internet infrastructure. Computer Networks,

54(7):1169–1182, May 2010.

[78] Ting-Fang Yen and Michael K. Reiter. Are Your Hosts Trading or Plotting? Telling

P2P File-Sharing and Bots Apart. In ICDCS ’10 IEEE 30th International Conference on

Distributed Computing Systems, pages 241–252. IEEE, June 2010.

[79] Ting-Fang Yen and Michael K. Reiter. Revisiting botnet models and their implications

for takedown strategies. In Pierpaolo Degano and Joshua D. Guttman, editors, POST’12

Proceedings of the First international conference on Principles of Security and Trust,

volume 7215 of Lecture Notes in Computer Science, pages 249–268, Berlin, Heidelberg,

March 2012. Springer Berlin Heidelberg.

[80] Hossein Rouhani Zeidanloo, M. Safari, and Mazdak Zamani. A taxonomy of Botnet

detection techniques. In ICCSIT ’10 3rd International Conference on Computer Science

and Information Technology, pages 158–162. IEEE, July 2010.

[81] Junjie Zhang, Roberto Perdisci, Wenke Lee, Unum Sarfraz, and Xiapu Luo. Detecting

stealthy P2P botnets using statistical traffic fingerprints. In 2011 IEEE/IFIP 41st Inter-

national Conference on Dependable Systems & Networks (DSN), pages 121–132. IEEE,

June 2011.