
Streaming Temporal Graphs

by

Eric L. Goodman

B.S. Computer Science and Math, Brigham Young University, 2003

M.S. Computer Science, Brigham Young University, 2005

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2019

This thesis entitled: Streaming Temporal Graphs written by Eric L. Goodman

has been approved for the Department of Computer Science

Dr. Dirk Grunwald

Dr. Daniel Massey

Dr. Eric Keller

Dr. Qin Lv

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.


Goodman, Eric L. (Ph.D., Computer Science)

Streaming Temporal Graphs

Thesis directed by Dr. Dirk Grunwald

We provide a domain specific language called the Streaming Analytics Language (SAL) to write concise

but expressive analyses of streaming temporal graphs. We target problems where the data comes as an

infinite stream and where the volume is prohibitive, requiring a single pass over the data and tight spatial

and temporal complexity constraints. Also, each item in the stream can be thought of as an edge in a graph,

and each edge has an associated timestamp and duration.

Cyber security data is a real-world example of a streaming temporal graph: machines communicate with each other within a network, forming a streaming sequence of edges with temporal information.

As such, we elucidate the value of SAL by applying it to a large range of cyber-related problems. With

a combination of vertex-centric computations that create features per vertex, and subgraph matching to

find communication patterns of interest, we cover a wide spectrum of important cyber use cases. As an

example, we discuss Verizon’s Data Breach Investigations Report, and show how SAL can be used to capture

most of the nine different categories of cyber breaches. Also, we apply SAL to discovering botnet activity

within network traffic in 13 different scenarios, with an average area under the curve (AUC) of the receiver

operating characteristic (ROC) of 0.87.

Besides SAL as a language, as another contribution we present an implementation we call the Stream-

ing Analytics Machine (SAM). With SAM, we can run SAL programs in parallel on a cluster, achieving

rates of a million netflows per second, and scaling to 128 nodes or 2560 cores. We compare SAM to another

streaming framework, Apache Flink, and find that Flink cannot scale past 32 nodes for the problem of finding

triangles (a subgraph of three interconnected nodes) within the streaming graph. Also, SAM excels when

the subgraphs are frequent, continuing to find the expected number of subgraphs, while Flink performance

degrades and under-reports. Together, SAL and SAM provide an expressive and scalable infrastructure for

performing analyses on streaming temporal graphs.


Acknowledgements

A big thanks goes to my advisor, Dirk Grunwald, who helped me hone this dissertation over

a lengthy period of time. The path I took was shaped by him and allowed me to create a much

better product and a more relevant research contribution.

I also thank the committee members for their time and input. This includes not only the

current committee of Daniel Massey, Eric Keller, Sangtae Ha, and Qin Lv, but also former members

John Black and Sriram Sankaranarayana.

Sandia National Laboratories funded my doctoral degree through their university programs,

for which I am grateful. I especially appreciate the unwavering support and patience from my

managers, Judy Spomer and Michael Haass.

My parents were instrumental in fostering my curiosity and desire to learn, which led me

down this path of educational advancement. I still haven’t invented the transmat beam, but that

will be my next project.


Contents

Chapter

1 Introduction 1

2 The Streaming Analytics Language 5

2.1 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2 Static Vertex Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3 Temporal Subgraph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 SAL as a Query Language for Cyber Problems 29

3.1 Crimeware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Cyber-Espionage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Denial of Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.4 Point of Sale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.5 Privilege Misuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.6 Web Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 Related Work 41

4.1 Vertex-centric Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2 Graph Domain Specific Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.1 Green-Marl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


4.2.2 Ligra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.3 Galois . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.4 Gemini . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.5 Grazelle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.6 GraphIt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.3 Linear Algebra-based Graph Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.3.1 Combinatorial BLAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.3.2 Knowledge Discovery Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.3.3 Pegasus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.3.4 GraphMAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.4 Graph APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.5 GPU Graph Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.6 SPARQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.7 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.8 Data Stream Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.9 Network Telemetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Implementation 63

5.1 Partitioning the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.2 Vertex-centric Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.3 Subgraph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.3.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6 Vertex Centric Computations 83

6.1 Classifier Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.2 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86


6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

7 Temporal Subgraph Matching 92

7.1 SAM Parameter Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

7.2 Apache Flink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

7.3 Comparison Results: SAM vs Flink . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

7.4 Simulating other Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

8 Conclusions 106

Bibliography 108

Appendix

A SAL and Flink Programs 125

A.1 Machine Learning Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

A.2 Apache Flink Triangle Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 126


Tables

Table

4.1 A table of vertex-centric frameworks and various attributes. . . . . . . . . . . . . . . 46

5.1 This table shows the conciseness of the SAL language. For the Disclosure pipeline

(discussed in Section 2.1), the machine learning pipeline (discussed in Section 6.1),

and the temporal triangle query (discussed in Chapter 7) the amount of code needed

is reduced between 10-25 times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2 A listing of the major classes in SAM and how they map to language features in SAL.

We also make the distinction between consumers, producers, and feature creators.

Producers all share a parallelFeed method that sends the data to registered con-

sumers, which is how parallelization is achieved. Each of the consumers receives data

from producers. If a consumer is also a feature creator, it performs a computation on

the received data, generating a feature, which is then added to a thread-safe feature

map. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.1 A more detailed look at the AUC results. The second column is the AUC of the

ROC after training on the first half of the data and then testing on the second half.

The third column is reversed: trained on the second half and tested on the first half. 87


Figures

Figure

1.1 This dissertation is an intersection of four different research areas and presents a

domain specific language (DSL) to express streaming temporal graph and machine

learning queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 This figure represents a common pattern for vertex-centric computations in SAL

programs. A stream of edges is partitioned into N vertices that are then operated

on by o operators that compute over a sliding window of length n items. . . . . . . 6

2.2 Illustration showing the subgraph nature of a Watering Hole Attack. The target

machine accesses a popular site, Bait. Shortly after accessing Bait, Target begins

communicating to a new machine, the Controller . . . . . . . . . . . . . . . . . . . 10

2.3 This figure demonstrates the use of the STREAM BY statement to logically divide

a single stream into multiple streams. The STREAM BY statement is followed

by a FOREACH GENERATE statement that creates estimates on the top two

most frequent destination ports and their frequencies. The FILTER statement uses

this information to downselect to IPs where the top two ports account for 90% of

the traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4 This figure demonstrates the use of the TRANSFORM and COLLAPSE BY

statements to define parts of the Disclosure pipeline. . . . . . . . . . . . . . . . . . 19


3.1 A subgraph showing the flow of messages for data exfiltration. A controller sends

a small control message to an infected machine, which then sends large amounts of

data to a drop box. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2 A subgraph showing an XSS attack. . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1 SPARQL Query number two from the Lehigh University Benchmark. . . . . . . . . . 57

5.1 Architecture of the system: data comes in over a socket layer and is then distributed

via ZeroMQ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.2 The push pull object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.3 Traditional Compressed Sparse Row (CSR) data structure. . . . . . . . . . . . . . . 72

5.4 Modified Compressed Sparse Row (CSR) data structure. . . . . . . . . . . . . . . . . 72

5.5 An overview of the algorithm employed to detect subgraphs of interest. Asyn-

chronous threads are launched to consume each edge. The consume threads then

perform five different operations. Another set of threads are pulling requests from

other nodes for edges within the Request Communicator. Finally, one more set of

threads pull edges from other nodes in reply to edge requests. . . . . . . . . . . . . . 75

6.1 This graphic shows the AUC of the ROC curve for each of the CTU scenarios. We

first train on the first half of the data and test on the second half, and then switch. . 87

6.2 Weak Scaling Results: For each run with n nodes, a total of n million netflows

were run through the system with a million contiguous netflows per node randomly

selected from the CTU dataset. There were two types of runs, with renamed IPs (to

create the illusion of a larger network) or with randomized IPs. For the renamed

IPs, there continues to be improved throughput until 61 nodes/ 976 cores. For the

randomized IPs, the scaling continues through the total 64 nodes / 1024 cores. . . . 89


6.3 This shows the weak scaling efficiency. The efficiency continues to degrade for the

renamed IPs, but for the randomized IPs, the efficiency hovers a little above 0.5. This

is largely due to the fact that the randomized IPs almost perfectly load-balance the work, while the renamed IPs encounter load imbalances because of the power-law

nature of the data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

7.1 We vary the number of pull threads and the number of push sockets (NPS) for the

RequestCommunicator to determine the best setting for maximum throughput. The

RequestMap object utilizes the RequestCommunicator during the process(edge)

function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

7.2 Shows the maximum throughput for SAM and Flink. We vary the number of nodes

with values of 16, 32, 64, and 120. We vary the number of vertices, |V |, between

25,000 and 150,000. Flink has the best throughput with 32 nodes, after which adding

additional nodes hurts the overall aggregate rate and the number of found triangles

continues to decrease. SAM is able to scale to 120 nodes. . . . . . . . . . . . . . . . 101

7.3 Another view of the data where we plot the aggregate throughput versus the total

number of triangles to find. This view emphasizes that SAM does better with more

complex problems, i.e. more triangles. . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.4 Another look at the maximum throughput for SAM and Flink. Here we relax the re-

striction that the number of reported triangles must be higher than 85% of expected,

but keep the latency restrictions that the computation must complete within 15 sec-

onds of termination of input. Flink struggles when increasing to 64 nodes, and then drops

consistently with 120 nodes. SAM continues to increase throughput through 120

nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102


7.5 This graph shows the ratio of found triangles divided by the number of expected

triangles given by Theorem 2. The rate for each method was the same as from

Figure 7.4. SAM stays fairly consistent throughout the runs, with an average of

0.937. For Flink, the ratio decreases as the number of nodes increases. The average

ratio starts at 0.873 with 16 nodes, and then decreases to an average of 0.521 with

120 nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7.6 We keep the first edges of queries (kq) with probabilities {0.01, 0.1, 1}. We also vary

|V | to range from 25,000 to 150,000. This figure is a plot of edges/second versus the

number of nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

7.7 We vary the rate at which the first edges of queries are kept (kq = keep queries)

and plot the edges per day versus the rate at which triangles occur. . . . . . . . . . . 105

Chapter 1

Introduction

Graphs, sets of entities interconnected with edges, are a popular and powerful method for

representing many problems. Cyber systems [109, 154], where machines communicate with each

other through network interfaces, naturally create interaction graphs over time. Additionally, social

networks [54, 140], the electrical grid [79], protein-protein interaction networks [193, 103], neuronal

dynamics [31], and many other domains are all naturally modeled as graphs. While there is a

rich history and theory behind graphs, analyzing their temporal nature is less well understood,

and the temporal dimension can be critical to analyzing and detecting patterns of interest. In

particular, activity within cyber systems can be naturally described as a graph with temporal

information on the edges. The machines in the network are the nodes, while communications

between machines form edges with several properties, one of them being the interval of time during

which the communication occurred.

Another dimension to consider is the data generation rate. For example, high speed networks with 100 Gb/s links and multiple-Tb/s switches produce a vast stream of cyber data that is challenging to analyze. A category of research called streaming and semi-streaming

[144, 59] has produced algorithms with strict spatial and temporal constraints to deal with infinite

streams, but these operators have largely been narrowly customized to specific problems, and not

incorporated into a larger framework for data analysis. There is also a wealth of projects with

streaming APIs [12, 190, 188, 95, 172, 11, 78, 13, 10, 64], but a standardized, efficient way for

expressing streaming analysis has not arisen.


Figure 1.1: This dissertation is an intersection of four different research areas and presents a domain specific language (DSL) to express streaming temporal graph and machine learning queries.

The main contribution of this dissertation is to combine within one domain specific language

(DSL) the ability to succinctly express

• graph operations,

• temporal patterns,

• machine learning pipelines, and

• streaming operators.

We call this DSL the Streaming Analytics Language or SAL. SAL can be applied to any stream of

data. The only requirement is that the stream be a sequence of tuples. However, to flesh out the

details of SAL, we focused our efforts on analyzing network traffic and detecting malicious behavior.

Network analysis has all the components previously discussed. Machines interacting with

each other on a network produce a graph, where the nodes are Internet Protocol endpoints (IPs) and the edges are communications between IPs. Each communication has a start time and a

duration, fulfilling the temporal aspect of the problem. Additionally, there are communication

patterns among nodes, i.e. subgraphs, that are of interest and may indicate malicious behavior,

and these patterns have a temporal dimension to them that is vital to correctly identifying the


subgraph. For instance, say node A talks to node B at time t1 and B talks to C at time t2. You

might infer causality if t1 < t2, but the opposite is not true, even though the topological structure

is the same either way. There are several approaches that utilize the topological graph structure of

cyber interactions [91, 97, 53, 113, 178, 109, 191, 218, 206, 145] or the temporal dimension of the

communication [23, 53, 196, 109, 191, 85, 206, 164]. We seek to combine these ad-hoc approaches

within a unifying framework, namely SAL.

In terms of machine learning, there are many approaches that use machine learning to classify

network traffic. For example:

• botnet detection [138, 23, 84, 223, 198, 105, 184]

• intrusion detection [30, 185, 124, 167]

• classifying network traffic by application [101, 26, 146, 212, 57, 214]

• insider threat detection [134, 69, 92]

An overall theme behind these works is to extract meaningful features and then perform machine

learning, either in an unsupervised or supervised approach. The contribution of this work is to

express these myriad approaches within a single DSL and framework.

Finally, cyber systems quickly become a streaming problem as the size of the network grows.

Data is generated with such speed that often a single pass over the data is all that can be accom-

plished. Additionally, tight spatial and temporal complexity constraints are required to keep pace

with data creation, and to not overwhelm limited computational resources. There are a number of

streaming operators that have been published [16, 72, 132, 46, 221, 14, 48, 42, 141].

SAL provides these streaming operators as first-class citizens in the language.

SAL can be categorized into two types of computations: vertex-centric and subgraph match-

ing. For vertex-centric, the data is partitioned by vertices in the graph, and calculations are

computed based upon the tuples belonging to the vertex. More formally, let T be an infinite se-

quence of tuples that have a source vertex and a destination vertex, i.e. each tuple is an edge with


vertices defined along with potentially other fields. Let V be the set of all vertices. Let Tv be the

sequence of tuples where t ∈ Tv ⇒ source(t) = v ∨ target(t) = v for some v ∈ V . We refer to

vertex-centric computations as any function f that is computed continuously over each sequence

Tv for some subset of V .
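To make the definition concrete, the following C++ fragment is a minimal, illustrative sketch of a vertex-centric computation (it is not SAM code): edges are partitioned by vertex and a per-vertex feature, here an exact running mean of a single field, is updated continuously. The Edge struct and field names are assumptions for the example; a real SAL operator would compute over a sliding window with a polylogarithmic-space estimator rather than an exact mean.

// Illustrative sketch only: partition an edge stream by vertex and maintain a
// per-vertex feature continuously (an exact running mean of one field).
#include <iostream>
#include <string>
#include <unordered_map>

struct Edge {
  std::string source;
  std::string target;
  double bytes;   // stands in for a tuple field such as SrcTotalBytes
};

struct RunningMean {
  double sum = 0.0;
  long count = 0;
  void update(double x) { sum += x; ++count; }
  double value() const { return count ? sum / count : 0.0; }
};

int main() {
  std::unordered_map<std::string, RunningMean> feature;  // one feature per vertex
  Edge stream[] = {{"A", "B", 100}, {"B", "C", 40}, {"A", "C", 10}};
  for (const Edge& e : stream) {
    // Each edge belongs to the sequence T_v of both of its endpoints.
    feature[e.source].update(e.bytes);
    feature[e.target].update(e.bytes);
  }
  for (const auto& [v, f] : feature)
    std::cout << v << ": " << f.value() << "\n";
}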

Subgraph matching is finding subgraphs that fit topological, vertex, and edge constraints.

Topological constraints define a subgraph isomorphism problem between a graph G = (V,E) and

a query graph H = (V ′, E′). The task is to find all subgraphs g ∈ G such that there is a bijection

f that maps the vertices and edges of g to the vertices and edges of H. Namely, let g = (V , E)

where V ⊆ V and E ⊆ E∩ (V × V ). Then for g to be isomorphic to H, there must exist a bijection

f : V → V ′ such that (a, b) ∈ E ⇒ (f(a), f(b)) ∈ E′. Vertex constraints define further requirements

on the properties of the vertices of g that must be fulfilled for a match. Often, features are derived

for vertices using vertex-centric computations, and these features are then used to further refine

the subgraph query. Edge constraints define requirements on the fields within an edge and between

edges. For example, one can compare the start time of one edge, start(e0) to the start time of

another edge, start(e1), to ensure temporal ordering such as start(e0) < start(e1). These two

components of SAL, vertex-centric and subgraph matching, provide an expressive base from which

to succinctly describe many of the analyses discussed earlier.

Besides the language itself as a contribution, we also present a scalable implementation that

translates SAL into C++ code that runs in parallel on a cluster of machines. We call this interpreter

the Streaming Analytics Machine, or SAM. The code for SAM is over 19,000 lines, and scales to

128 nodes or 2560 cores and can process one million edges per second from streaming netflow data.

The rest of this dissertation is organized as follows: Chapter 2 introduces SAL and the various

ways of expressing streaming computations. Chapter 3 gives example SAL programs to express a

variety of cyber use cases. Chapter 4 discusses related work. Chapter 5 covers the implementation

of SAM, and how SAL maps to SAM. Chapter 6 discusses performance achieved via vertex-centric

computations, both in terms of scalability and classifier accuracy. Chapter 7 presents our results

with subgraph matching. Chapter 8 concludes.

Chapter 2

The Streaming Analytics Language

In this chapter we present the Streaming Analytics Language (SAL), define the different

aspects of the language, and describe case studies and examples to illustrate the concepts. SAL is

a combination of imperative statements used for feature extraction and declarative, SPARQL-like

[166] statements, used to express subgraph queries.

The streaming aspect of this work introduces important constraints on the design of SAL

and what operations we allow to be expressed. We utilize streaming algorithms, which is an area

of research where the desire is to keep the space and temporal requirements to be polylogarithmic,

i.e. O((log n)k) for some k where n is the size of the sliding window over which computations are

performed [144]. Sometimes those constraints, in particular the temporal complexity, are relaxed

[130]. Several algorithms have been published within these constraints:

• K-medians [16]

• Frequent Items [72]

• Mean/Frequency Counts [132, 46, 221]

• Quantiles [14]

• Rarity [48]

• Variance [16, 221]

• Vector norms [46]


• Similarity [48]

• Count distinct elements [42, 141]

We expose these algorithms as first-class citizens in SAL.

The motivation behind the tight spatial bounds is that we are operating over an infinite

stream of data where the ingest rates prohibit storing the entirety of the data within a sliding

window. For example, a common structure of many vertex-centric SAL programs is represented in

Figure 2.1. There is a stream of edges that is partitioned into N streams representing a stream

for each vertex. Each vertex has o operators applied to it, each of which uses a window of length

n. Assume that the o operators utilize different aspects of the edge data, i.e. they don’t share the

same data, and that each element of the window requires b bytes. Non-streaming approaches would

require N × o × n × b bytes while streaming approaches require N × o × log(n) × b bytes. Given

example numbers of a million vertices, 10 operators, a window size of 5000, and 4 bytes per element,

non-streaming approaches would necessitate about 190 GBs while streaming approaches require

only 500 MBs. While 190 GBs could reasonably fit within memory of a single system, it doesn't

take too much additional complexity to make the program unable to reside within memory. Our

implementation of operators within SAL all have polylogarithmic spatial complexity.

Figure 2.1: This figure represents a common pattern for vertex-centric computations in SAL programs. A stream of edges is partitioned into N vertices that are then operated on by o operators that compute over a sliding window of length n items.


While some streaming algorithms are defined to operate on the entire stream, we focus on

the sliding window model, where only recent inputs contribute to feature calculation. We believe

the sliding window model is more appropriate for cyber data, where the environment is constantly

changing over time and threats evolve and signatures are dynamic. The sliding window model can

also be subcategorized into either a window defined by a particular time duration, or by the last

n items. Currently SAL only supports expressing windows over the last n items, but many of the

underlying algorithms are easily adapted to temporal windows, so it would not be difficult to add

support for temporal windows.

These streaming operators are specified with FOREACH statements, which have the fol-

lowing form:

<FeatureName> = FOREACH <StreamName> GENERATE <Operator>

For all logical streams contained by <StreamName>, the <Operator> is applied to each stream,

creating a time-varying feature indexed by each logical stream.

The operators that can be specified are listed below:

• ave: Returns the estimated average value over the sliding window using an exponential

histogram approach [47]. If not using default values, the operator has the form ave(f, n, k)

where f is the name of the field in the tuple, n is the size of the sliding window, and k is a

parameter of the exponential histogram model that determines the size of each bin in the

histogram.

• sum: Returns the estimated sum over the sliding window. It also uses an exponential

histogram and has the same form sum(f, n, k).

• var: The estimated variance. We again use the exponential histogram approach with the

same signature, var(f, n, k).

• topk: Returns the estimated top k most frequent items. We use an implementation of

Golab et al. [73]. It has the form topk(f, n, b, k) where f is the targeted field, n is the size


of the sliding window, b is the size of the basic window, and k is how many frequent items

to report.

• median: The estimated median value. Arasu and Manku [14] provide an approach for

estimating quantiles to find a median value.

• countdistinct: This creates an estimate on the number of distinct elements. Metwally et al.

[141] and Clifford and Cosma [42] produced algorithms that can be used for this calculation.
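To give a flavor of how such an operator can work in small space, the fragment below sketches one simple distinct-count estimator, a K-minimum-values (KMV) sketch that keeps only the k smallest hash values seen. It is an illustrative stand-in, not the algorithm of Metwally et al. or of Clifford and Cosma cited above, and it omits the sliding-window bookkeeping a SAL operator would need.

// Illustrative K-minimum-values (KMV) sketch: estimates the number of distinct
// items in a stream using O(k) space.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

class KMVDistinct {
  std::set<double> smallest;   // the k smallest normalized hash values seen
  std::size_t k;
public:
  explicit KMVDistinct(std::size_t k) : k(k) {}
  void insert(const std::string& item) {
    // Map the item to a pseudo-uniform value in (0, 1].
    double h = (std::hash<std::string>{}(item) + 1.0) /
               (static_cast<double>(UINT64_MAX) + 1.0);
    smallest.insert(h);
    if (smallest.size() > k) smallest.erase(std::prev(smallest.end()));
  }
  double estimate() const {
    if (smallest.size() < k) return static_cast<double>(smallest.size());
    return (k - 1) / *smallest.rbegin();   // classic KMV estimator
  }
};

int main() {
  KMVDistinct sketch(64);
  for (int i = 0; i < 10000; ++i)
    sketch.insert("flowsize-" + std::to_string(i % 500));  // 500 distinct values
  std::cout << "estimated distinct: " << sketch.estimate() << "\n";
}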

Another related area of research is semi-streaming [144, 59], which is a relaxation from polylog

spatial requirements to O(N · polylog N) for graph algorithms where N is the number of vertices

in the graph. Many graph algorithms cannot be calculated on anything less then O(N · polylog N)

space, hence the need to relax the requirement. While we do not implement any semi-streaming

algorithms, it should be noted that many SAL vertex-centric programs have the general form of

a semi-streaming operation. For example, there are N different streams, one for each vertex. For

each of the N streams we compute a O(polylog n) spatial complexity streaming operation, where

n is the window size. Overall, the entire pipeline has spatial complexity O(N · polylog n), which

closely resembles the definition of semi-streaming.

We will introduce SAL with an example, provided by Listing 2.1, which specifies a query

to match a type of malicious pattern known as a “Watering Hole Attack”. In a Watering Hole

Attack, an attacker infects websites (e.g. via malvertisements) that are frequented by the targeted

community. When someone from the targeted community accesses the website, their machine

becomes infected and begins communicating with a machine controlled by the attacker. Figure 2.2

illustrates the attack as a set of nodes and edges.


Listing 2.1: SAL Code Example

1  // Preamble Statements
2  WindowSize = 1000;
3
4  // Connection Statements
5  Netflows = VastStream("localhost", 9999);
6
7  // Partition Statements
8  PARTITION Netflows BY SourceIp, DestIp;
9  HASH SourceIp WITH IpHashFunction;
10 HASH DestIp WITH IpHashFunction;
11
12 // Defining Features
13 Top1000 = FOREACH Netflows GENERATE topk(DestIp, 100000, 10000, 1000);
14
15 // Subgraph definition
16 SUBGRAPH ON Netflows WITH source(SourceIp)
17          AND target(DestIp)
18 {
19   trg e1 bait;
20   trg e2 controller;
21   start(e2) > end(e1);
22   start(e2) - end(e1) < 10;
23   bait in Top1000;
24   controller not in Top1000;
25 }

In this SAL program there are five parts:

• Preamble Statements

• Partition Statements


Figure 2.2: Illustration showing the subgraph nature of a Watering Hole Attack. The target machine accesses a popular site, Bait. Shortly after accessing Bait, Target begins communicating to a new machine, the Controller.

• Connection Statements

• Feature Definition

• Subgraph Definition

Preamble statements allow for global constants to be defined that are used throughout the program.

In the above listing, line 2 defines the default window size, i.e. the number of items in the sliding

window.

After the preamble are the connection statements. Line 5 defines a stream of netflows called

Netflows. VastStream tells the SAL interpreter to expect netflow data of a particular format (we

use the same format for netflows as found in the VAST Challenge 2013: Mini-Challenge 3 dataset

[201]). Each participating node in the cluster receives netflows over a socket on port 9999. The

VastStream function creates a stream of tuples that represent netflows. For the tuples generated

by VastStream, keywords are defined to access the individual fields of the tuple, listed below. A

more complete definition can be found in the data description provided by VAST.

• SamGeneratedId: Each netflow is assigned a unique ID by SAM.

• Label: This field is for when a netflow has a label, such as when we are processing labeled


data, or when SAM assigns a label during testing. This is often used for supervised machine

learning where individual netflows are labeled as malicious or benign.

• TimeSeconds: The number of seconds since the epoch plus microseconds as a decimal

value.

• ParseDate: A date that is easier to read.

• DateTime: A numeric version of ParseDate

• IpLayerProtocol: The IP protocol number.

• IpLayerProtocolCode: The text version of the IP protocol.

• SourceIp: The source IP.

• DestIp: The destination IP.

• SourcePort: The source port.

• DestPort: The destination port.

• MoreFragments: Used when the flow is a long running data stream. Nonzero when more

fragments will follow.

• CountFragments: If nonzero, indicates that this record is not the first for the flow.

• DurationSeconds: An integer indicating the time in seconds that the flow existed.

• SrcPayloadBytes: The sum of bytes that are part of the payload coming from the source.

• DestPayloadBytes: The sum of bytes that are part of the payload coming from the

destination.

• SrcTotalBytes: The total bytes coming from the source.

• DestTotalBytes: The total bytes coming from the destination.


• SrcPacketCount: The total number of packets coming from the source.

• DestPacketCount: The total number of packets coming from the destination.

• RecordForceOut: Indicates that the record was pushed out before shutdown, and may

be incomplete.

There are several different standard netflow formats. SAL currently supports only one format,

the one used by VAST ([201]). Adding other netflow formats is a straightforward task. In fact,

SAL can easily be extended to process any type of tuple. Using our current SAL interpreter,

SAM (see Chapter 5), the process is to define a C++ std::tuple with the required fields and a

function object that accepts a string and returns an std::tuple. Once a mapping is defined from

the desired keyword (e.g. VastStream) to the std::tuple, this new tuple type can then be used

in SAL connection statements. The mapping is defined via the Scala Parser Combinator [175] and

is discussed in more detail in Chapter 5.
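A hypothetical sketch of what such an extension might look like is shown below; the tuple layout, field order, and names are assumptions for illustration and are not the actual SAM interface.

// Hypothetical new tuple type: a std::tuple for the fields plus a function
// object that parses one whitespace-separated line of text into that tuple.
#include <iostream>
#include <sstream>
#include <string>
#include <tuple>

// Assumed layout: <TimeSeconds, SourceIp, DestIp, SourcePort, DestPort>
using MyFlow = std::tuple<double, std::string, std::string, int, int>;

struct MyFlowParser {
  MyFlow operator()(const std::string& line) const {
    std::istringstream ss(line);
    double t; std::string src, dst; int sport, dport;
    ss >> t >> src >> dst >> sport >> dport;
    return {t, src, dst, sport, dport};
  }
};

int main() {
  MyFlow f = MyFlowParser{}("1365582756.4 192.168.0.10 10.0.0.5 45120 80");
  std::cout << std::get<1>(f) << " -> " << std::get<2>(f) << "\n";
}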

Following the connection statements is the definition of how the tuples are partitioned across

the cluster. Line 8 specifies that the netflows should be partitioned separately by SourceIp and

DestIp. Each node in the cluster acts as an independent sensor and receives a separate stream of

netflows. These independent streams are then re-partitioned across the cluster. In this example,

each node is assigned a set of Source IP addresses and Destination IP addresses. IP addresses are

assigned to nodes using a common hash function. The hash function used is specified on lines 9 and

10, called the IpHashFunction. This is another avenue for extending SAL. Other hash functions

can be defined and mapped to SAL constructs, similar to how other tuples can be added to SAL.

The process is to define a function object that accepts the tuple type and returns an integer, and

then map the function object to a keyword using the Scala Parser Combinator.
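The sketch below illustrates the idea with a hypothetical hash function object; for brevity it takes the IP field directly rather than the whole tuple, and the names are assumptions rather than SAM's actual interface.

// Hypothetical partitioning hash: maps an IP to an integer; the owning node
// is then, e.g., the hash value modulo the number of nodes in the cluster.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

struct IpHashFunctionSketch {
  std::size_t operator()(const std::string& ip) const {
    return std::hash<std::string>{}(ip);
  }
};

int main() {
  std::size_t numNodes = 64;
  std::cout << "node for 192.168.0.1: "
            << IpHashFunctionSketch{}("192.168.0.1") % numNodes << "\n";
}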

The next part of the SAL example program defines features. On line 13 we use the FORE-

ACH GENERATE statement to calculate the most frequent destination IPs across the stream

of netflows. The second parameter defines the total window size of the sliding window. The third

parameter defines the size of a basic window [73], which divides up the sliding window into smaller


chunks. The fourth parameter defines the number of most frequent items to keep track of. In this

case, the FOREACH GENERATE statement operates on a single stream. In Section 2.1 we

will examine how the same statement can operate on multiple streams.

Finally, lines 19 and 20 define the subgraph of interest in terms of declarative, SPARQL-like

statements [166] that have the form <node> <edge> <node>. Lines 21 and 22 define temporal

constraints on the edges, specifying that e1 comes before e2 and that from the end of e1 to the

start of e2, the total time is less than ten seconds. Also, lines 23 and 24 define constraints on the vertices

based on the previously defined feature Top1000. Vertex bait should be a frequent destination

while controller should be infrequent.

We will cover other Domain Specific Languages in Section 4.2, but as a quick note, the

language probably closest to the non-subgraph portion of SAL is Pig Latin [158]. Pig Latin is an

imperative language developed as a procedural analog to SQL. The developers of Pig found that

imperative map-reduce [51] style approaches were popular for data intensive parallel tasks, but too

low-level and rigid, requiring significant development time and enabling low levels of reuse. Also,

the declarative syntax of SQL was unnatural to the data analysts. Pig Latin’s purpose is to provide

a high-level language that makes it easy to describe the flow of data over a set of operations, but that

uses more abstract, higher-level constructs than map-reduce.

Pig Latin describes the flow of data over a set of operations, similar to SAL. However, given

SAL’s focus on streaming data instead of batch processing, the semantics of the two languages

diverge. The biggest difference is the STREAM BY statement. While Pig processes batches

of tuples, one of SAL's main use cases is to divide a never-ending stream of tuples into separate

substreams, to each of which a pipeline of operations is applied. Pig has a FOREACH GENER-

ATE statement, but its purpose is to transform a set of tuples into another set of tuples, whereas

for SAL, the semantics are to create a feature for each substream of data.

Concerning the graph-matching portion of SAL, we elected to use a declarative approach,

where the user describes the subgraph of interest rather than specifying a procedural query plan for

how the subgraphs are found. The language closest to our approach is SPARQL [166] (a recursive


acronym for SPARQL Protocol and RDF Query Language), which is a query language over Resource

Description Framework (RDF) [1] data. In preliminary work [76] we mapped SPARQL code onto

parallel vertex-centric programming frameworks, which are discussed in more detail in Section 4.1.

From that experience we reasoned that it is easier to declare subgraphs of interest rather than create

the plan to find them. Depending on the order of execution, the computational requirements can

be orders of magnitude different. In our opinion, it is better to let a query optimizer handle the

organization of the subgraph query plan.

While we use some concepts of SPARQL, namely basic graph patterns that describe the

topological structure of the subgraph, there are aspects of SAL that significantly diverge from

SPARQL and necessitate a separate approach. The differences between the requirements of SAL

and SPARQL-related efforts are described in more detail in Section 4.6.

For the rest of this chapter we will do the following:

• Look at a case study of using SAL based on the problem of finding botnets in Section 2.1.

• Integrate with static vertex attributes in Section 2.2.

• Present temporal subgraph queries in Section 2.3.

2.1 Case Study

In this section, we examine one approach for creating a classifier for detecting botnets, namely

Disclosure [23]. This example will help elucidate how SAL can be used to create succinct represen-

tations of machine learning pipelines that previously were developed in an ad-hoc fashion. Also,

it will demonstrate what cannot be expressed by SAL if we want to remain within the bounds of

streaming calculations. Some of the features created by Disclosure require algorithms that do not

comply with the desired constraints of streaming and semi-streaming. As such we have decided to

not include those features within SAL at this time. If space is not a concern, SAL can always be

extended to use non-streaming approaches that do not have polylogarithmic space constraints. In


any case, our intent with SAL is to winnow the stream of data to something more manageable for

more intensive study. The full program is provided in Listing 2.2.

Listing 2.2: SAL Disclosure Program

1  Netflows = VastStream("localhost", 9999);
2  PARTITION Netflows BY SourceIp, DestIp;
3  HASH SourceIp WITH IpHashFunction;
4  HASH DestIp WITH IpHashFunction;
5  VertsByDest = STREAM Netflows BY DestIp;
6  Top2 = FOREACH VertsByDest GENERATE topk(DestPort, 10000, 1000, 2);
7  Servers = FILTER VertsByDest BY Top2.value(0) + Top2.value(1) > 0.9;
8  FlowsizeAveIncoming = FOREACH Servers GENERATE ave(SrcTotalBytes);
9  FlowsizeAveOutcoming = FOREACH Servers GENERATE ave(DestTotalBytes);
10 FlowsizeVarIncoming = FOREACH Servers GENERATE var(SrcTotalBytes);
11 FlowsizeVarOutcoming = FOREACH Servers GENERATE var(DestTotalBytes);
12 UniqueIncoming = FOREACH Servers GENERATE countdistinct(SrcTotalBytes);
13 UniqueOutgoing = FOREACH Servers GENERATE countdistinct(DestTotalBytes);
14 DestSrc = STREAM Servers BY DestIp, SourceIp;
15 TimeLapseSeries = FOREACH DestSrc TRANSFORM
16     (TimeSeconds - TimeSeconds.prev(1)) : TimeDiff;
17 TimeDiffVar = FOREACH TimeLapseSeries GENERATE var(TimeDiff);
18 TimeDiffMed = FOREACH TimeLapseSeries GENERATE median(TimeDiff);
19 DestOnly = COLLAPSE TimeLapseSeries BY DestIp FOR TimeDiffVar, TimeDiffMed;
20 AveTimeDiffVar = FOREACH DestOnly GENERATE ave(TimeDiffVar);
21 VarTimeDiffVar = FOREACH DestOnly GENERATE var(TimeDiffVar);

Many botnets have one or a small set of command and control servers that issue commands

to a large number of infected machines, which then perform attacks such as distributed denial-of-service, data theft, or spam [183], or even use the computational resources for Bitcoin mining [102].

Disclosure focuses on identifying command and control (C&C) servers within a botnet.


The first part of the Disclosure pipeline is to identify servers. They define servers as an IP

address where the top two ports account for 90% of the flows. Server definition is handled by

lines 5 - 7. Line 5 demonstrates the use of a STREAM BY statement. It logically divides the

input stream (Netflows) into multiple streams by a set of keys, which in this case is a single key,

DestIp. The FOREACH GENERATE statement on line 6 then operates on each of these

individual streams. It calculates an estimate of the top two ports that receive traffic for each

destination IP using the topk operator. We then follow that with a filter on line 7. The value(n)

function returns the frequency of the nth most frequent item (zero-based indexing). Netflows where

the destination IP has two ports that account for 90% of the traffic to that IP are allowed through.

Figure 2.3 demonstrates the flow graphically.

Figure 2.3: This figure demonstrates the use of the STREAM BY statement to logically divide a single stream into multiple streams. The STREAM BY statement is followed by a FOREACH GENERATE statement that creates estimates on the top two most frequent destination ports and their frequencies. The FILTER statement uses this information to downselect to IPs where the top two ports account for 90% of the traffic.

Once the servers have been determined, the authors of Disclosure hypothesized that flow size

distributions for C&C servers are distinguishable from benign servers. An example they give is that

C&C servers generally have a limited number of commands, and thus flow sizes are drawn from a small set of values. On the other hand, benign servers will generally have a much wider range of values. To detect these differences between C&C servers and benign servers, they create three different types


of features based on flow size:

• Statistical Features

• Autocorrelation

• Unique flow sizes

For the statistical features, they extract the mean and standard deviation for the size of incoming

and outgoing flows for each server. This is easy to express in SAL, and can be found on lines 8 -

11. For each IP in the set of servers, we use the FOREACH GENERATE statement combined

with either the ave operator or var operator.

The Disclosure authors also generate features using autocorrelation on the flow sizes. The

idea is that C&C servers often have periodic behavior that would be evident from an autocorrelation

calculation. For each server, the flow sizes coming in with each netflow can be thought of as a time

series. They divide this signal up into 300 second intervals and calculate the autocorrelation of

the time series. Unfortunately, we are not aware of a polylogarithmic streaming algorithm for

calculating autocorrelation. As such, we currently do not allow autocorrelation to be expressed

within SAL; mixing algorithms with vastly different spatial and temporal complexity requirements

would negate many of the benefits of the language. For a problem with N vertices, calculating

autocorrelation for each vertex would require O(N · n) space where n is the length of the window. Applying polylogarithmic streaming operators to N vertices would require O(N · log(n)^k)

space, offering significant savings and calling into question the scalability of Disclosure to handle

large problems. That said, there is nothing preventing an extension of SAL for an autocorrelation

operation, just the documentation would need to be clear about the complexity.

The final set of features based on flow-sizes involves finding the set of unique flow sizes. The

hypothesis is that botnets have a limited set of messages, and so the number of unique flow sizes

will be smaller than a typical benign server. To find an estimate on the number of unique flow

sizes, one can use the countdistinct operator, as shown on lines 12 - 13. However, Disclosure goes

a step further than just the number of unique flow sizes. They create an array with the counts for


each unique element and then compute unspecified statistical features on this array. While this is

not exactly expressible in SAL, and having an exact answer would break our space constraints, one

could use TopK to get an estimate on the counts for the most frequent elements.

Besides features based on flow size, Disclosure also computes features focusing on the client

access patterns. The hypothesis is that all the bots accessing a particular C&C server will exhibit

very similar behavior, while the behavior of clients accessing benign servers will not be so uniform.

Disclosure defines a time series for each server-client pair by calculating the inter-arrival times

between consecutive connections. For example, if we had netflows that occurred at times t0, t1, ..., tn, then the series would be t1 − t0, t2 − t1, ..., tn − tn−1. To specify this time series in SAL,

one uses the TRANSFORM statement. The TRANSFORM statement allows one to transform

from one tuple representation to another.

For this example, we need to transform from the original netflow tuple representation to a

tuple that has three values: the SrcIp, DestIp, and the inter-arrival time. On line 14, we use the

STREAM BY statement to further split the stream of netflows into source-destination IP pairs.

Line 16 then follows with the TRANSFORM statement that calculates the inter-arrival time.

Since SourceIp and DestIp are the keys defined by the STREAM BY statement, those values

are included by default as the first two values of the newly defined tuple. This part of the pipeline

is represented in the left side of Figure 2.4.

On line 16 we introduce the prev(i) function. The prev(i) function returns the value of the

related field i items back in time. So TimeSeconds.prev(1) returns the value of TimeSeconds in

the tuple that occurred just previous to the current tuple, thus giving us the inter-arrival time.

The colon followed by TimeDiff gives a label to the tuple value and can be referred to in later

SAL statements (e.g. line 17).
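The following C++ fragment is an illustrative sketch (not SAM code) of the bookkeeping this TRANSFORM expresses: for each (DestIp, SourceIp) pair, the previously seen TimeSeconds is remembered so that the inter-arrival time can be emitted with each new netflow. The struct and field names are assumptions for the example.

// Illustrative sketch: per-(DestIp, SourceIp) inter-arrival times, i.e. the
// TimeDiff value produced by TimeSeconds - TimeSeconds.prev(1).
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Netflow { double timeSeconds; std::string sourceIp, destIp; };

int main() {
  std::vector<Netflow> stream = {
      {100.0, "10.0.0.1", "192.168.0.1"},
      {103.5, "10.0.0.1", "192.168.0.1"},
      {109.0, "10.0.0.1", "192.168.0.1"}};
  // Last TimeSeconds observed per (DestIp, SourceIp) key.
  std::map<std::pair<std::string, std::string>, double> lastSeen;
  for (const Netflow& n : stream) {
    auto key = std::make_pair(n.destIp, n.sourceIp);
    auto it = lastSeen.find(key);
    if (it != lastSeen.end())
      std::cout << key.first << " <- " << key.second << " TimeDiff: "
                << n.timeSeconds - it->second << "\n";   // prints 3.5, then 5.5
    lastSeen[key] = n.timeSeconds;
  }
}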

Now that we have the time series expressed in SAL, we can then add the feature extraction

methods that Disclosure performs on the inter-arrival times. Disclosure calculates the minimum,

maximum, median, and standard deviation. The median and standard deviation can be expressed

in SAL, and are found on lines 17 and 18.


Figure 2.4: This figure demonstrates the use of the TRANSFORM and COLLAPSE BY statements to define parts of the Disclosure pipeline.

However, maximum and minimum are not currently supported in SAL. The reason is that

max and min provably require O(n) space where n is the size of the window when computing over

a sliding window [46]. As discussed previously, we made the design decision to limit operators

to be polylogarithmic so that when we compute over N vertices, the total space complexity is

O(N · log(n)^k) instead of O(N · n). When computing over the entire data stream, to find the

max/min one can keep track of one number, but the sliding window adds complexity as the max/min

expires. A mixture between the two models, over the entire stream and sliding windows, might be

sufficient for creating features. For example, one could use the scheme presented below to estimate

the maximum value:


Here, we create a sequence of max values, max_0, max_1, ..., where each max_i is calculated during the presentation of n consecutive items; however, only two values need to be stored at any one time, max_i and max_{i+1}. When querying for the current max after x items, the index of the maximum value is

f(x) = 0, if x < n
f(x) = 2⌊(x − 1)/(2n)⌋, otherwise.    (2.1)

where n is the size of the window. This scheme would allow for constant space requirements and

O(n) processing time; however, it remains to be seen if max/min features extracted in this manner

are useful in practical situations and we leave it as future work.
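As a rough illustration of this idea, the sketch below keeps only the max of the current block of n items and the max of the previously completed block, answering queries with the larger of the two; the answer therefore reflects between n and 2n − 1 of the most recent items. This is one possible realization of the scheme under stated assumptions, not an implementation from SAL or SAM, and it does not reproduce the exact indexing of Equation (2.1).

// Illustrative block-based max approximation: constant space, O(1) per item.
#include <algorithm>
#include <iostream>
#include <limits>

class ApproxWindowMax {
  long   n;                 // block size (roughly the window size)
  long   seenInBlock = 0;
  double curMax  = std::numeric_limits<double>::lowest();
  double prevMax = std::numeric_limits<double>::lowest();
public:
  explicit ApproxWindowMax(long n) : n(n) {}
  void insert(double x) {
    curMax = std::max(curMax, x);
    if (++seenInBlock == n) {           // block complete: rotate the two values
      prevMax = curMax;
      curMax  = std::numeric_limits<double>::lowest();
      seenInBlock = 0;
    }
  }
  double query() const { return std::max(prevMax, curMax); }
};

int main() {
  ApproxWindowMax m(1000);
  for (int i = 0; i < 5000; ++i) m.insert(i % 997);
  std::cout << "approximate window max: " << m.query() << "\n";   // prints 996
}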

For the features derived from the time series to be applicable to classifying servers, we need

to combine the features from all the clients. To do so, we use the COLLAPSE BY statement so that SourceIp is no longer one of the keys used to split the data flow. The BY clause contains a

list of keys that are kept. Unspecified keys are removed from the key set. On line 19 DestIp is

kept while SourceIp is not specified, meaning that it is no longer used as a key to split the data

flow.

The semantics of the COLLAPSE BY statement are the following: Let k+ be the tuple

elements that are kept by COLLAPSE, and let k− be the tuple elements that are no longer used


as a key. For each tuple t that appears in the stream S of data, let k+(t) be the subtuple with only

tuple elements from k+, and let k−(t) be the subtuple with only tuple elements from k−. Also, let

r(t) be the subtuple with remaining elements that are neither in k+(t) nor in k−(t). Define L to

be the set of unique k+(t) for all t ∈ S, i.e.

L = ⋃_{t ∈ S} k+(t)    (2.2)

For each l ∈ L, we define another set, M_l:

M_l = ⋃_{t ∈ S, k+(t) = l} k−(t)    (2.3)

COLLAPSE BY creates a mapping for each set M_l, mapping the elements of M_l to the most recently seen r(t) associated with each m ∈ M_l. These mappings can then be operated on by the

FOREACH GENERATE statement.

As an example, say we have the following DestIp, SourceIp pairs with the following values

for TimeDiffMed:

Netflow #   DestIp        SourceIp        TimeDiffMed
i           192.168.0.1   192.168.0.100   1.5
i + 1       192.168.0.2   192.168.0.100   5
i + 2       192.168.0.1   192.168.0.101   2.1
i + 3       192.168.0.1   192.168.0.100   2

The following graphic illustrates the result of applying line 19 to the above data:


After processing tuple i, there will be one map, for destination IP 192.168.0.1, and it will

have one key-value pair, < 192.168.0.100, 1.5 >. After tuple i + 1, there will be two maps, one

for 192.168.0.1 and one for 192.168.0.2. The map for 192.168.0.2 will have the key-value pair

< 192.168.0.100, 5 >. After tuple i + 2, there will be another element in the map for 192.168.0.1,

namely < 192.168.0.101, 2.1 >. After tuple i + 3, in the map for 192.168.0.1, the value for key

192.168.0.100 is replaced with 2.
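The nested-map bookkeeping in this example can be sketched as follows; the code is illustrative only and not SAM's implementation.

// Illustrative COLLAPSE BY bookkeeping: one map per kept key (DestIp) whose
// entries map the dropped key (SourceIp) to the most recent remaining value.
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  // DestIp -> (SourceIp -> latest TimeDiffMed)
  std::unordered_map<std::string, std::unordered_map<std::string, double>> collapsed;

  struct Row { std::string destIp, sourceIp; double timeDiffMed; };
  Row rows[] = {{"192.168.0.1", "192.168.0.100", 1.5},
                {"192.168.0.2", "192.168.0.100", 5.0},
                {"192.168.0.1", "192.168.0.101", 2.1},
                {"192.168.0.1", "192.168.0.100", 2.0}};  // replaces the 1.5 entry
  for (const Row& r : rows)
    collapsed[r.destIp][r.sourceIp] = r.timeDiffMed;

  // A FOREACH GENERATE over DestOnly could now compute, e.g., the average
  // TimeDiffMed across the clients of each server.
  for (const auto& [dest, clients] : collapsed) {
    double sum = 0;
    for (const auto& [src, v] : clients) sum += v;
    std::cout << dest << ": ave TimeDiffMed = " << sum / clients.size() << "\n";
  }
}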

Once we have collapsed back to DestIp, we can then calculate statistics across the set of

clients for each server. The Disclosure paper does not specify which statistics are calculated, but

we assume standard operators such as average, variance, and median were used, all of which can

be expressed in SAL. Lines 20 - 21 provide some examples.

That concludes our exploration of how SAL can be used to express concepts from a real

pipeline defined in another paper. While there are some operations that cannot be expressed, e.g.

the autocorrelation and max/min features, most of the features could be expressed in SAL. We

believe SAL provides a succinct way to express streaming machine learning pipelines.

To use this pipeline for machine learning purposes, a SAL pipeline can have two operating

modes: training and testing. In the training mode, the pipeline is run against a finite set of data,

usually with labels, and any features created by the pipeline will be appended per input tuple.

This feature set along with the labels is used to train a classifier offline. The testing phase then

applies the pipeline to an online stream of data, transforming each input tuple into the defined feature

set, and then applies the trained classifier to the feature set to assign a label.

2.2 Static Vertex Attributes

The previous section outlined features that can be defined and that depend upon the stream

of data to calculate them. However, there may exist attributes associated with vertices that remain

relatively static. A common use case we’ve found is knowing when a vertex is a local IP address

and when it is outside the firewall. This is helpful in defining subgraphs of interest, further refining

the meaning and constraining what is considered. For the Watering Hole Attack, we can add the


line

LocalInfo = VertexInfo(<connection info>)

and then within the subgraph section we can add

trg in LocalInfo;
bait not in LocalInfo;
controller not in LocalInfo;

This allows us to create the constraints that trg is behind the firewall while bait and controller

are outside the firewall, further clarifying the intent of the query.

The general format for in and not in is <node variable> <in|not in> <vertex info>

The in and not in check to see if the node variable is a defined key within the static vertex

attributes. We can think of the Static Vertex Attributes as a table with a primary key. Attributes

of the table are accessed via the following form: <Attribute>(<key>).<FieldName>. For example, if we have a field Type that informs what the purpose of an IP is, we can add the following

types of constraints:

LocalInfo(trg).Type == "Server"
LocalInfo(pos).Type == "POS"

Where the first line indicates that the vertex trg is a Server and the second line indicates that

the vertex pos is a POS or point-of-sale machine.

We will see more examples of Static Vertex Attributes when we walk through how SAL can

express common cyber queries in Chapter 3.
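One way such a static attribute table could be represented is sketched below; the class and field names are hypothetical and are only meant to illustrate the in / not in checks and field comparisons described above.

// Hypothetical static vertex attribute table keyed by IP address.
#include <iostream>
#include <string>
#include <unordered_map>

struct VertexAttributes { std::string type; };   // e.g. "Server", "POS"

class VertexInfoTable {
  std::unordered_map<std::string, VertexAttributes> table;
public:
  void add(const std::string& ip, VertexAttributes a) { table[ip] = a; }
  bool contains(const std::string& ip) const { return table.count(ip) > 0; }  // in / not in
  const VertexAttributes& at(const std::string& ip) const { return table.at(ip); }
};

int main() {
  VertexInfoTable localInfo;
  localInfo.add("192.168.0.20", {"Server"});
  // trg in LocalInfo, and LocalInfo(trg).Type == "Server"
  std::string trg = "192.168.0.20";
  std::cout << std::boolalpha
            << (localInfo.contains(trg) && localInfo.at(trg).type == "Server") << "\n";
}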

2.3 Temporal Subgraph Matching

While the previous section focused on the streaming operators expressed as imperative state-

ments in SAL, here we provide more details on subgraph pattern matching, which we later show

to be polynomial in the number of possible edges over a sliding window. Subgraph matching uses

declarative statements with a syntax inspired by SPARQL [166]. A SAL subgraph matching query


has the following pieces:

• Graph Definition: SAL defines a stream of tuples, e.g. Netflows, but in order to perform

subgraph matching, SAL needs to know what constitutes a vertex within the tuple. In

particular, we support directed graphs where each edge has a direction, emanating from a

source and terminating at a target. The definition of the graph is specified with a statement

of the form

SUBGRAPH ON <Stream> WITH source(<Source>) AND target(<Target>) { ... }

The tuple fields <Source> and <Target> must be of the same type. In our implemen-

tation there is a std::tuple that is mapped to a particular tuple type in SAL. The two

fields must be of the same type in the std::tuple. Within the curly braces, { ... } , the

following three types of statements are made.

• Edge Definition: These are statements that define the topological structure of the graph

and come in the form <node> <edge> <node>. Each element is a variable name for a

node or an edge. Variable names are alphanumeric but begin with an alphabetic character.

• Edge Constraints: Using the keywords defined for the tuple, one can form a constraint on

edges in terms of a comparison of two arithmetic expressions:

<EdgeConstraint> := <Expr> <Comp> <Expr>

<Expr>   := <Term> | <Expr> + <Term> | <Expr> - <Term>

<Term>   := <Factor> | <Term> * <Factor> | <Term> / <Factor>

<Factor> := ( <Expr> ) | - <Factor> | <Variable>.<Field> | start(<Variable>) | end(<Variable>) | <RealNumber> | <String>

<Comp>   := < | > | <= | >= | ==

In order to evaluate correctly, the variable name must match a variable specified for an

edge in an Edge Definition statement.


• Vertex Constraints: One can also define constraints on vertices:

<VertexConstraint> := <node> <VertexComp> <FeatureName> |

                      <node> <VertexComp> <VertexInfo> |

                      <VertexInfo>(<node>).<FieldName> <Comp> <RVertexInfo>

<VertexComp>  := in | not in

<RVertexInfo> := <String> | <Float> | <Integer>

Vertex constraints have three basic forms. The first form checks to see if a vertex is found

in a previously defined feature, e.g. bait in Top1000 checks to see if the vertex bait is

found in the feature Top1000. In order to evaluate correctly, the node variable must refer

to a vertex variable defined in an Edge Definition statement, the feature variable must refer

to a variable defined by a Feature Definition Statement and the key of the feature must

be of the same type as the vertex type. The second form is similar to the first, except it

checks to see if the vertex variable is a key in a defined VertexInfo table. The third form

accesses a field of a VertexInfo table with the vertex variable as a key and checks that it satisfies the comparison operation.

With the overall process for subgraph definition outlined, we now discuss the complexity of the

problem. We begin with a discussion on subgraph isomorphism and how that problem relates to

temporal subgraph matching.

Subgraph isomorphism, the decision problem of determining for a pair of graphs, H and G,

if G contains a subgraph that is isomorphic to H, is known to be NP-complete. Additionally, the

problem of finding all matching subgraphs within a larger graph is NP-hard [122]. However, when

we consider a streaming environment with bounds on the temporal extent of the subgraph, the

problem becomes polynomial. We introduce some definitions to lay the groundwork for discussing

complexity.


Definition 1. Graph: G = (V,E) is a graph where V are vertices and E are edges. Edges e ∈ E

are tuples of the form (u, v) where u, v ∈ V .

Definition 2. Temporal Graph: G = (V,E) is a temporal graph where V are vertices and E

are temporal edges. Temporal edges e ∈ E are tuples of the form (u, v, t, δ) where u, v ∈ V and

t, δ ∈ R and represent the start time of the edge and its duration, respectively.

Definition 3. Streaming Temporal Graph: A graph G is a streaming temporal graph when G

is a temporal graph where V is finite and E is infinite.

To express subgraph queries against a temporal graph, we need the notion of temporal con-

straints on edges. We previously introduced constraints on edges, which have the general form of two

arithmetic expressions involving tuple fields, i.e.

<Constraint> := <Expr> <Comp> <Expr>

Temporal constraints are simply constraints where the expressions involve the temporal fields of

the tuples. For convenience we define start() to return the time that the edge started in seconds,

and end() as the time that the edge ended; for an edge e = (u, v, t, δ), start(e) = t and end(e) = t + δ. Otherwise, to obtain a value of a tuple field, one uses the form <variable>.<field>, e.g. e1.TimeSeconds. Now we define Temporal Subgraph

Queries using temporal constraints.

Definition 4. Temporal Subgraph Query: For a temporal graph G, a temporal subgraph query is

composed of a graph H with edges of the form (u, v), where each of the edges may have temporal

constraints defined. The result of the query is all subgraphs H ′ ∈ G that are isomorphic to H and

that fulfill the defined temporal constraints.

Below is an example triangular temporal subgraph query:


Listing 2.3: Temporal Triangle Query

1 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)

2 {

3   x1 e1 x2;

4   x2 e2 x3;

5   x3 e3 x1;

6   start(e3) - start(e1) <= 10;

7   start(e1) < start(e2);

8   start(e2) < start(e3);

9 }

The first part of the query with the variables x1, x2, and x3 defines the subgraph structure

H. In this example, we have defined a triangle. The second part of the query defines the temporal

constraints. Edge e1 starts before e2, e2 starts before e3, and the entire subgraph must occur

within a 10 second window from the start of the first edge to the start of the last edge. For temporal

queries where a maximum window is specified, the problem of finding all matching

subgraphs is polynomial. As far as we are aware, no one else has examined the complexity of

subgraph matching within a streaming setting with temporal constraints.

Theorem 1. For a streaming temporal graph G, let n be the maximum number of edges that arise

during a time duration of length ∆q. A temporal subgraph query q with d edges and maximum

temporal extent ∆q and computed on an interval of edges of length ∆q has temporal and spatial

complexity of O(n^d).

Proof. For an interval of length ∆q, there are at most n edges. Assume each of those n edges

satisfies one edge of q. Then there are at most n potential matching subgraphs in the interval. Now assume for each of those n potential matching subgraphs, each of the n edges satisfies another of the edges in q. Then there are n^2 potential matching subgraphs. If we continue for d iterations, we have a max of n^d matching subgraphs. Thus to find and store all the matching subgraphs, it would require O(n^d) operations and O(n^d) space to store the results.


The point of Theorem 1 is to show that subgraph matching on streaming data is tractable, and

that over any window of time that is about the length of the temporal subgraph query, computing

over that window has polynomial complexity, as long as the number of edges in a query has some

maximum value.
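To make the counting argument behind Theorem 1 concrete, the following C++ sketch enumerates candidate matches for a query of d edge predicates over the n edges that fall within a window of length Δq. It is a brute-force illustration of the bound, not SAM's matching algorithm: each level of the recursion multiplies the number of partial matches by at most n, giving O(n^d) work and space in the worst case.

#include <functional>
#include <vector>

struct TemporalEdge {
  int u = 0, v = 0;       // source and target vertices
  double t = 0, dur = 0;  // start time and duration
};

// One query edge: a predicate over a candidate edge and the partial match built so far.
using EdgePredicate =
    std::function<bool(const TemporalEdge&, const std::vector<TemporalEdge>&)>;

// Brute-force assignment of window edges to the d query edges. With n edges in the
// window, level k holds at most n^k partial matches, matching the O(n^d) bound.
void enumerate(const std::vector<TemporalEdge>& window,
               const std::vector<EdgePredicate>& query,
               std::vector<TemporalEdge>& partial,
               std::vector<std::vector<TemporalEdge>>& results) {
  if (partial.size() == query.size()) { results.push_back(partial); return; }
  const EdgePredicate& pred = query[partial.size()];
  for (const TemporalEdge& e : window) {
    if (pred(e, partial)) {
      partial.push_back(e);
      enumerate(window, query, partial, results);
      partial.pop_back();
    }
  }
}

int main() {
  // A tiny window and a two-edge query: the second edge must leave the first edge's
  // target and start within 10 seconds of the first edge.
  std::vector<TemporalEdge> window = {{1, 2, 0.0, 1.0}, {2, 3, 4.0, 1.0}, {2, 3, 20.0, 1.0}};
  std::vector<EdgePredicate> query = {
      [](const TemporalEdge&, const std::vector<TemporalEdge>&) { return true; },
      [](const TemporalEdge& e, const std::vector<TemporalEdge>& p) {
        return e.u == p[0].v && e.t > p[0].t && e.t - p[0].t <= 10.0;
      }};
  std::vector<TemporalEdge> partial;
  std::vector<std::vector<TemporalEdge>> results;
  enumerate(window, query, partial, results);
  return static_cast<int>(results.size());  // one match in this example
}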

In this section we’ve introduced the main principles of expressing a subgraph query: defining

the source and target fields, edge expressions, edge constraints, and vertex constraints. In the next

chapter we will see examples of SAL applied to common patterns arising in cyber security.

Chapter 3

SAL as a Query Language for Cyber Problems

This chapter covers several cyber use cases and analyzes how they can be expressed as SAL

programs. To ensure broad coverage, we use Verizon’s Data Breach Investigations Report (DBIR)

of 2018 [203] as the basis for our discussion. The DBIR presents an overview of cyber security

incidents and breaches. They define an incident as a security event that compromises the integrity,

confidentiality or availability of an information asset. A breach is more substantial, and defined

as an incident that results in the confirmed disclosure, not just potential exposure, of data to an

unauthorized party. As of the 2018 report, Verizon had recorded 333,000 incidents and 16,000

breaches. They group these observed cyber attacks into nine different categories, namely

• Crimeware

• Cyber-Espionage

• Denial of Service

• Lost and Stolen Assets

• Miscellaneous Errors

• Payment Card Skimmers

• Point of Sale

• Privilege Misuse


• Web Applications

Verizon reports that 94% of incidents and 90% of breaches fall within the nine categories. They

have another category, Everything Else, for attacks that don’t fit the nine. We will discuss how

SAL could be used to define the patterns that comprise each category. However, we do not cover

Lost and Stolen Assets (e.g. a stolen laptop) and Payment Card Skimmers (e.g. a skimming device

physically implanted on an ATM), as these categories relate to physical security, which is out of

scope. Also, we do not cover Miscellaneous Errors, which largely deals with misconfigurations (e.g.

default passwords on a database).

For the code listings in this section, to avoid frequent repetition we assume that a Netflows

stream has already been created, and that the data has been partitioned. For an example see lines

5-10 of Listing 2.1.

3.1 Crimeware

The Crimeware category covers instances of malware that are largely opportunistic in nature and financially motivated, but that do not fit a more specific pattern. An example of this is the

ZeroAccess botnet [209]. The ZeroAccess botnet had its heyday in 2012 when it had over one million

infected hosts at its disposal. The motivations behind ZeroAccess were purely financial, with much

of the revenue coming from click fraud. The botmasters would send lists of URLs of advertisements,

and the controlled machines would simulate browsers clicking on those advertisements. They would

then collect the ad revenue from the generated activity. At its height, researchers estimate that

ZeroAccess garnered around $100,000 a day from click fraud.

To detect click fraud in general, one could start with the hypothesis that the click-through

behavior of bots is different from that of humans. It is probably difficult to know what traffic constitutes a

click through, but we can monitor how clients interact with web servers. Humans interact with the

web in Poisson-like or bursty distributions [202], while a bot would likely create relatively constant

activity.


We begin by creating a definition of a web server in Listing 3.1, similar to our approach for

Disclosure in Section 2.1. However, we focus on IPs where the majority of the traffic is directed

towards port 80.

Listing 3.1: Definition of Web Servers

1 VertsByDest = STREAM Netflows BY DestIp;

2 Top1 = FOREACH VertsByDest GENERATE topk(DestPort, 10000, 1000, 1);

3 Servers = FILTER VertsByDest BY Top1.value(0) > 0.6;

4 Servers = FILTER Servers BY Top1.key(0) == 80;

Also similar to the Disclosure use case, we use a STREAM BY statement to create logical

streams of source/destination pairs in Listing 3.2. We then find the inter-arrival times and calculate

statistics, again similar to Disclosure.

Listing 3.2: Calculating Inter-arrival Times

1 DestSrc = STREAM Servers BY DestIp, SourceIp;

2 TimeLapseSeries = FOREACH DestSrc TRANSFORM

3   (TimeSeconds - TimeSeconds.prev(1)) : TimeDiff;

4 TimeDiffVar = FOREACH TimeLapseSeries GENERATE var(TimeDiff);

5 TimeDiffMed = FOREACH TimeLapseSeries GENERATE median(TimeDiff);

6 TimeDiffAve = FOREACH TimeLapseSeries GENERATE ave(TimeDiff);
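As a rough sketch of the per-key state implied by a TRANSFORM using prev(1), the following C++ fragment keeps only the previous timestamp for each (DestIp, SourceIp) pair and emits the inter-arrival time for each new netflow. The structure and names are illustrative assumptions, not SAM's implementation.

#include <optional>
#include <string>
#include <unordered_map>

// State for "TimeSeconds - TimeSeconds.prev(1)": only the previous timestamp
// of each logical (DestIp, SourceIp) stream needs to be retained.
class InterArrival {
 public:
  // Returns the time difference for this key, or nothing for the first tuple seen.
  std::optional<double> update(const std::string& dest_ip, const std::string& src_ip,
                               double time_seconds) {
    const std::string key = dest_ip + "|" + src_ip;
    std::optional<double> diff;
    auto it = prev_.find(key);
    if (it != prev_.end()) diff = time_seconds - it->second;
    prev_[key] = time_seconds;
    return diff;
  }

 private:
  std::unordered_map<std::string, double> prev_;
};

int main() {
  InterArrival ia;
  ia.update("192.168.0.10", "10.0.0.2", 100.0);            // first tuple: no difference yet
  auto d = ia.update("192.168.0.10", "10.0.0.2", 103.5);   // difference of 3.5 seconds
  return (d && *d == 3.5) ? 0 : 1;
}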

In contrast to Disclosure, we are interested in statistics about the clients, not the servers. We

want to categorize clients into two bins: bot-generated and human-generated behavior. As such we

collapse on the source IP, and calculate features based on the aggregates in Listing 3.3.


Listing 3.3: Calculating Time-based Client Statistics

1 SrcOnly = COLLAPSE TimeLapseSeries BY SourceIp FOR TimeDiffVar, TimeDiffMed,

2   TimeDiffAve;

3 AveTimeDiffVar = FOREACH SrcOnly GENERATE ave(TimeDiffVar);

4 VarTimeDiffVar = FOREACH SrcOnly GENERATE var(TimeDiffVar);

5 AveTimeDiffMed = FOREACH SrcOnly GENERATE ave(TimeDiffMed);

6 VarTimeDiffMed = FOREACH SrcOnly GENERATE var(TimeDiffMed);

7 AveTimeDiffAve = FOREACH SrcOnly GENERATE ave(TimeDiffAve);

8 VarTimeDiffAve = FOREACH SrcOnly GENERATE var(TimeDiffAve);

From these generated features, one could use a clustering-based approach to understand the

data. In an ideal world, there would be one cluster for human-generated traffic and another for

bot-generated traffic. However, it is more likely that an iterative process is required. In the case of

the ZeroAccess bot, it was a peer-to-peer-based bot, and communication between peers used four

particular ports: 16464, 16465, 16470, and 16471. In the early phases of analyzing data, this kind

of peculiar communication pattern might become apparent from clustering on a first pass, and one

can add a filter to look for that specific pattern as in Listing 3.4.

Listing 3.4: Suspicious Ports

1 VertsByDest = STREAM Netflows BY DestIp;

2 Top10Ports = FOREACH VertsByDest GENERATE topk(DestPort, 10000, 1000, 10);

3 BadClients = FILTER VertsByDest BY 16464 in Top10Ports;

4 BadClients = STREAM BadClients BY SourceIp;

5 Top1 = FOREACH VertsByDest GENERATE topk(DestPort, 10000, 1000, 1);

6 Servers = FILTER VertsByDest BY Top1.value(0) > 0.6;

7 Servers = FILTER Servers BY Top1.key(0) == 80;

8 Servers = FILTER Servers BY SourceIp in BadClients;

On line 4, we switch the stream key from DestIp to SourceIp so that on line 8 we

can test if the SourceIp of the Servers traffic is in BadClients. At the end, Servers is filtered

down to netflows where the destination IP is characterized as a web server and the client IP has


generated traffic on port 16464.

The ZeroAccess bot is a good example of how SAL could be used to test out hypotheses. One

starts with an idea of what click fraud looks like from network behavior, and then we can begin to extract useful data and calculate features that allow researchers to test their hypotheses.

3.2 Cyber-Espionage

According to the Verizon report, this category involves malicious activity linked to state-

affiliated actors and/or exhibiting the motive of espionage. They also state that 93% of

breaches are attributed to state-affiliated groups or nation-states. In another report, Symantec [195]

advises that the number of targeted attack groups, i.e. groups that are professional, highly organized, and that target specifically rather than indiscriminately, grew at a rate of 29 groups a year between 2015 and 2017, from a total of 87 to 140. The evidence points to growing sophistication and

technical capabilities of adversarial groups.

Figure 3.1: A subgraph showing the flow of messages for data exfiltration. A controller sends a small control message to an infected machine, which then sends large amounts of data to a dropbox.

A common activity of cyber-espionage is the exfiltration of data. Joslyn et al. [109] provide an

exfiltration query that we express in terms of SAL. Figure 3.1 presents the overall graph structure.

Listing 3.5 gives the SAL code. We assume a Static Vertex Attributes table named Local has


been defined and is a single column with IP addresses. The presence of the IP address in the table

implies that the address is behind the firewall. Lines 4 and 5 define the topology of the graph. The temporal constraints of lines 6 and 7 define the temporal ordering, that e1 happens before e2, and that the time between the end of e1 and the start of e2 should be less than ten seconds. Lines 8 through 10 specify that target is a machine behind the firewall, while dropbox and controller are not local. Finally, lines 11 and 12 state that the number of bytes coming from the controller to the target is relatively small (< 20 bytes) while the data going from target to dropbox is large (> 1000 bytes).

Listing 3.5: Exfiltration Query

1 SUBGRAPH ON Netflows WITH source(SourceIp)

2   AND target(DestIp)

3 {

4   controller e1 target;

5   target e2 dropbox;

6   end(e1) < start(e2);

7   start(e2) - end(e1) < 10;

8   target in Local;

9   dropbox not in Local;

10  controller not in Local;

11  e1.SrcPayloadBytes < 20;

12  e2.SrcPayloadBytes > 1000;

13 }

3.3 Denial of Service

Denial of Service is an attack where the intent is to deny intended users access to a networked

resource. Often, Denial of Service is a distributed undertaking, with large botnets employed to

send enormous volumes of traffic to the victim, overwhelming resources so that legitimate users are

unable to reach the site. This is known as Distributed Denial of Service, or DDoS. Jonker


et al. [108] note that one third of all active /24 networks experienced DDoS attacks over a two

year period. Attacks have been seen exceeding 1 terabit per second, with an attack against GitHub peaking at 1.35 Tbps [56]. DDoS occurs with alarming frequency, with Jonker et al. reporting

about 30,000 attacks per day.

The primary means of conducting a DDoS is through a botnet. There are even DDoS-as-a-

Service offerings that allow you to rent a botnet for a DDoS attack [173]. One structure for botnets

is the command and control variant (C&C). Below we list a potential query to identify a subgraph

associated with denial of service. There is one controller node that sends out a message to multiple

infected machines, which then send messages to a target node. The average size of the messages

of the infected machines is small (e.g. a SYN flood). The query matches only a small subgraph of the actual behavior, in which there would be many infected machines.


Listing 3.6: Denial of Service Query

1 VertsBySrc = STREAM Netflows BY SourceIp;

2 FlowsizeAve = FOREACH VertsBySrc GENERATE ave(SrcPayloadBytes);

3 SmallMessages = FILTER VertsBySrc BY FlowsizeAve < 10;

4 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)

5 {

6   controller e1 node1;

7   controller e2 node2;

8   node1 e3 target;

9   node2 e4 target;

10  start(e1) < start(e2);

11  start(e2) < start(e3);

12  start(e3) < start(e4);

13  start(e4) - start(e1) < 10;

14  node1 in SmallMessages;

15  node2 in SmallMessages;

16 }

3.4 Point of Sale

Point of Sale (POS) intrusions cover attacks that are conducted remotely but ultimately

target POS terminals and controllers. During a POS transaction, communication of the sensitive

information is largely encrypted. The vulnerable time when credit card and other information is

not encrypted is when it is loaded into memory of a POS terminal. Thus, most POS-associated

malware scans the memory of running processes and performs a regular expression search for credit

card information, which is then aggregated and communicated back to the attacker.

The Target Breach [180], which occurred in 2013, was one of the first large-scale POS breaches,

where 40 million card credentials and 70 million customer records were stolen. While each POS

breach has some unique elements, in particular how the POS malware is distributed across a


retailer’s network, the general pattern of how data is collected and communicated back to the

attacker has many commonalities:

• POS malware skims the memory of running processes, using regular expressions to discover

credit card information.

• Found credential information is sent to a server. In the case of the Target Breach, a small

set of compromised servers behind Target’s firewall aggregated the information from a

large number of POS terminals. The compromised servers sent the aggregated credit card

information to an external address controlled by the attacker.

A basic SAL query that could cover this pattern is provided in Listing 3.7. In this example,

we have a Static Vertex Attributes table named Local that has two columns: the key column with the IP addresses, and another column labeled type. This allows us to check if an IP is behind

the firewall, and if it is a server or a POS machine.

Listing 3.7: POS Subgraph Query

1 SUBGRAPH ON Netflows WITH source(SourceIp)

2   AND target(DestIp)

3 {

4   pos e1 server;

5   server e2 drop;

6   Local(pos).type == "POS";

7   Local(server).type == "Server";

8   drop not in Local;

9 }

3.5 Privilege Misuse

The privilege misuse category largely deals with insider threat. A brochure from the FBI

[58] lists several examples of insider threats, with a common theme of collecting many documents,


electronically or physically, sometimes on the order of hundreds of thousands. One pattern that

could be considered suspicious is downloading/accessing many documents within a short period

of time. For this query, we have a different graph than what we've used previously. We need richer semantic information than what is usually provided by netflows. Namely, we need a way

of identifying document accesses by individuals. This requires that we have authentication in place

to determine who is generating the traffic, and that we can identify when a particular document

is being accessed. For many companies and institutions, this type of infrastructure is already in

place.

For the purposes of discussion, we will assume that there are tuples generated with the given

fields:

• PersonId: the identifier of the person accessing the document.

• DocumentId: the identifier assigned to the document.

• TimeSeconds: the time in seconds from epoch that the access occurred.

We refer to this stream of tuples as a DocumentStream. We will assume that there is a hash

function defined that will partition the data by PersonId. The SAL program in Listing 3.8 finds

the average time between document accesses, and creates a filtered list of users that have an average

time between document accesses less than 0.1 seconds.

Listing 3.8: Malicious Insider Query

1 Documents = DocumentStream("localhost", 9999);

2 PARTITION Documents BY PersonId;

3 HASH PersonId WITH PersonHashFunction;

4 ByPerson = STREAM Documents BY PersonId;

5 TimeLapseSeries = FOREACH ByPerson TRANSFORM

6   (TimeSeconds - TimeSeconds.prev(1)) : TimeDiff;

7 TimeDiffAve = FOREACH TimeLapseSeries GENERATE ave(TimeDiff);

8 Downloaders = FILTER TimeLapseSeries BY TimeDiffAve < 0.1;


3.6 Web Applications

Web application attacks are defined as any incident in which a web application was the vector

of attack. An attack that fits this category is Cross-site Scripting (XSS). An XSS attack is possible when a web site accepts unsanitized user input that is then displayed back on the website. For example, if a site accepts comments, a malicious actor can insert JavaScript code that is then added to the HTML page as a comment. When other users visit the compromised URL, the malicious JavaScript is run, opening the door for theft of cookies, event listeners that register keystrokes, and

phishing through the use of fake/modified logins.

Figure 3.2: A subgraph showing an XSS attack.

The general pattern is similar to a Watering Hole Attack. Here a client browser accesses a

frequent site or server. The downloaded HTML has malicious JavaScript embedded, which causes the client browser to access a relatively rare site and upload data to the attacker-controlled site.

Figure 3.2 presents the subgraph of interest. However, we can add additional details to make the

query more specific to XSS. Listing 3.9 provides a query for XSS attacks in SAL. Lines 1 - 3 create

a definition of a server, similar to the Disclosure query. On line 4 we create a reference to a Static

Vertex Attributes table, DomainInfo. DomainInfo has two columns, the key column being IP,

and another column being the associated domain. This allows us to test that the domain of the

server or frequent site is not equal to the domain of the malicious IP (line 12).


It is also important to point out lines 15 and 16, which access the application protocol

being used. Most netflow formats do not include application protocol information; however, for

some of the more common protocols (e.g. HTTP, HTTPS, SMTP) tools like Wireshark [208] are

relatively good at decoding application layer protocols. Thus it is conceivable that the netflow

tuples could be augmented with common application level protocols.

Listing 3.9: XSS Attack Query

1 VertsByDest = STREAM Netflows BY DestIp;

2 Top2 = FOREACH VertsByDest GENERATE topk(DestPort, 10000, 1000, 2);

3 Servers = FILTER VertsByDest BY Top2.value(0) + Top2.value(1) > 0.9;

4 DomainInfo = VertexInfo("domainurl", 5000);

5

6 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)

7 {

8   client e1 frequent;

9   client e2 malicious;

10  frequent in Servers;

11  client not in Servers;

12  DomainInfo(frequent).domain != DomainInfo(malicious).domain;

13  start(e2) - start(e1) <= 10;

14  start(e1) < start(e2);

15  e1.ApplicationProtocol == "http";

16  e2.ApplicationProtocol == "http";

17 }

3.7 Summary

In this chapter we’ve applied SAL to a broad range of cyber problems, using Verizon’s Data

Breach Investigations Report as the basis for exploring coverage. SAL covers a wide range of use

cases within the cyber realm.

Chapter 4

Related Work

Given that we are combining aspects of graph theory, machine learning, and streaming into

one domain specific language, there is a large body of related work.

In terms of graph research, there are several different related subareas:

• Vertex-centric Computing: This is a popular way to express parallel graph operations based

on the Bulk Synchronous Parallel paradigm [200]. While a useful approach for batch

processing of static graphs, its application to streaming graph computations is less relevant.

We discuss this avenue of research in Section 4.1.

• Graph Domain Specific Languages: A number of DSLs have been developed to express

graph algorithms. However, these DSLs largely target shared-memory architectures, op-

erate on static graphs, and do not have the ability to express machine learning pipelines.

Graph DSLs are discussed in Section 4.2.

• Linear Algebra-based Graph Systems: Many graph algorithms can be expressed as compu-

tations on matrices and as such, systems have been developed that adopt this approach.

These approaches are discussed in Section 4.3.

• Graph APIs: In Section 4.4 we discuss other libraries that tackle graph computations, but

aren’t necessarily DSLs.

• Graph computations using GPU: In Section 4.5 we discuss approaches that target the GPU

for graph computations.


• SPARQL: The semantic web [21] is a graph with a rich set of edge types. As such, the

query language for the semantic web, SPARQL, has as its basis basic graph patterns

that are essentially subgraphs. SPARQL influenced the design of SAL. Related efforts are

discussed in Section 4.6.

Machine learning is a rich and active research area. We discuss common frameworks in

Section 4.7. Data Stream Management Systems (DSMS) are designed to handle streaming data as

opposed to static data. The initial approaches were largely ports of relational database concepts to

streaming data. Lately a plethora of projects with distributed APIs for handling streaming have

become available, largely through Apache projects. These efforts are discussed in Section 4.8. We

also discuss Network Telemetry in Section 4.9, which is an area dealing with network monitoring

using network infrastructure devices such as switches.

4.1 Vertex-centric Computing

The idea of vertex-centric programming was popularized by Google with their paper on Pregel

[131], a distributed graph platform with operations similar to MapReduce. Vertex-centric pro-

gramming is based on Valiant’s Bulk Synchronous Parallel (BSP) [200] paradigm. BSP prescribes

a method for organizing computation in order to both predict parallel performance, and to simplify

the task of parallelizing algorithms. There are three phases:

(1) Concurrent computation: Each node performs computation independent of all other nodes.

Memory access occurs only from the node's local memory.

(2) Communication: The nodes communicate with each other.

(3) Barrier Synchronization: All nodes are stalled until the communication phase is over.

Algorithms described in terms of vertex-centric computation have a similar pattern. For

example, consider the PageRank algorithm. Each vertex is assigned an initial PageRank value, and

then the algorithm iterates until the PageRank values stabilize:


(1) Each vertex takes its current PageRank score, divides by the number of outgoing edges,

and then sends that value to all of its neighbors. This is the communication phase of BSP.

(2) Progress halts until all vertices have received messages with updated PageRank values from

their neighbors. This is the synchronization phase of BSP.

(3) Each vertex re-computes the PageRank score based upon the updated values from its

neighbors. This is the concurrent computation phase of BSP.

In essence, vertex-centric programming is BSP where the communication phase is limited to nodes

connected by edges in the graph.
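The following C++ sketch makes the three phases concrete for PageRank; it is a serial illustration of the vertex-centric pattern described above, not the API of any particular framework.

#include <vector>

// Toy vertex-centric PageRank. Each superstep sends rank/outdegree along outgoing
// edges (communication), implicitly waits for all messages (synchronization), and
// then recomputes every vertex's score from the received values (computation).
std::vector<double> pagerank(const std::vector<std::vector<int>>& out_edges,
                             int supersteps, double damping = 0.85) {
  const int n = static_cast<int>(out_edges.size());
  std::vector<double> rank(n, 1.0 / n);
  for (int step = 0; step < supersteps; ++step) {
    std::vector<double> incoming(n, 0.0);          // messages received by each vertex
    for (int v = 0; v < n; ++v) {                  // communication phase
      if (out_edges[v].empty()) continue;
      const double share = rank[v] / out_edges[v].size();
      for (int nbr : out_edges[v]) incoming[nbr] += share;
    }
    // (barrier: all messages for this superstep have been delivered)
    for (int v = 0; v < n; ++v)                    // concurrent computation phase
      rank[v] = (1.0 - damping) / n + damping * incoming[v];
  }
  return rank;
}

int main() {
  // Three vertices forming a cycle: 0 -> 1, 1 -> 2, 2 -> 0.
  std::vector<std::vector<int>> graph = {{1}, {2}, {0}};
  std::vector<double> r = pagerank(graph, 20);
  return r.size() == 3 ? 0 : 1;
}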

Since Pregel was developed, a number of other platforms have adopted the vertex-centric

programming model. Below is a list:

• Giraph is the open source Apache version of Pregel [93].

• GraphLab [127, 126] and PowerGraph [74] arose from the same group at Carnegie Mellon.

GraphLab started as a shared-memory implementation. PowerGraph moved the approach

to a distributed platform and used the vertex-centric approach for parallelization. Pow-

erGraph uses MPI as the underlying mechanism for parallelization and is restricted to

in-memory computations with no fault tolerance. The latest instantiation, GraphLab Cre-

ate, backs away from enabling custom algorithm development, but instead defines a Python

interface backed by C++ code. GraphLab Create added support for out-of-core computa-

tions.

• GraphX [210] extends Spark’s Resilient Distributed Dataset (RDD) to form a new abstrac-

tion called the Resilient Distributed Graph (RDG). While not strictly vertex-centric, they

show through their API calls to updateVertices and aggregateNeighbors that they

can implement the patterns defined by Pregel and PowerGraph. Because GraphX is built

on top of Spark, it is distributed and the RDG mechanism provides fault-tolerance.


• Grace [165] is an in-memory graph management system. It is relatively unique in that

it includes support for transactions and ACID semantics. Besides an API for performing

updates to the graph (e.g. vertex or edge addition or deletion), the programming paradigm

is also vertex-centric.

• Graph Processing System (GPS) [171, 99] is a Pregel-like distributed system which added

these features: 1) API support for global computations, 2) dynamic repartitioning, and 3)

distribution of high-degree vertices. In [99] they add Green-Marl [98] as a domain specific

language that is then mapped to GPS’s code base. Green-Marl is a DSL for expressing

graph algorithms, which we discuss in more detail in Section 4.2.1.

• GraphChi [119] examines out-of-core graph computations on a single node by performing

operations on sets of edges that are stored sequentially on disk. Disabling random access

increases the number of edges read from disk, but the speed of the sequential access more

than compensates.

• CuSha [116] is a vertex-centric graph processing system that is CUDA-based. To overcome

the problem of irregular memory accesses, resulting in poor GPU performance, they use

two graph representations: G-Shards and Concatenated Windows. G-Shards uses the shard

concept from GraphChi [119] to organize edges into ordered sets for memory access patterns

better suited for GPU computations. Concatenated Windows is a modification of G-Shards

for sparse graphs. The size of a shard in G-Shards is inversely related to sparsity: the

sparser the graph, the smaller the shard size on average. This can lead to GPU under-

utilization. Concatenated Windows combines shards into the same computation window.

• GraphGen [156] is an FPGA approach for vertex-centric computations. It takes a vertex-

centric graph specification and compiles that into the targeted FPGA platform. They show

speedups between 2.9x and 14.6x over CPU implementations.

• X-Stream [170] can perform both in-memory and out-of-core computations on graphs, but


only for a single shared-memory machine. The novelty of this approach is that the programming

paradigm is edge-centric rather than vertex-centric. By adopting an edge-centric model,

they can avoid some of the overhead incurred by GraphChi [119], which must pre-sort its

shards by source vertex.

• VENUS [38] introduces optimizations to the original out-of-core approach of GraphChi

[119], such as handling the structure of the graph, which doesn’t change, differently from

vertex values, which change over the computation.

• Pregelix [29] was one of the first distributed approaches that allowed out-of-core workloads

for vertex-centric programming. They employ database-style storage management and map

vertex-centric programs to a query plan of joins and computations on tables.

• CHAOS [169], another distributed out-of-core approach, is a generalization of the X-Stream

approach [170] for a distributed setting. The main assumption behind CHAOS is that the

network is not the bottleneck, but instead reading from disk. Thus if a node needs a shard

from a remote node, the latency is about the same as reading from a local drive. They

show weak scaling results to 32 nodes, processing a graph 32 times larger than on one node,

but only taking 1.61 times longer.

• PowerLyra’s [37] contribution is to recognize that processing low-degree vertices should

be done differently than high-degree vertices in the gather-apply-scatter approach (BSP).

High-degree vertices have their edges partitioned across the cluster, while low degree vertices

are contained entirely within one node.

• GraphD [211] is also another distributed out-of-core approach. They show significant im-

provement over Pregelix and CHAOS for certain classes of problems. Part of the reason

behind the speedup is GraphD’s ability to eliminate external-memory joins by using mes-

sage combiners.

• FlashGraph [219] designs its vertex-centric computations around an array of solid-state


drives (SSDs). Vertex information is kept in memory, while edges are read-only and reside

on the SSDs. Their performance compares favorably to in-memory approaches, and they

are able to compute on a 129 billion edge graph on a single node.

• G-store [118] is another vertex-centric approach utilizing flash drives. They outperform

FlashGraph by up to 2.4× and can run some algorithms on a trillion edge graph in tens of

minutes. They improve the cache hit ratio on multi-core CPUs using a tile-based physical

grouping on disks. They also use a slide-cache-rewind approach where G-Store tries to

utilize data that has already been fetched.

Table 4.1 provides a summary of all the vertex-centric methods discussed above. It shows

which approaches run on only a single node or if an approach can run on multiple nodes in a

distributed setting. Also, it shows which frameworks are in-memory only and which can utilize

out-of-core or computations on flash drives.

[Table 4.1 compares the vertex-centric frameworks Pregel, Giraph, GraphLab, PowerGraph, GraphX, GPS, GraphChi, X-Stream, VENUS, Pregelix, CHAOS, PowerLyra, GraphD, FlashGraph, and G-Store along the attributes Single Node, Multiple Nodes, In-memory Only, Out-of-core, and Flash.]

Table 4.1: A table of vertex-centric frameworks and various attributes.

None of the above methods provide any native support for temporal information. However,

it would be possible to encode temporal information on the edges (most provide support for defin-

ing arbitrary data structures on the edges and vertices). Then it would be incumbent upon the


programmer to properly use this information to implement various forms of the temporal logic.

The biggest impediment to using vertex-centric computing for a streaming setting is that it is by nature

a batch process. Adapting it to a streaming environment would likely entail creating overlapping

windows of time over which to perform queries in batch. Of course, SAL is a much higher level of

abstraction than any of the interfaces provided by these frameworks.

4.2 Graph Domain Specific Languages

A major contribution of this work is the presentation of a domain specific language for

streaming temporal graph computations that allows succinct algorithmic descriptions. In this section

we give an overview of other domain specific languages that specifically target graph computations.

We discuss the following works:

• Green-Marl [98]

• Ligra [181]

• Galois [152]

• Gemini [220]

• Grazelle [83]

• GraphIt [217]

A limitation for most of these DSLs, except for Gemini, is that they target shared memory

machines. While some of the concepts could be generalized to distributed memory approaches,

their current implementations do not include support for distributed settings. Another work, called

Gluon [49], is an approach that can be used to convert existing shared-memory infrastructures into

distributed frameworks. Gluon provides a lightweight API that allows an existing shared memory

graph system to partition the graph across a cluster. The primary requirement is that the shared

memory graph system provides a vertex-centric programming model (discussed in Section 4.1). The


creators of Gluon provide a performance evaluation of Ligra, Galois, and IrGL (a GPU, single-node

graph system) [159] that have been extended to a distributed setting, called D-Ligra, D-Galois, and

D-IrGL. Overall they show that D-Galois and D-IrGL outperform Gemini.

Regardless of whether these DSL approaches can be adjusted to a distributed setting, a

fundamental difference between this set of work and our own is the streaming aspect of our approach

and domain. An underlying assumption of the approaches in this section is that the data is static.

The algorithmic solutions, partitioning, operation scheduling, and underlying data processing rely

upon this assumption.

4.2.1 Green-Marl

Green-Marl [98] attempts to increase programmer productivity through the use of high level

constructs that can be used to succinctly describe graph algorithms. One of these constructs is

traversals over the graph, either breadth-first or depth-first. Below is their implementation of

betweenness centrality using Green-Marl. They iterate over each node and perform breadth-first

search from the node to find the number of shortest paths through each node. They make use of

InBFS and InRBFS, which iterate through nodes in breadth-first and reverse breadth-first order,

respectively.

Listing 4.1: Green-Marl Betweenness Centrality

1  Procedure Compute_BC(G: Graph, BC: Node_Prop<Float>(G)) {

2    G.BC = 0; // initialize BC

3    Foreach (s: G.Nodes) {

4      // define temporary properties

5      Node_Prop<Float>(G) Sigma;

6      Node_Prop<Float>(G) Delta;

7      s.Sigma = 1; // Initialize Sigma for root

8      // Traverse graph in BFS-order from s

9      InBFS(v: G.Nodes From s)(v != s) {

10       // sum over BFS-parents

11       v.Sigma = Sum(w: v.UpNbrs){w.Sigma};

12     }

13     // Traverse graph in reverse BFS-order

14     InRBFS(v != s) {

15       // sum over BFS-children

16       v.Delta = Sum(w: v.DownNbrs){ v.Sigma / w.Sigma * (1 + w.Delta) };

17       v.BC += v.Delta @s; // accumulate BC

18   }}}

When BFS and DFS are not sufficient to express a desired computation, the framework allows

you to access the vertices and edges and write the for-loops yourself. For many situations, Green-

Marl provides implicit parallelism, with the compiler automatically parallelizing loops. Besides

the BFS and DFS traversals and implicit parallelism, Green-Marl also provides various collections,

deferred assignment (allows for bulk synchronous consistency), and reductions.

4.2.2 Ligra

Ligra [181] has more flexible mechanisms than Green-Marl for writing graph traversals. While

Green-Marl only allows BFS and DFS, Ligra can operate over arbitrary sets of vertices on each

iteration. This allows it to naturally and succinctly express algorithms for radii estimation and

Bellman-Ford shortest path, which do not use BFS or DFS traversals.

4.2.3 Galois

The researchers behind Galois [152] argue that BSP-style computations are insufficient for

performance across a variety of input graphs and graph algorithms. They present a light-weight


infrastructure that can represent the APIs of GraphLab [127], GraphChi [119], and Ligra [181].

Galois also provides fine-grained scheduling mechanisms through the use of priorities per task.

4.2.4 Gemini

Gemini [220] is a graph DSL similar to Ligra, but with an implementation that addresses

shortcomings of graph computations in a distributed setting. In McSherry et al. [139], the authors

note that a single threaded approach beats many scalable distributed graph frameworks. Gemini

adopts several approaches to make scalability efficient for graph computations:

• Adaptive push/pull vertex updates depending upon sparsity/density of active edge set:

When there are few (sparse) active edges, pushing updates to vertices along outgoing edges is generally more efficient, while for many (dense) active edges, it is more efficient to pull vertex

updates from neighboring vertices.

• Chunk-based partitioning: They try to capture locality within the graph structure by using

a simple scheme for creating contiguous chunks of vertices.

• CSR and CSC: Similar to SAM, they index edges using compressed sparse row and com-

pressed sparse column format. However, they employ bitmaps and a doubly compressed

sparse column approach to reduce the memory footprint of both data structures.

• NUMA-aware: They apply their chunk-based partitioning recursively in an attempt to

factor in NUMA-characteristics of today’s multicore architectures.

• Work-stealing: While the overall computation is a bulk-synchronous parallel computation,

they employ types of work-stealing within a node to better load balance computation

intranode.

With these optimizations, Gemini is able to outperform other distributed graph implementations

between 8.91× and 39.8× on an eight node cluster connected with 100 Gbps Infiniband.


4.2.5 Grazelle

Grazelle [83] focuses on pull-based graph algorithms where an outer loop iterates over desti-

nation vertices and an inner loop over in-edges. They optimize this pattern with a scheduler-aware

interface for the parallel loops that allows each thread to execute consecutive iterations and a

vector-sparse format that overcomes some of the limitations of compressed-sparse data structures,

particularly with unaligned memory accesses that occur with low-degree vertices. They compare

their performance on single node against four other graph systems with improvements ranging

between 4.6× and 66.8×.

4.2.6 GraphIt

GraphIt [217] is another domain specific language for graph applications. They make the

argument that computations on graphs are difficult to reason about. For example, the performance

of an implementation of a graph algorithm can vary widely depending on the structure of the under-

lying graph. Their DSL, GraphIt, separates the what (the end result) from the how (the schedule

of operations). This allows programmers to specify a correct algorithm, and then separately tinker

with the specific optimization strategies. They also present an autotuner that will automatically

search for optimizations.

GraphIt allows programmers to reason about three different tradeoffs in optimizations: lo-

cality, work efficiency, and parallelism. Often, one can optimize one dimension at the detriment of

the others. For example, load-balancing is often an issue with graph computations. One can spend

more time and instructions (hurting work efficiency) to enable better load-balancing and locality. It

is often difficult to know a priori which optimizations are appropriate for a given algorithm and set

of data. GraphIt allows programmers (and the autotuner) to specify eleven different optimizations

which have varying effects on the tradeoff space. These strategies include: DensePull, DensePush,

DensePull-SparsePush, DensePush-SparsePush, edge-aware-vertex-parallel, edge-parallel, bitvec-

tor, vertex data layout, cache partitioning, NUMA partitioning, and kernel fusion.


A GraphIt program has two sections: the algorithm specification and the scheduling op-

timizations. The algorithm API is divided into set operators, vertexset operators, and edgeset

operators. Low level details such as synchronization and deduplication are abstracted away and

dealt with by the compiler. Lines of the algorithm specification can be annotated with labels (of the

form #label#), which are later referenced in the scheduling section where specific optimizations

can be applied.

4.3 Linear Algebra-based Graph Systems

Many algorithms on graphs are readily described in terms of matrix operations using the

adjacency matrix, A, of the graph. For instance, to find the number of triangles in the graph, it

is Trace(A3)/6 (assuming an undirected graph with no self-loops). Each i, j element in A3 is the

number paths of length three going from i to j. Elements along the diagonal describe three-paths

going from a node back to itself, i.e. triangles. Since each node reports its participation in the

triangle, and does so twice (in each direction), we divide by six to get the total number of unique

triangles. There are many other examples of graph algorithms that can be described as operations

on the adjacency matrix [70, 114].
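A small worked example of the Trace(A^3)/6 identity, written as a C++ sketch with a dense adjacency matrix; it is meant only to illustrate the linear-algebra view of a graph algorithm, not any of the libraries discussed below.

#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Dense matrix multiply, sufficient for a small adjacency matrix.
Matrix multiply(const Matrix& a, const Matrix& b) {
  const int n = static_cast<int>(a.size());
  Matrix c(n, std::vector<int>(n, 0));
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
      for (int j = 0; j < n; ++j)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}

int main() {
  // Undirected graph on 4 vertices with one triangle (0-1-2) plus the edge 2-3.
  Matrix a = {{0, 1, 1, 0},
              {1, 0, 1, 0},
              {1, 1, 0, 1},
              {0, 0, 1, 0}};
  Matrix a3 = multiply(multiply(a, a), a);
  int trace = 0;
  for (int i = 0; i < static_cast<int>(a3.size()); ++i) trace += a3[i][i];
  // Each triangle contributes six closed three-walks (three starting vertices,
  // two directions), so Trace(A^3)/6 counts unique triangles.
  int triangles = trace / 6;
  return triangles == 1 ? 0 : 1;
}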

While many graph algorithms can be expressed as operations on sparse matrices, it is unclear

if that abstraction is best in terms of programmer productivity or parallel performance. Satish

et al. [174] begin to look at these issues, though mostly addressing performance issues with

Combinatorial BLAS [32] and other approaches. For our work, it is unclear how to account for

temporal properties on the edges. Multiplying an adjacency matrix with itself does not account for

the temporal constraints on the edges. Also, the question remains how streaming operations would

be implemented. On what version of the adjacency matrix would the operations be performed?

Frameworks that express graph algorithms as linear algebra include the following:

• Combinatorial BLAS [32]

• Knowledge Discovery Toolbox [128]


• Pegasus [110]

• GraphMat [192]

We discuss these approaches below.

4.3.1 Combinatorial BLAS

Buluc and Gilbert [32] present one of the first libraries for performing graph computations in

terms of linear algebra primitives. They name it Combinatorial BLAS, combinatorial because

combinatorics and graph theory are strongly related, and BLAS, after the canonical Basic Linear

Algebra Subroutines, a well known library of successful primitives packages. Since adjacency

matrices are the primary target of the library, distributed sparse matrices are first class citizens in

Combinatorial BLAS. However, there are other functions that use dense matrices, and dense and

sparse vectors. Similar to BLAS, they define a set of primitive functions. For example, SpGEMM is

a function that multiplies two sparse matrices. In their paper introducing Combinatorial BLAS, the

authors provide details for two algorithms implemented with the new API, betweenness centrality

and Markov clustering.

4.3.2 Knowledge Discovery Toolbox

Another notable work using a linear algebra-type approach includes the Knowledge Discovery

Toolbox (KDT) [128]. KDT uses Combinatorial BLAS as computational kernels but presents

Python as the API.

4.3.3 Pegasus

Pegasus [110] implements a sparse matrix-vector multiplication (GIM-V or generalized it-

erated matrix-vector multiplication) on top of Hadoop and describes how this single primitive is

sufficient for algorithms such as PageRank, random walk with restart, diameter estimation, and

connected components.


4.3.4 GraphMat

GraphMat [192] provides a way of expressing vertex-centric programs which are then mapped

to matrix-style computations. They argue that this approach garners the productivity of vertex-

centric programming while retaining the performance of optimized sparse matrix operations. They

show speedups between 1.1× and 7× over other graph frameworks.

4.4 Graph APIs

In this section we discuss other APIs that have been developed for processing graph algo-

rithms:

• Parallel Boost Graph Library (PBGL) [81]

• The MultiThreaded Graph Library (MTGL) [22]

• Small-world Network Analysis and Partitioning (SNAP) [18]

• STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Rep-

resentation [17]

• Grappa [149, 148]

The Parallel Boost Graph Library (PBGL) [81] builds upon the Boost Graph Library (BGL)

[182], using concepts such as generic programming and the visitor pattern. Like the Standard

Template Library, BGL and PBGL extensively use templates for class types so that the same

algorithm can be applied to different graph representations. For example, you could have an

adjacency-matrix representation of a graph or a compressed sparse row graph but breadth-first

search would be written generically enough to apply to both representations.

Another concept important to BGL and PBGL is the visitor pattern. Visitors allow pro-

grammers to modify algorithms to extend functionality in a fine-grained way. For example, the

examine vertex function of the BFS visitor could be used to increment a counter on each vertex in


order to count the number of paths going through each vertex in a betweenness calculation. In this

way, it is similar to the BFS traversals of Green-Marl and Ligra, just expressed in a different way.

The MultiThreaded Graph Library [22] was directly inspired by PBGL, using many of the

same concepts such as generic templates and visitors. However, instead of a distributed setting as

with PBGL, MTGL focuses on shared-memory supercomputers such as the Cray XMT.

Another set of shared-memory approaches are SNAP [18] and STINGER [17]. Both SNAP

and STINGER come from Georgia Tech. SNAP is a library for analyzing primarily static graphs (with some support for dynamic graphs) on shared memory machines. The primary building blocks are called graph kernels, which include BFS, connected components, and minimum spanning tree.

It is unclear what constitutes a kernel, and what is the set of expressible algorithms given a set

of kernels. While SNAP was more focused on static graph analysis, STINGER is for dynamically

changing graphs. Again, STINGER is limited to shared memory systems. With STINGER, they

developed several dynamic graph algorithms for connected components [136], clustering coefficient

[55], and betweenness centrality [80].

Grappa [149, 148] is an effort to create the illusion of a shared memory machine on a dis-

tributed cluster. They create a latency tolerant software layer on top of commodity hardware to

support data-intensive applications such as graph problems and mitigate latency problems associ-

ated with small messages, poor locality, and the need for fine-grained synchronization. In essence,

they want to recreate the nice features of the Tera MTA without the exorbitant cost for specialized

hardware [8]. In [148] they show how they can implement various frameworks such as GraphLab

and Spark and achieve greater performance than obtainable with native algorithms.

4.5 GPU Graph Systems

Programming graph computations on GPUs is particularly challenging due to the usually

random memory access patterns of many graph algorithms. However, several optimizations can be

employed to enable GPU-facilitated computations. Totem [68] and GStream [176] are some of the

first works to process graphs with GPUs where the size of the graph is larger than what is avail-


able on the device. IrGL [159] is an intermediate-level program representation that allows graph

algorithms to be described that are then translated into CUDA code with a compiler. The com-

piler automatically utilizes three types of optimizations for data-driven problems to allow efficient

utilization of the GPU. Groute [19] provides an asynchronous programming model that can run on

multi-GPU systems. Pan et al. also provide a framework for graph computations on multi-GPU

systems, published at the same time as Groute. While an important body of work, to date GPU

graph systems focus on static graph analysis and do not consider streaming.

4.6 SPARQL

SPARQL [66] (a recursive acronym for SPARQL Protocol and RDF Query Language) is a

SQL-like declarative language for querying the semantic web. It is used to express subgraphs of

interest in an RDF (Resource Description Framework) dataset. RDF is described in terms of triples:

a subject, a predicate, and an object. This can be thought of as describing a graph, where the

subject and object are nodes in the graph and the predicate is a labeled directed edge between the

two. SPARQL influenced our syntax in how temporal subgraphs are expressed in SAL.

Below is an example SPARQL query, query number 2 from the popular Lehigh University

Benchmark (LUBM) [86]. It finds graduate students who are a member of a department in the same

university where they received their undergraduate degree. Figure 4.1 gives a visual representation

of the query graph.

Listing 4.2: LUBM Query 2

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>

SELECT ?X, ?Y, ?Z

WHERE

{?X rdf:type ub:GraduateStudent .

?Y rdf:type ub:University .

?Z rdf:type ub:Department .

?X ub:memberOf ?Z .

?Z ub:subOrganizationOf ?Y .

?X ub:undergraduateDegreeFrom ?Y}


Figure 4.1: SPARQL Query number two from the Lehigh University Benchmark.

The main building block of a SPARQL query is the basic graph pattern (BGP). The where

clause in LUBM query 2 is a BGP with six triple patterns. At its core is a triangle with additional

edges defining the vertex types. In general, SPARQL is well suited for specifying subgraphs of

interest, though it does have problems when the query graph is automorphic as this will produce

duplicate answers [77]. Besides BGPs, SPARQL also has other constructs such as unions,

negations, optionals, aggregations, and property paths.

The SPARQL standard does not address temporal aspects of the graph. There has been some

work in extending RDF and SPARQL to encompass temporal notions. In Gutierrez et al. [89, 88]

they present an extension of RDF to include temporal intervals. They provide a sketch of a query

language. Each edge has a temporal interval associated with it, [t1, t2], or just [T] for short. You

can then use those intervals in queries. Using one of their examples, if you want to find students

that have taken a Masters-level course sometime since 2000, you would express the query

(?X, takes, ?C) : [?T], (?C, type, Master) : [?T], 2000 ≤ ?T, ?T ≤ Now

In Tappolet and Bernstein [197], they present a query language τ-SPARQL for temporal


queries which is directly translatable to standard SPARQL queries. However, τ-SPARQL is not

sufficiently expressive for our needs. It is suited to finding edges valid at a given time, or returning

intervals of time when edges were valid, but not to expressing constraints on the temporal ordering

of edges relative to each other in a subgraph query.

Much of the semantic web research has focused on relatively static collections of RDF triples

with querying and reasoning implemented with that assumption. Another avenue has explored

streaming environments, where RDF triples are generated continuously. Several benchmarks are

commonly used to compare streaming RDF frameworks:

• LSBench [121] is a social network-based data generator with posts, comments, likes, follows, etc., that you would typically find with social network activity.

• SRBench [215] has its basis in the Linked Open Data cloud [137] with real-world sensor

data, LinkedSensorData [117], which is published U.S. weather data.

• CityBench [7] focuses on smart-city, IoT-enabled applications.

Approaches that specifically target streaming RDF and continuous SPARQL queries include C-

SPARQL [62], Wukong [179], Wukong+S [216], EP-SPARQL [9], SPARQLStream [35], and CQELS

[120]. While the streaming aspect of this problem and the task of finding subgraphs overlap with

our goals, there are several aspects not addressed which our work targets. The temporal ordering

of edges is ignored; they find subgraphs within windows of time, but do not specify the relative

times of when edges occur, which is an important dimension when describing cyber attacks. Also,

they do not compute attributes on the streaming data, which can then be used as features for a

machine learning solution, or as a filtering mechanism for subgraph selection. RDF triples allow for

a rich set of edge types; however, for cyber problems, we have edges with multiple attributes, which

are awkward to express in RDF, where the mechanism is reification [204]. Finally, with the exception

of Wukong+S [216] that has scaling numbers to 8 nodes, none of the streaming RDF/SPARQL

approaches are distributed.


4.7 Machine Learning

In terms of machine learning, there are several popular frameworks:

• Scikit-learn [163] is a Python code base with a range of machine learning approaches. It is

a useful tool for machine learning and data mining research and projects, but it is largely

limited to computations on a single node without much support for parallelization.

• Apache Mahout [129] focuses on distributed computations using a linear algebra framework

and a mathematically focused Scala DSL.

• Deep learning frameworks: With the popularity of deep learning, several software frame-

works have been developed that allow the programmer to specify neural network archi-

tectures and modify hyperparameters. Some frameworks of note include Tensorflow [4],

Pytorch [162], and Keras [40]. All of these frameworks largely concentrate on employing

neural networks on a single node with one or more GPUs available.

All of these approaches are valuable, allowing common machine learning tasks to be utilized and

explored with well-designed APIs; however, none address the streaming aspect of this work.

4.8 Data Stream Management Systems

Data Stream Management Systems (DSMS) are designed to specifically tackle streaming

data. One of the first efforts to address data management for streaming data is the Aurora project

[36, 3]. In the Aurora system model, a variety of data sources feed into a directed acyclic graph

(DAG) of processing operations. Operators can be one of filter, map (a generalized projection

operator), union (merging two or more streams into a single stream), bsort (buffered sort, an

approximate sort operator), aggregate (applying a function to a window of tuples), join (joins

two streams together over a specified time window), resample (similar to join but with a generalized

time window), and split (implicit operation deduced from the DAG). The DAG feeds into three

types of endpoints: continuous queries, views, and ad-hoc queries.


While the Aurora project focused on stream management within a single node, Medusa [39]

expanded that to handle a distributed setting, covering both the situation where the computational

resources are under a single administrative domain (intra-participant distribution) and the more

general case when services are provided by multiple autonomous participants (inter-participant

distribution). For intra-participant distributions, Aurora is deployed on multiple nodes, and lo-

cal decisions and pair-wise interactions between nodes are used to reconfigure the workload when

resource-usage becomes unbalanced. For inter-participant distributions, Medusa uses the economic

principles of an agoric system [142] to adjust the load management. In such a system participants

provide services such as data sources or analytical capabilities, and participants form contracts

where they pay for those services. The idea is that market forces will drive the system to an

efficient balance of computational resources.

The Borealis project [2] superseded the Aurora and Medusa projects, combining the two foci

into one. Borealis adds the ability to revise queries when corrections to data streams are provided.

It also allows dynamic query modification, e.g. switching to less expensive operations when the

system becomes overloaded. Finally, Borealis adds a framework for optimizing various QoS metrics

in a heterogeneous network comprised of servers and sensors. This family of work, Aurora, Medusa,

and Borealis, address an array of requirements for a streaming system, but focus largely on porting

SQL-like operations to a streaming environment, and they do not support streaming operators such

as topk as first-class citizens.

IBM developed a streaming language, Streams Processing Language (SPL) [96], and its pre-

decessor, SPADE [67]. SPL is the main avenue for programming applications in the IBM Streams

platform. The main abstraction of SPL is a graph where the edges are streams of data (tuples),

and the nodes are operators. SPL is very flexible, allowing for new operators to be defined using

language concepts from C, Java, SQL, Python, and ML. Comparing SAL to SPL, SAL is much

more domain specific. SAL comes with native support for streaming and semi-streaming operators.

SPL does not support these types of operators as first-class citizens, but given the expressiveness of

the language, they could conceivably be programmed into a custom operator.


There are many recent frameworks that provide streaming APIs:

• Apache Storm [190]

• Apache Spark [188]

• Apache Flink [12]

• Apache Heron [95]

• Apache Samza [172]

• Apache Beam [11]

• Apache Gearpump [13]

• Apache Kafka [64]

• Apache Apex [10]

• Google Cloud Dataflow [78]

Many of these could potentially be the basis of a backend implementation for SAL. Zhang et al.

[216] report success in adopting Storm as the backend of a streaming C-SPARQL engine [62], while

Spark has significantly longer latencies. We initially targeted Spark streaming as an implementation

backend for SAL but found that the overhead of keeping state from one discretized portion of the

stream to the next was prohibitive. In Chapter 7 we report on performance using Apache Flink for

streaming temporal subgraph matching. We found mixed results comparing SAM and Flink, largely

dependent on the frequency of occurring subgraphs. Regardless, our main contribution in this work

is the domain specific language of SAL. SAM was developed as a prototype implementation, and

it is certainly possible to develop another implementation using these frameworks as a backend.

4.9 Network Telemetry

In our work with cyber data, we focus on netflow data. Netflow data is a summary of

packet data going through a switch. Packets are organized by a 5-tuple key: source IP, destination


IP, source port, destination port, and protocol. Statistics are gathered around related packets

that share the same 5-tuple key. Previously, the netflow aggregations were either performed using

custom silicon (fixed-function ASICs) on high-end routers, or the packet traffic was mirrored to

remote collectors for analysis [186]. However, recently there is a trend toward programmable packet

forwarding, opening the door for hybrid computations between the telemetry plane (the switch)

and the analysis plane. P4 [27], short for Programming Protocol-independent Packet Processors, is

a domain specific language to express computations on network traffic data. Switches such as the

Barefoot Tofino [151] are P4-enabled, providing the performance of fixed-function ASICs but with

the benefit of being reconfigurable.

Recent work has demonstrated the benefits of hybrid telemetry/analysis plane computations.

Sonata [87] defines a declarative language for expressing network traffic computations, and then

automatically partitions the computations between the data and analysis planes. Flowradar [125]

strikes a balance between cheap switches with a paucity of resources and a remote collector with

ample resources to enable netflow calculations without sampling. Another work, *Flow [187],

has mechanisms for concurrent and dynamic queries by shifting aggregation functions from the

telemetry plane to the analysis plane.

This work in network telemetry is a natural path forward for SAL. Our current work focuses

on network traffic data after netflow calculations have already been performed. In essence, we are

operating entirely in the analysis plane. However, opening the door for custom calculations at the

packet level creates opportunities for analysts to develop new features for SAL pipelines. Also, we

can create a unifying approach that can deploy SAL programs not only to the analysis plane, but

some of the computations can be moved up into the telemetry plane. Finally, most of the current

work in network telemetry focuses on data coming from one switch. Combining data from multiple

switches is largely unexplored. Our work with SAL and SAM has started in this direction. In

Chapters 6 and 7 we show coordination of up to 128 nodes, each acting as a separate sensor.

Chapter 5

Implementation

In this chapter we discuss how we go from SAL code to an implementation that runs in

parallel across a cluster. To translate SAL code, we used the Scala Parser Combinator Library

[175]. This allowed us to express the grammar of SAL and map those elements to C++ code

that utilizes a prototype parallel library that we wrote to execute SAL programs. We call this

library the Streaming Analytics Machine, or SAM, which is over 19,000 lines of code. Table

5.1 below shows the difference between directly writing C++ code utilizing SAM, or writing the

same program in SAL. Writing in SAL uses about 10-25 times fewer lines of code.

This chapter is divided up into three sections. We first talk about how data is partitioned

across the cluster via SAM in Section 5.1. Then we discuss vertex-centric computations such as

feature creation in Section 5.2. Finally we discuss the data structures and algorithm used for

subgraph matching in Section 5.3.

                              C++   SAL
Disclosure                    520    20
Machine Learning Pipeline     482    34
Temporal Triangle Query       238    13

Table 5.1: This table shows the conciseness of the SAL language. For the Disclosure pipeline (discussed in Section 2.1), the machine learning pipeline (discussed in Section 6.1), and the temporal triangle query (discussed in Chapter 7), the amount of code needed is reduced by 10-25 times.


Figure 5.1: Architecture of the system: data comes in over a socket layer and is then distributed via ZeroMQ.

5.1 Partitioning the Data

SAM is architected so that each node in the cluster receives tuple data (see Figure 5.1). For

the prototype, the only ingest method is a simple socket layer. In maturing SAM, other options such

as Kafka [65] would be an obvious alternative. We then use ZeroMQ [6] to distribute the netflows

across the cluster. The pattern used to distribute the data occurs more than once within SAM,

and is encapsulated into a ZeroMQ Push Pull Communicator object, which we will refer to as

Communicator. The Communicator uses the push pull paradigm of ZeroMQ, where producers

submit data to push sockets and consumers collect from corresponding pull sockets. In general,

push sockets bind to an address and the pull sockets connect to an address, allowing multiple pull

sockets to collect messages from the same address.

Behind the scenes, ZeroMQ opens multiple ports in the dynamic range (49152 to 65535)

for each ZMQ socket that is opened logically in the code. Nevertheless, for some uses of the

Communicator we found that contention using the ZMQ sockets created significant performance

bottlenecks. As such, we created the option to specify multiple push sockets per node, and anytime

a node needs to send data to another node, it randomly selects one of the push sockets to use.


Figure 5.2: The push pull object.

Figure 5.2 gives an overview of the design. If there are n nodes in the cluster and p push sockets

per node, in total there are (n−1)p push sockets. However, for the partitioning phase, only one push

socket was needed. The need to have multiple push sockets arose because of subgraph matching,

and is discussed in more detail in Section 5.3.

When multiple push sockets were necessary, we also found that instantiating multiple threads

to collect the data increased throughput. t threads are dedicated to pulling data from the pull

sockets. Again for the partitioning phase, one thread was sufficient. In our experiments, we found

4 push sockets per node and 16 total pull threads worked best for most situations. However, this is

probably a consequence of the number of cores per node, which in this case was 20. These results

are discussed more in Section 7.1.

As discussed in Chapter 2, the data is partitioned with SAL statements of the form:

HASH <TupleField> WITH <HashFunction>;

SAM uses the specified hash function and applies it to the tuple field and sends it to the proper

node. For all of our experiments, we have the following two statements:

HASH SourceIp WITH IpHashFunction;

HASH DestIp WITH IpHashFunction;


We do this since almost all computations for cyber analysis, our initial target domain, involve

aggregating statistics and features per IP address. For each netflow that a node receives, it performs

a hash (shared across the cluster) of the source IP in the netflow mod the cluster size to determine

where to send the netflow. Similarly, the node hashes the destination IP mod the cluster size and

sends the netflow to that node. Thus for each netflow a node receives over the socket layer, it sends

the same netflow twice over ZeroMQ. Each node is then prepared to perform calculations on a per

IP basis. SAL can be extended to use other hash functions, as discussed in Chapter 2.
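As a concrete illustration, the fragment below sketches this partitioning decision. The Netflow struct, the FNV-style hash, and the sendTo callback are assumptions for illustration; SAM's actual tuple types and hash functions differ.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Sketch of the partitioning step (illustrative, not SAM's actual code).
struct Netflow {
  std::string sourceIp;
  std::string destIp;
  // ... remaining netflow fields
};

// A hash function shared by every node in the cluster (FNV-1a used here
// purely as an example).
uint64_t ipHash(const std::string& ip) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : ip) { h ^= c; h *= 1099511628211ULL; }
  return h;
}

// Each netflow is forwarded twice: once to the node owning its source IP and
// once to the node owning its destination IP, so per-IP features can then be
// computed locally with no further communication.
void partition(const Netflow& flow, std::size_t numNodes,
               const std::function<void(std::size_t, const Netflow&)>& sendTo) {
  sendTo(ipHash(flow.sourceIp) % numNodes, flow);
  sendTo(ipHash(flow.destIp) % numNodes, flow);
}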

That concludes our discussion of how tuple data is partitioned across the cluster. Once

the data is partitioned, we can then perform queries involving vertex-centric computations and

subgraph matching.

5.2 Vertex-centric Computations

The partitioning phase can be thought of as defining vertices in the graph. For cyber data,

by partitioning by source IP and destination IP each node in the cluster has access to all the

tuples/edges related to a particular IP. Thus we can easily calculate statistics regarding each IP

with no further communication within the cluster for these downstream calculations.

There are three different interfaces (defined as abstract or base classes) that are key to vertex-

centric computations:

• Data Source: These classes are how the cluster receives data. As mentioned before, we im-

plemented one avenue to ingest data through socket communications called ReadSocket.

ReadSocket implements the AbstractDataSource class, which has two methods: con-

nect and receive. For ReadSocket, connect opens the socket for communication while

receive starts a loop to ingest the data. This data is then made available to Consumers

through the Producer interface, discussed next.

• Producers: These objects produce tuples that can be consumed by other classes called

Consumers. The important methods are registerConsumer, which allows a Consumer


to express interest in the output of a Producer, and parallelFeed. Pseudocode for the

parallelFeed method is shown in Algorithm 1. It emits tuples to consumers in parallel.

• Consumers: Consumers implement the AbstractConsumer class that has the method

consume(tuple). Producers call this method to supply the consumers with tuples to

ingest.

Algorithm 1 Producer: Feeding Tuples in Parallel

1: function Producer.parallelFeed(edge)
2:   mutex.lock()
3:   queue[numItems] ← edge
4:   ++numItems
5:   if numItems ≥ queueLength then
6:     for all c ∈ consumers do          ▷ Iterate over all consumers
7:       for all e ∈ queue do            ▷ Parallel loop
8:         c.consume(e)
9:       end for
10:    end for
11:  end if
12:  mutex.unlock()
13: end function
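For illustration, a minimal C++ rendering of these interfaces might look as follows. The Tuple alias, the buffer length, and the queue handling are assumptions; in SAM the inner loop over the buffered tuples is parallelized.

#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

using Tuple = std::string;   // stand-in for SAM's tuple type (assumption)

class AbstractConsumer {
public:
  virtual ~AbstractConsumer() = default;
  virtual void consume(const Tuple& tuple) = 0;   // called by producers
};

class Producer {
public:
  // Consumers register interest in this producer's output.
  void registerConsumer(AbstractConsumer* c) { consumers.push_back(c); }

  // Buffer tuples; once the buffer fills, feed every buffered tuple to every
  // registered consumer (the inner loop runs in parallel in the real system).
  void parallelFeed(const Tuple& edge) {
    std::lock_guard<std::mutex> guard(mtx);
    queue.push_back(edge);
    if (queue.size() >= queueLength) {
      for (AbstractConsumer* c : consumers)
        for (const Tuple& e : queue)
          c->consume(e);
      queue.clear();   // the pseudocode above leaves this reset implicit
    }
  }

private:
  std::vector<AbstractConsumer*> consumers;
  std::vector<Tuple> queue;
  std::size_t queueLength = 1024;   // assumed buffer size
  std::mutex mtx;
};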

Many SAL statements and operators map to consumer and/or producer interfaces and

to an underlying C++ class that provides the specific functionality. Table 5.2 gives an overview.

C++ Name             Consumer   Producer   Feature Creator   Maps To
ReadSocket                          x                         FlowStream
ZeroMQPushPull           x          x                         N/A
Filter                   x          x                         FILTER
Transform                x          x                         TRANSFORM
CollapsedConsumer        x                       x            COLLAPSE
Project                  x                                    COLLAPSE
EHSum                    x                       x            sum
EHAve                    x                       x            ave
EHVariance               x                       x            var
TopK                     x                       x            topk

Table 5.2: A listing of the major classes in SAM and how they map to language features in SAL. We also make the distinction between consumers, producers, and feature creators. Producers all share a parallelFeed method that sends the data to registered consumers, which is how parallelization is achieved. Each of the consumers receives data from producers. If a consumer is also a feature creator, it performs a computation on the received data, generating a feature, which is then added to a thread-safe feature map.


Often consumers also generate features. Each node creates a thread-safe feature map to

collect features that are generated. The function signature is as follows:

updateInsert(string key, string featureName, Feature f)

The key is generated by the key fields specified in the STREAM BY statement. The fea-

tureName comes from the identifier specified in the FOREACH GENERATE statement. For

example, in the below SAL snippet, the key is created by using a string hash function on the

concatenation of the source IP and destination IP. The identifier is Feature1 as specified in the

FOREACH GENERATE statement.

DestSrc = STREAM Netflows BY SourceIp, DestIp;

Feature1 = FOREACH DestSrc GENERATE ave(SrcTotalBytes);

The scheme for ensuring thread-safety in the feature map comes from Goodman et al. [75].
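A simplified sketch of such a feature map is given below. The Feature type, the bin count, and the composite key layout are assumptions; SAM's real implementation supports richer feature types and follows the locking scheme of Goodman et al. [75].

#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

using Feature = double;   // placeholder for SAM's feature types (assumption)

// Thread-safe feature map with per-bin locking (illustrative sketch).
class FeatureMap {
public:
  explicit FeatureMap(std::size_t numBins = 1024)
      : bins(numBins), mutexes(numBins) {}

  // key: hash of the fields named in the STREAM BY statement
  // featureName: identifier named in the FOREACH GENERATE statement
  void updateInsert(const std::string& key, const std::string& featureName,
                    Feature f) {
    std::size_t b = std::hash<std::string>{}(key) % bins.size();
    std::lock_guard<std::mutex> guard(mutexes[b]);
    bins[b][key + "|" + featureName] = f;
  }

private:
  std::vector<std::unordered_map<std::string, Feature>> bins;
  std::vector<std::mutex> mutexes;
};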

The Project class provides the functionality of the COLLAPSE statement. The Project

class creates features similar to the Feature Creator classes, but they cannot be accessed directly

through the API. The features are Map features, in other words the sets Ml using the terminology

from Section 2.1. These Map features are added to the same feature map used for all other

generated features. These Map features can then be used by the CollapsedConsumer class,

which can be specified to calculate statistics on the map for each kept key in the stream.

5.3 Subgraph Matching

In this section we focus on the data structures, processes, and algorithms that enable a

subgraph query. For discussion, we repeat the Watering Hole Attack query previously given in

Listing 2.1.


Listing 5.1: Subgraph for Watering Hole Attack

1 SUBGRAPH ON Netflows WITH source(SourceIp)
2   AND target(DestIp)
3 {
4   target e1 bait;
5   target e2 controller;
6   start(e2) > end(e1);
7   start(e2) - end(e1) < 10;
8   bait in Top1000;
9   controller not in Top1000;
10 }

Subgraph queries are translated from the SPARQL-like syntax to a sequence of C++ ob-

jects, namely of type EdgeExpression, EdgeConstraintExpression, and VertexConstrain-

tExpression. In the Watering Hole Attack query, the first two topological statements become

EdgeExpressions, in essence, assigning labels to vertices and edges. The next two lines define tem-

poral constraints on the edges, which become EdgeConstraintExpressions. The last two lines are examples of VertexConstraintExpressions. SAM takes the EdgeExpressions and EdgeConstraintExpressions and develops an order of execution based upon the temporal ordering of the edges. In this

example, e1 occurs before e2.
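To give a feel for this translation, a schematic rendering of these expression objects for the Watering Hole query follows. The struct layouts and the string encoding of the constraints are assumptions for illustration; SAM's actual classes carry parsed operands rather than raw strings.

#include <string>
#include <vector>

// Illustrative sketch of the query representation (not SAM's actual classes).
struct EdgeExpression {               // e.g. "target e1 bait"
  std::string source, edge, target;
};
struct EdgeConstraintExpression {     // e.g. "start(e2) - end(e1) < 10"
  std::string expression;
};
struct VertexConstraintExpression {   // e.g. "bait in Top1000"
  std::string vertex, op, featureName;
};

struct SubgraphQuery {
  std::vector<EdgeExpression> topology;
  std::vector<EdgeConstraintExpression> temporalConstraints;
  std::vector<VertexConstraintExpression> vertexConstraints;
};

// The Watering Hole query of Listing 5.1 expressed with these objects.
SubgraphQuery wateringHole() {
  return SubgraphQuery{
      {{"target", "e1", "bait"}, {"target", "e2", "controller"}},
      {{"start(e2) > end(e1)"}, {"start(e2) - end(e1) < 10"}},
      {{"bait", "in", "Top1000"}, {"controller", "not in", "Top1000"}}};
}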

The approach we have taken assumes there is a strict temporal ordering of the edges. For

our intended use case of cyber security, this is generally an acceptable assumption. Most queries

have causality implicit in the question. For example, the Watering Hole Attack is a sequence of

actions, with the initial cause the malvertisement exploiting a vulnerability in a target system,

which then causes other actions, all of which are temporally ordered. As another example, consider botnets with a command and control server: commands are issued from the command and control server, and these produce subsequent actions by the bots, again yielding a temporal ordering of

the interactions. Our argument is that requiring strict temporal ordering of edges is not a huge


constraint, or at the very least covers many relevant questions an analyst might ask of cyber data.

We leave as future work an implementation that can address queries that do not have a temporal

ordering on the edges.

It should be noted that while SAM, the implementation, currently requires temporal ordering

of edges, SAL, the language, is not so constrained. SAM can certainly be expanded to include other

approaches that do not require temporal ordering of edges.

Once the subgraph query has been finalized with the order of edges determined, SAM is now

set to find matching subgraphs within the stream of data. Before discussing the actual algorithm,

we will first discuss the data structures used by SAM.

5.3.1 Data Structures

We make use of a number of thread-safe custom data structures, namely

• Compressed Sparse Row (CSR)

• Compressed Sparse Column (CSC)

• ZeroMQ Push Pull Communicator (Communicator)

• Subgraph Query Result Map (ResultMap)

• Edge Request Map (RequestMap)

• Graph Store (GraphStore)

In parentheses is how we will refer to them in the remainder of this thesis.

5.3.1.1 Compressed Sparse Row

CSR is a common way of representing sparse matrices, with the earliest known description in a

1967 article [32]. The data structure is useful for representing sparse graphs since graphs can be

represented with a matrix, the rows signifying source vertices and the columns destination vertices.


Instead of a |V| × |V| matrix, where |V| is the number of vertices, CSRs have a space requirement of O(|E|), where |E| is the number of edges. In the case of sparse graphs, |E| ≪ |V| × |V|.

Figure 5.3 presents the traditional CSR data structure. There is an array of size |V | where

each element of the array points to a list of edges. For array element v, the list contains all the

edges that have v as the source. With this index structure, we can easily find all the edges that

have a particular vertex v as the source. However, a complete scan of the data structure is required

to find all edges that have v as a destination, leading to the need for the compressed sparse column

data structure, described in the next section.

CSR’s as presented work well with static data where the number of vertices and edges remain

constant. However, our work requires edges that expire, and also for the possibility that vertices

may come and go. To handle this situation, we create an array of lists of lists of edges, with each

array element protected with a mutex. Figure 5.4 shows the overall structure. Instead of an array

of size |V |, we create an array of bins that are accessed via a hash function. When an edge is

consumed by SAM, it is added to this CSR data structure. The source of the edge is hashed, and

if the source has never been seen, a list is added to the first level. Then the edge is added to the

list. If the vertex has been seen before, the edge is added to the existing edge list. Additionally,

SAM keeps track of the longest duration of a registered query, and any edges that are older than

the current time minus the longest duration are deleted. Checks for old edges are done anytime a

new edge is added to an existing edge list.

The CSR has mutexes protecting each bin, and any access to a bin must first gain control

of the mutex. This will be important later when we describe the algorithm for finding subgraph

matches. Regarding performance, as each bin has its own mutex, the chance of collision and thread

contention is very low. In metrics gathering, waiting for mutex locks of the CSR data structure

has never been a significant factor in any of our experiments.
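The following sketch captures the shape of this modified CSR; the Edge fields, the bin count, and the details of the expiry policy are assumptions for illustration.

#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <vector>

// Illustrative edge type; SAM's edges carry the full netflow (assumption).
struct Edge {
  std::string source, target;
  double time = 0.0;    // timestamp in seconds
};

// Modified CSR: hashed bins of per-vertex edge lists, one mutex per bin.
class CompressedSparseRow {
public:
  CompressedSparseRow(std::size_t numBins, double maxQueryDuration)
      : bins(numBins), mutexes(numBins), window(maxQueryDuration) {}

  void addEdge(const Edge& e) {
    std::size_t b = std::hash<std::string>{}(e.source) % bins.size();
    std::lock_guard<std::mutex> guard(mutexes[b]);
    for (auto& vertexList : bins[b]) {
      if (!vertexList.empty() && vertexList.front().source == e.source) {
        vertexList.push_back(e);
        expireOld(vertexList, e.time);   // drop edges older than the window
        return;
      }
    }
    bins[b].push_back({e});              // first edge seen for this source
  }

private:
  void expireOld(std::list<Edge>& l, double now) {
    while (!l.empty() && l.front().time < now - window) l.pop_front();
  }

  std::vector<std::list<std::list<Edge>>> bins;  // bin -> per-source edge lists
  std::vector<std::mutex> mutexes;
  double window;   // longest duration over all registered queries
};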


Figure 5.3: Traditional Compressed Sparse Row (CSR) data structure.

Figure 5.4: Modified Compressed Sparse Row (CSR) data structure.


5.3.1.2 Compressed Sparse Column

The Compressed Sparse Column (CSC) is exactly the same as the CSR, the only difference

is we index on the target vertex instead of the source vertex. Both the CSR and CSC are needed,

depending upon the query invoked. We may need to find an edge with a certain source (CSR), or

an edge with a certain target (CSC).

5.3.1.3 ZeroMQ PushPull Communicator

This object was discussed previously in Section 5.1. Subgraph matching motivated the modification to use multiple push sockets and multiple pull threads. There are two different communicators used: a Request Communicator and an Edge Communicator. The Request Communicator sends

requests to other nodes for edges fitting a particular pattern. The Edge Communicator sends

matching edges back. It is the Edge Communicator that benefits most from having multiple push

sockets and threads. In our experiments, we found 4 push sockets per node and 16 total pull threads

worked best for the Edge Communicator in most situations, which makes sense because each node had 20 cores. These results are discussed more in Section 7.1.

The Communicator allows for callback functions to be registered. These callback functions are

invoked whenever a pull socket receives data. We use callback functions to enable proper handling

of requests by the Request Communicator and edges with the Edge Communicator. These are

discussed in more detail in Sections 5.3.2.2 and 5.3.2.3.

5.3.1.4 Subgraph Query Result Map

The ResultMap is used to store intermediate query results. There are two central methods,

add and process. The method add takes as input an intermediate query result, creates additional

query results based on the local CSR and CSC, and then creates a list of edge requests for other

edges on other nodes that are needed to complete queries. The method process is similar, except

in this case it takes an edge that has been received by the node and it checks to see if that edge

satisfies any existing intermediate subgraph queries. The details of processing intermediate results


and edges used will be discussed in greater depth in Section 5.3.2.

The ResultMap uses a hash structure to store results. It is an array of vectors of intermediate

query results. The array is of fixed size, so it must be set large enough to handle the number of results generated during the maximum time window specified across all registered queries; otherwise, excessive time is spent linearly probing the vectors of intermediate results. There are mutexes

protecting each access to each array slot. For each intermediate result created, it is indexed by the

source, the target, or a combination of the source and target, depending on what the query needs

to satisfy the next edge. The index is the result of a hash function, and that index is used to store

the intermediate result within the array data structure.

5.3.1.5 Edge Request Map

The RequestMap is the object that receives requests for edges that match certain criteria,

and sends out edges to the requestors whenever an edge is found that matches the criteria. There

are two important methods to mention: addRequest(EdgeRequest) and process(Edge). The

addRequest method takes an EdgeRequest object and stores it. The process method takes an

Edge and checks to see if there are any registered edge requests that match the given edge.

The main data structure is again a hash table. The approach to hashing is similar to the

ResultMap, as it depends on whether the source is defined in the request data structure, the target

is defined, or both. The index is then used to find a bin within the array of lists of EdgeRequest objects.

Similar to the ResultMap, whenever process is called, edge requests that have expired are removed

from the list.
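A compressed sketch of the request map follows. It reuses the Edge struct from the CSR sketch above, and the field names, bin count, and send callback are illustrative assumptions rather than SAM's actual interface.

#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <vector>

// Illustrative sketch of the edge request map (not SAM's actual class).
struct EdgeRequest {
  std::string source;        // empty if the source is unbound
  std::string target;        // empty if the target is unbound
  double expireTime = 0.0;   // after this time the request can be discarded
  int requestingNode = -1;   // where to send matching edges
};

class EdgeRequestMap {
public:
  EdgeRequestMap(std::size_t numBins,
                 std::function<void(int, const Edge&)> sendEdge)
      : bins(numBins), mutexes(numBins), send(std::move(sendEdge)) {}

  // Store a request under the bin determined by its bound fields.
  void addRequest(const EdgeRequest& r) {
    std::size_t b = indexOf(r.source, r.target);
    std::lock_guard<std::mutex> guard(mutexes[b]);
    bins[b].push_back(r);
  }

  // Check a new edge against the source-only, target-only, and
  // source-and-target bins, expiring stale requests along the way.
  void process(const Edge& e) {
    checkBin(indexOf(e.source, ""), e);
    checkBin(indexOf("", e.target), e);
    checkBin(indexOf(e.source, e.target), e);
  }

private:
  void checkBin(std::size_t b, const Edge& e) {
    std::lock_guard<std::mutex> guard(mutexes[b]);
    for (auto it = bins[b].begin(); it != bins[b].end();) {
      if (it->expireTime < e.time) { it = bins[b].erase(it); continue; }
      bool srcOk = it->source.empty() || it->source == e.source;
      bool trgOk = it->target.empty() || it->target == e.target;
      if (srcOk && trgOk) send(it->requestingNode, e);  // via the Edge Communicator
      ++it;
    }
  }

  std::size_t indexOf(const std::string& src, const std::string& trg) const {
    return std::hash<std::string>{}(src + "|" + trg) % bins.size();
  }

  std::vector<std::list<EdgeRequest>> bins;
  std::vector<std::mutex> mutexes;
  std::function<void(int, const Edge&)> send;
};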

5.3.1.6 Graph Store

The GraphStore object uses all the previous data structures, orchestrating them together

to perform subgraph queries. It has a pointer to the ResultMap and to the RequestMap. It

also has pointers to two Communicators, one for sending/receiving edges to/from other nodes

(Edge Communicator), and another for sending/receiving edge requests (Request Communicator).


GraphStore maintains the CSR and CSC data structures. How all these pieces work together is

described in the next section.
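Schematically, the GraphStore ties these pieces together roughly as follows; the member names and the use of shared_ptr are assumptions for illustration only.

#include <memory>

class SubgraphQueryResultMap;    // Section 5.3.1.4
class EdgeRequestMap;            // Section 5.3.1.5
class PushPullCommunicator;      // Section 5.3.1.3
class CompressedSparseRow;       // Section 5.3.1.1
class CompressedSparseColumn;    // Section 5.3.1.2
struct Edge;

// Illustrative composition only; SAM's actual GraphStore class differs.
class GraphStore {
public:
  void consume(const Edge& edge);   // described in Section 5.3.2
private:
  std::shared_ptr<SubgraphQueryResultMap> resultMap;
  std::shared_ptr<EdgeRequestMap>         requestMap;
  std::shared_ptr<PushPullCommunicator>   edgeCommunicator;     // edges to/from peers
  std::shared_ptr<PushPullCommunicator>   requestCommunicator;  // edge requests to/from peers
  std::shared_ptr<CompressedSparseRow>    csr;                  // edges indexed by source
  std::shared_ptr<CompressedSparseColumn> csc;                  // edges indexed by target
};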

5.3.2 Algorithm

Figure 5.5: An overview of the algorithm employed to detect subgraphs of interest. Asynchronous threads are launched to consume each edge. The consume threads then perform five different operations. Another set of threads pulls requests from other nodes for edges within the Request Communicator. Finally, one more set of threads pulls edges from other nodes in reply to edge requests.

The algorithm for finding matching subgraphs employs three sets of threads:

• GraphStore consume threads.

• The threads associated with the Request Communicator.

• The threads associated with the Edge Communicator.


5.3.2.1 GraphStore Consume(Edge) threads

The GraphStore object’s primary method is the consume function. GraphStore is tied into

a producer, and any edges that the producer generates are fed to the consume function. For each

consume call, an asynchronous thread is launched. We found that occasionally a call to consume

would take an exorbitant amount of time, orders of magnitude greater than the average time. These

outliers were a result of delays in the underlying ZeroMQ communication infrastructure, which proved difficult to prevent entirely. As such, we added the asynchronous approach, effectively hiding these

outliers by allowing work to continue even if one thread was blocked by ZeroMQ.
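As a minimal sketch of this pattern, assuming the GraphStore and Edge types sketched above, each edge could be handed to std::async so that a blocked consume call does not stall the rest of the pipeline. SAM's own thread management differs; this is only an illustration.

#include <future>
#include <vector>

// Launch each consume call asynchronously; a call delayed inside ZeroMQ then
// only blocks its own task. (Illustrative sketch, not SAM's actual code.)
void feedGraphStore(GraphStore& store, const std::vector<Edge>& edges) {
  std::vector<std::future<void>> pending;
  for (const Edge& e : edges)
    pending.push_back(std::async(std::launch::async,
                                 [&store, e] { store.consume(e); }));
  for (auto& f : pending) f.wait();   // join the outstanding consume tasks
}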

Each consume thread performs the following actions:

(1) Adds the edge to the CSR and CSC data structures.

(2) Processes the edge against the ResultMap, looking to see if any intermediate results can

be further developed with the new edge.

(3) Checks to see if this edge satisfies the first edge of any registered queries. If so, adds the

results to the ResultMap.

(4) Processes the edge against the RequestMap to see if there are any outstanding requests

that match with the new edge.

(5) Steps two and three can result in new edges being needed that may reside on other nodes.

These edge requests are accumulated and sent out.

For steps 2-4, we provide greater detail below:

Step 2: ResultMap.process(edge): ResultMap’s process(edge) function is outlined in

Algorithm 2 on page 79. At the top level, the process(edge) dives down into three process(edge,

indexer, checker) calls. Intermediate query results are indexed by the next edge to be matched.

The srcIndexFunc, trgIndexFunc, and bothIndexFunc are lambda functions that perform a hash

function against the appropriate field of the edge to find where relevant intermediate results are


stored within the hash table of the ResultMap. The srcCheckFunc is a lambda function that takes

as input a query result. It checks that the query result’s next edge has a bound source variable

and an unbound target variable. Similarly, trgCheckFunc returns true if the query result’s next

edge has an unbound source variable and bound target variable. The bothCheckFunc returns true

if both variables are bound. These lambda functions are passed to the process(edge, indexer,

checker) function, where the logic is the same for all three cases by using the different lambda

functions.

The process(edge, indexer, checker) function (line 12) on page 79 uses the provided

indexer to find the location (line 14) of the potentially relevant intermediate results in the hash

table (class member alr). Line 15 finds the time of the edge. We use this on line 20 to judge

whether an intermediate result has expired and can be deleted. On line 18 we iterate through all

the intermediate results found in the bin alr[index]. We use the checker lambda function to make

sure the intermediate result is of the expected form, and if so, we try to add an edge to the result.

The addEdge function (line 23) adds an edge to the intermediate result if it can, and if so, returns

a new intermediate result but leaves the pre-existing result unchanged. This allows the pre-existing

intermediate result to match with other edges later. On line 25 we add the new result to the list,

gen, of newly generated intermediate results. On line 31 we pass gen to processAgainstGraph.

The processAgainstGraph function is used to see if there are existing edges in the CSR

or CSC that can be used to further complete the queries. The overall approach is to generate

frontiers of modified results, and continue processing the frontier until no new intermediate results

are created. Line 39 sets the frontier to be the beginning of gen’s iterator. Then we push back

a null element onto gen on line 40 to mark the end of the current frontier. Line 41 continues

iterating until there are no longer any new results generated. Line 42 iterates over the frontier

until a null element is reached. Lines 45 and 46 find edges from the graph that match the provided

intermediate result on the frontier. Then on line 47 we try to add the edges to the intermediate

result, creating new results that are added to the new frontier on line 50. Eventually the updated

list of intermediate results is returned to the process(edge, indexer, checker) function. This


list is then added to the hash table on line 33. The function add_nocheck is a private method

that differs from the public add in that it doesn’t check the CSR and CSC, since that step has

already been taken. It also generates edge requests for edges that will be found on other nodes,

and that list is returned on line 35 and then in the overarching process function on line 9.

Step 3: Checking registered queries: When a GraphStore object receives a new edge,

it must check to see if that edge satisfies the first edge of any registered queries. The process is

outlined in Algorithm 3. The GraphStore.checkQueries(edge) function is straightforward. On

line 3 we iterate through all registered queries, checking to see if the edge satisfies the first edge of

the query (line 4) and, if so, adding a new intermediate result to the ResultMap (line 5).

The ResultMap’s add function begins on line 11. We may create intermediate results where

the next edge belongs to another node, so we initialize a list, requests, to store those requests

(line 12). We need to check the CSR and CSC if the new result can be extended further by local

knowledge of the graph. As such we make a call to processAgainstGraph on line 15. On line 16

we iterate through all generated intermediate results. If the result is not complete, we calculate an

index (line 18), and place it within the array of lists of results (line 20) with mutexes to protect

access. If the result is complete, we send it to the output destination (line 23). At the very end we

return a list of requests, which is then aggregated by GraphStore’s checkQueries function, and

then returned on line 8 and eventually used by step 5 to send out the requests to other nodes.

Step 4: RequestMap.process(edge) The purpose of RequestMap’s process(edge) is to

take a new edge and find all open edge requests that match that edge. The process is detailed in

Algorithm 4. The opening logic of the RequestMap.process(edge) function is similar to that of

ResultMap’s process function. There are three lambda functions for indexing into RequestMap’s

hash table structure, which is an array of lists of EdgeRequests (ale). There are also three lambda

functions for checking that an edge matches a request. The indexer and the checker for each of

the three cases are each passed to the process(edge, indexer, checker) function on lines 2 -

4. The process(edge, indexer, checker) function iterates through all of the edge requests with

the same index as the current edge (line 11), deleting all that have expired (line 13). If the edge


Algorithm 2 ResultMap: Processing a New Edge

1: function ResultMap.process(edge)
2:   requests, a list of edge requests.
3:   requests += process(edge, srcIndexFunc,
4:                       srcCheckFunc)
5:   requests += process(edge, trgIndexFunc,
6:                       trgCheckFunc)
7:   requests += process(edge, bothIndexFunc,
8:                       bothCheckFunc)
9:   return requests
10: end function
11:
12: function ResultMap.process(edge, indexer, checker)
13:   gen, a list of intermediate results generated.
14:   index ← indexer(edge)
15:   currentTime ← getTime(edge)
16:   mutexes[index].lock()
17:   requests, a list of edge requests.
18:   for all r ∈ alr[index] do
19:     if r.isExpired(currentTime) then
20:       erase(r)
21:     else
22:       if checker(r) then
23:         newR ← r.addEdge(edge)
24:         if newR not null then
25:           gen.push_back(newR)
26:         end if
27:       end if
28:     end if
29:   end for
30:   mutexes[index].unlock()
31:   gen ← processAgainstGraph(gen)
32:   for all g ∈ gen do
33:     requests += add_nocheck(g)
34:   end for
35:   return requests
36: end function
37:
38: function ResultMap.processAgainstGraph(gen)
39:   frontier ← gen.begin()
40:   gen.push_back(null)
41:   while frontier ≠ gen.end() do
42:     while frontier ≠ null do
43:       if !frontier.complete() then
44:         ++frontier
45:         foundEdges = csr.findEdges(frontier)
46:         foundEdges += csc.findEdges(frontier)
47:         for all edge ∈ foundEdges do
48:           newR ← frontier.addEdge(edge)
49:           if newR not null then
50:             gen.push_back(newR)
51:           end if
52:         end for
53:       end if
54:     end while
55:     frontier ← frontier.erase()
56:     gen.push_back(null)
57:   end while
58:   return gen
59: end function


Algorithm 3 Checking Registered Queries

1: function GraphStore.checkQueries(edge)
2:   requests, a list of edge requests.
3:   for all q ∈ queries do
4:     if q.satisfiedBy(edge) then
5:       requests += resultMap.add(Result(edge))
6:     end if
7:   end for
8:   return requests
9: end function
10:
11: function ResultMap.add(result)
12:   requests, a list of edge requests.
13:   gen, a list of intermediate results generated.
14:   gen.push_back(result)
15:   gen ← processAgainstGraph(gen)
16:   for all g ∈ gen do
17:     if !g.complete() then
18:       index = g.hash()
19:       mutexes[index].lock()
20:       alr[index].push_back(g)
21:       mutexes[index].unlock()
22:     else
23:       output(g)
24:     end if
25:   end for
26:   return requests
27: end function


matches the request, the edge is then sent to another node using the EdgeCommunicator (line 16).

Algorithm 4 RequestMap: Processing a New Edge

1: function RequestMap.process(edge)
2:   process(edge, srcIndexFunc, srcCheckFunc)
3:   process(edge, trgIndexFunc, trgCheckFunc)
4:   process(edge, bothIndexFunc, bothCheckFunc)
5: end function
6:
7: function RequestMap.process(edge, indexer, checker)
8:   index ← indexer(edge)
9:   currentTime ← getTime(edge)
10:   mutexes[index].lock()
11:   for all e ∈ ale[index] do
12:     if e.isExpired(currentTime) then
13:       erase(e)
14:     else
15:       if checker(e, edge) then
16:         edgeCommunicator.send(edge)
17:       end if
18:     end if
19:   end for
20:   mutexes[index].unlock()
21: end function

5.3.2.2 Request Communicator Threads

Another group of threads are the pull threads of the Request Communicator. These threads

pull from the ZMQ pull sockets dedicated to gathering edge requests from other nodes. One callback

is registered, the requestCallback. The requestCallback calls two methods: addRequest

and processRequestAgainstGraph. The addRequest method registers the request with the

RequestMap. The method processRequestAgainstGraph checks to see if any existing edges are

already stored locally that match the edge request.

5.3.2.3 Edge Communicator Threads

The last group of threads to discuss are pull threads of the Edge Communicator and its call-

back function, edgeCallback. Upon receiving an edge through the callback, ResultMap.process(edge)


is called, which is discussed in detail in section 5.3.2.1. The method process(edge) produces edge

requests, which are then sent via the Request Communicator, the same as Step 5 of Graph-

Store.consume(edge).

5.4 Summary

In this chapter we’ve covered the implementation of the Streaming Analytics Machine or

SAM, which is a library used for expressing both streaming machine learning pipelines and temporal

subgraph queries. SAM is over 19,000 lines of code. The Streaming Analytics Language, SAL, maps

onto SAM classes, allowing SAL to run in parallel across a cluster.

Chapter 6

Vertex Centric Computations

In this chapter we discuss performance of vertex-centric computations expressed in SAL, i.e.

programs that do not utilize subgraph matching. We discuss both the performance achieved for

a trained classifier obtained from a SAL pipeline and the scaling observed on a cluster of size 64

nodes.

6.1 Classifier Results

To validate the value of SAL in expressing pipelines and in filtering out benign data, we used

a simple program where we stream the netflows two ways, using two STREAM BY statements,

one keyed by destination IP and the other by source IP. Then we create features based on the ave

and var streaming operators for each of the fields:

(1) SrcTotalBytes

(2) DestTotalBytes

(3) DurationSeconds

(4) SrcPayloadBytes

(5) DestPayloadBytes

(6) SrcPacketCount

(7) DestPacketCount


This results in 28 total features. The entire program can be found in Appendix A.1.

We apply this pipeline to CTU-13 [63], which contains 13 botnet scenarios. CTU-13 has

some nice characteristics, including:

• Real botnet attacks: A set of virtual machines were created and infected with botnets.

• Real background traffic: Traffic from their university router was captured at the same time

as the botnet traffic. The botnet traffic was bridged into the university network.

• A variety of protocols and behaviors: The scenarios cover a range of protocols that were

used by the malware such as IRC, P2P, and HTTP. Also some scenarios sent spam, others

performed click-fraud, port scans, DDoS attacks, or Fast-Flux.

• A variety of bots: The 13 scenarios use 7 different bots.

Since the SAL program inherently encodes historical information (each generated feature

is calculated on a sliding window of data), we can’t treat each netflow independently and thus

can’t create random subsets of the data as is usual in a cross-validation approach. We instead

split each scenario into two parts. We want to keep the number of malicious netflows about

the same in each part (we need enough examples to train on), so we find the point in the sce-

nario timewise where the malicious examples are balanced. Namely we have two sets, P1 and

P2, where ∀n1 ∈ P1, ∀n2 ∈ P2, TimeSeconds(n1) < TimeSeconds(n2) and |Malicious(P1)| ≈ |Malicious(P2)|, where TimeSeconds returns the time in seconds of the netflow and Malicious returns the subset of malicious netflows in the provided set. With each

scenario split into two parts, we train on the first part and test on the second part. Then we

switch: train on the second and test on the first. We make use of a Random Forest Classifier [28]

as implemented in scikit-learn [163].
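A small sketch of this split procedure follows; the LabeledFlow fields are assumptions for illustration, and only the computation of the cut time is shown.

#include <algorithm>
#include <cstddef>
#include <vector>

struct LabeledFlow {
  double timeSeconds = 0.0;
  bool malicious = false;
};

// Find the cut time t such that netflows before t form P1 and the rest form
// P2, with the malicious examples roughly balanced between the two parts.
double findSplitTime(std::vector<LabeledFlow> flows) {
  std::sort(flows.begin(), flows.end(),
            [](const LabeledFlow& a, const LabeledFlow& b) {
              return a.timeSeconds < b.timeSeconds;
            });
  std::size_t totalMalicious = 0;
  for (const auto& f : flows) totalMalicious += f.malicious ? 1 : 0;
  std::size_t seen = 0;
  for (const auto& f : flows) {
    seen += f.malicious ? 1 : 0;
    if (2 * seen >= totalMalicious) return f.timeSeconds;
  }
  return flows.empty() ? 0.0 : flows.back().timeSeconds;
}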

After generating the features as specified by the SAL program in Appendix A.1, we performed

a greedy search over the features to down-select to the most important ones. The process is outlined

in Algorithm 5, but the general notion is to add features one at a time until no improvement is


Algorithm 5 Greedy Process for Selecting Features

features ← Features from SAL program
selected ← ∅
bestAUC ← 0
prevAUC ← −1
improvement ← True
while improvement do
  iterBestAUC ← 0
  iterBestFeature ← None
  for f ∈ features do
    candidateSet ← selected ∪ f
    candidateAUC ← Average AUC using candidateSet
    if candidateAUC > iterBestAUC then
      iterBestAUC ← candidateAUC
      iterBestFeature ← f
    end if
  end for
  if iterBestAUC > bestAUC then
    bestAUC = iterBestAUC
    features = features − iterBestFeature
    selected = selected ∪ iterBestFeature
  else
    improvement = False
  end if
end while


found in the average AUC over the 13 scenarios.

We found that the following eight features provided the best performance across the 13

scenarios. They are listed in the order they were added using the greedy approach above.

(1) Average DestPayloadBytes

(2) Variance DestPayloadBytes

(3) Average DestPacketCount

(4) Variance DestPacketCount

(5) Average SrcPayloadBytes

(6) Average SrcPacketCount

(7) Average DestTotalBytes

(8) Variance SrcTotalBytes

Using the above eight features, Figure 6.1 shows the AUC of the ROC for each of the 13

scenarios. In four of the scenarios, 1, 3, 6, and 11, training on either half was sufficient for the

other half, with AUCs between 0.926 and 0.998. Some scenarios, namely 2, 5, 8, 12, and 13, the

first half was sufficient to obtain AUCs between 0.922 and 0.992 on the second half, but the reverse

was not true. For scenario 10, training on the second half was predictive of the first half, but not

the other way. Scenario 9 had AUCs of 0.820 and 0.887, which is decent, but lackluster compared

to the other scenarios. The classifier had issues with scenarios 4 and 7. Scenario 7 did not perform

well, probably because there were only 63 malicious netflows, as can be seen in Table 6.1. We are

not sure why the classifier struggled with scenario 4.

6.2 Scaling

For our scaling experiments, we made use of Cloudlab [168], which is a set of clusters dis-

tributed across three sites, Utah, Wisconsin, and South Carolina, where researchers can provision


Figure 6.1: This graphic shows the AUC of the ROC curve for each of the CTU scenarios. We first train on the first half of the data and test on the second half, and then switch.

Table 6.1: A more detailed look at the AUC results. The second column is the AUC of the ROC after training on the first half of the data and then testing on the second half. The third column is reversed: trained on the second half and tested on the first half.

Scenario   AUC (first)   AUC (second)   Netflows   Botnet Netflows   % Bot

1 0.944 0.926 2,824,636 40,961 1.45%

2 0.930 0.887 1,808,122 20,941 1.16%

3 0.998 0.996 4,710,638 26,822 0.57%

4 0.476 0.775 1,121,076 2,580 0.23%

5 0.922 0.843 129,832 901 0.69%

6 0.972 0.976 558,919 4,630 0.83%

7 0.604 0.761 114,077 63 0.06%

8 0.992 0.886 2,954,230 6,127 0.21%

9 0.820 0.887 2,087,508 184,987 8.86%

10 0.666 0.950 1,309,791 106,352 8.12%

11 0.996 0.995 107,251 8,164 7.61%

12 0.903 0.796 325,471 2,168 0.67%

13 0.938 0.770 1,925,149 40,003 2.08%


a set of nodes to their specifications. This allowed us to create an image where our code, SAM, was

deployed and working, and then replicate that image to a cluster size of our choice. In particular

we make use of the Clemson system in South Carolina. The Clemson system has 16 cores per node,

10 Gb/s Ethernet and 256 GB of memory. We were able to allocate a cluster with 64 nodes.

Figure 6.2 shows the weak scaling results. Weak scaling examines the solution time where the

problem size scales with the number of nodes, i.e. there is a fixed problem size per node. For our

experiments, each node was fed one million netflows. Thus, for n nodes, the total problem size is n

million netflows. Each node had available the entire CTU dataset concatenated into one file. Then

we randomly selected for each node a contiguous chunk of one million netflows. For each randomly

selected chunk, we renamed the IP addresses to simulate a larger network instead of replaying the

same IP addresses, just from different time frames. For comparison we also ran another round

with completely random IP addresses, such that an IP address had very little chance of being in

multiple netflows. This helped us determine if scaling issues on the realistic data set were from

load balancing problems or some other issue. Each data point is the average of three runs.

As can be seen from Figure 6.2, the renamed IP set of runs peaks out at 61 nodes or 976

cores where we obtain a throughput of 373,000 netflows per second or 32.2 billion per day. For

the randomized IP set of runs, the scaling is noticeably steeper, reaching a peak throughput of

515,000 netflows per second (44.5 billion per day) with 63 nodes. Figure 6.3 takes a look at the

weak scaling efficiency. This is defined as E(n) = T1/Tn, where Ti is the time taken by a run with

i nodes. Efficiency degrades quicker for the renamed IP set of runs while the randomized IP runs

hover close to 0.4. We believe the difference is due to load balance issues. For the randomized IP

set of runs, the work is completely balanced between all 64 nodes. The renamed IP runs are more

realistic, akin to a power law distribution, where a small set of nodes account for most of the traffic.

In this situation, it becomes difficult to partition the work evenly across all the nodes.

As far as we are aware, we are the first to show scalable distributed network analysis on

a cluster of size 64 nodes. The most direct comparison to other works in terms of scaling is

Bumgardner and Marek [33]. In this work, they funnel netflows through a 10 node cluster running


Figure 6.2: Weak Scaling Results: For each run with n nodes, a total of n million netflows were run through the system with a million contiguous netflows per node randomly selected from the CTU dataset. There were two types of runs, with renamed IPs (to create the illusion of a larger network) or with randomized IPs. For the renamed IPs, there continues to be improved throughput until 61 nodes / 976 cores. For the randomized IPs, the scaling continues through the total 64 nodes / 1024 cores.

Figure 6.3: This shows the weak scaling efficiency. The efficiency continues to degrade for the renamed IPs, but for the randomized IPs, the efficiency hovers a little above 0.5. This is largely due to the fact that the randomized IPs almost perfectly load-balance the work, while the renamed IPs encounter load imbalances because of the power-law nature of the data.


Storm [190]. They call their approach a hybrid stream/batch system because it uses Storm to

stream netflow data into a batch system, Hadoop [90], for analysis. The stream portion is what

is most similar to our work. Over the Storm pipeline, they achieve a rate of 234,000 netflows per

second, or about 23,400 netflows per second, per node. Our pipeline with ten nodes achieved a

rate of 13,900 netflows per second, per node. However, their pipeline is much simpler. They do

not partition the netflows by IP and calculate features. The only processing they undergo during

streaming is adding subnet information to the netflows, which does not require partitioning across

the cluster.

There is other work which performs distributed network analysis, but their focus is on packet-

level data [123, 111, 133]. As such it is difficult to compare as their metrics are in terms of

GBs/second, and of course the nature of packet analysis is significantly different than netflow

analysis.

In terms of botnet identification, Botfinder [199] and Disclosure [23] report single node batch

performance numbers. Our approach on a single 16-core node achieved a rate of 30,500 netflows per

second. For Botfinder, they extract 5 features on a 12 core Intel Core i7 chip, achieving a rate of

46,300 netflows per second. Disclosure generates greater than nine features (the text is ambiguous

in places, referencing unspecified statistical features) on a 16 core Intel Xeon CPU E5630. They

specify that they run the feature extraction for one day’s worth of data in 10 hours and 53 minutes,

but they do not clearly state if that is on both data sets they chose to evaluate or just one. If it is

both, the rate is roughly 40,000 netflows per second. Both of these approaches are batch processes.

In their hybrid batch/stream system, Bumgardner and Marek [33] note a tenfold increase

in throughput going from streaming to batch processing. For the stream processing phase, they

achieved a throughput of 234,000 netflows per second (on ten nodes). They then run MapReduce

jobs on the stored data in batch, achieving rates between 1.7 million netflows per second to 2.6

million netflows per second (again on ten nodes). Since our approach is streaming, comparing to

rates achieved by batch processes may not be a fair characterization. During stream processing,

the network becomes the limiting factor while during batch processing, the network becomes less


of an issue.

6.3 Summary

In this chapter we’ve exercised the vertex-centric portion of SAL. We developed a SAL pro-

gram for the CTU-13 [63] dataset and gathered both classifier and scaling performance. The trained

classifier obtained an average AUC of the ROC of 0.87. In an operational setting, this could be

used to greatly reduce the amount of traffic needing to be analyzed. SAL can be used as a first

pass over the data using space and temporally-efficient streaming algorithms. After down-selecting,

more expensive algorithms can be applied to the remaining data.

In terms of scalability using SAM, we are able to scale to 61 nodes and 976 cores, obtaining

a throughput of 373,000 netflows per second or 32.2 billion per day. On completely load-balanced

data, we obtain greater efficiency out to 64 nodes than the real data, indicating that SAM could be

improved with a more intelligent partitioning strategy. In the end, we’ve shown SAL can be used

to succinctly express machine learning pipelines and that our implementation can scale to large

problem sizes.

Chapter 7

Temporal Subgraph Matching

In this chapter we explore the performance obtained by SAL and SAM with regards to

temporal subgraph matching. We focus on the temporal triangle query presented earlier in Chapter

2 but repeated in its entirety in Listing 7.1 below for convenience. The query is looking for a set

of three vertices, x1, x2, and x3, where there are three edges forming the triangle: x1 → x2 →

x3→ x1. Also, edge e1 happens before e2, which happens before e3. All together the edges must

occur within a ten second window. We chose this query because it is complex enough to meaningfully exercise the framework, but simple enough that it could be a component of a larger query.

The contributions of this chapter are the following:

• We present scaling performance of SAM, showing that it continues to increase throughput

to 128 nodes or 2560 cores, the maximum cluster size we could reserve.

• We also compare SAL and SAM to Apache Flink [12], which is another framework designed

and built for streaming applications. The comparison reveals that SAM and Flink possess different strengths depending on the parameter regime. SAM does better when triangles

occur frequently (roughly greater than five triangles per second per node), while Flink

achieves the best throughput when triangles are rare. Also, while SAM scales to 128 nodes,

Flink plateaus after 32 nodes.

• SAL again provides savings in terms of programming complexity. The full SAL program

for temporal triangles requires 13 lines of code while a direct implementation using SAM


takes 238 lines and the Apache Flink approach is 228 lines long.

• We simulate vertex constraints that would filter out many intermediate results. In this

situation SAM obtains a rate of 1 million netflows per second or 91.8 billion netflows per

day.

Listing 7.1: Temporal Triangle Query

1 Netflows = VastStream("localhost", 9999);
2 PARTITION Netflows BY SourceIp, DestIp;
3 HASH SourceIp WITH IpHashFunction;
4 HASH DestIp WITH IpHashFunction;
5 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)
6 {
7   x1 e1 x2;
8   x2 e2 x3;
9   x3 e3 x1;
10  start(e3) - start(e1) <= 10;
11  start(e1) < start(e2);
12  start(e2) < start(e3);
13 }

For all of our experiments in this chapter we again use Cloudlab [168]. In particular, we make

use of the Wisconsin cluster using node type c220g5, which has two Intel Xeon Silver 4114 10-core

CPUs at 2.20 GHz and 192 GB ECC DDR4-2666 memory per node. We run experiments up to

128 nodes or 2560 cores. This is in contrast to Chapter 6 which had 16 cores per node.

Section 7.1 explores hyperparameters that were imperative for SAM scaling. Section 7.2

discusses the Apache Flink implementation of the triangle query that we used for comparison.

Then in Section 7.3 we present the comparison between SAM and Flink running the same query.

Section 7.4 simulates vertex constraints that would filter out many initial intermediate results,

giving us an idea of the performance when the subgraph pattern is relatively rare.


Figure 7.1: We vary the number of pull threads and the number of push sockets (NPS) for the RequestCommunicator to determine the best setting for maximum throughput. The RequestMap object utilizes the RequestCommunicator during the process(edge) function.

7.1 SAM Parameter Exploration

In doing performance analysis of SAM and the algorithm outlined in Section 5.3.2, we found

that a large portion of the time was spent in the process(edge) function of the RequestMap

object. In particular, the time spent sending edges owned by one node to other nodes was significant.

As such it became the focus of performance enhancements.

The solution we developed, discussed earlier in Section 5.3.1.3 and shown in Figure 5.2, was

to create multiple push sockets to send data to an individual node, and also multiple pull threads

to receive the data. Having multiple push sockets prevents hot spots, where multiple threads are

trying to use the same socket. Multiple pull threads allow data ingestion to happen in parallel.
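To make the pattern concrete, the following is a minimal sketch of multiple push sockets and multiple pull threads using the ZeroMQ Java binding. It is illustrative only: SAM is not written in Java, and the class name, port numbers, and thread-id socket choice shown here are our assumptions rather than SAM's actual code.

    import org.zeromq.ZMQ;

    public class EdgeExchangeSketch {
      public static void main(String[] args) {
        final int numPullThreads = 16;   // threads ingesting edges sent by peers
        final int numPushSockets = 4;    // sockets per destination node
        final ZMQ.Context context = ZMQ.context(1);

        // Pull side: several threads bind distinct ports and ingest edges in parallel.
        for (int i = 0; i < numPullThreads; i++) {
          final int port = 10000 + i;
          new Thread(() -> {
            ZMQ.Socket pull = context.socket(ZMQ.PULL);
            pull.bind("tcp://*:" + port);
            while (!Thread.currentThread().isInterrupted()) {
              byte[] edge = pull.recv(0);  // blocking receive of one serialized edge
              // ... hand the edge to the local subgraph matching data structures ...
            }
          }).start();
        }

        // Push side: several sockets to the same destination avoid the hot spot
        // created when many worker threads contend for a single socket.
        ZMQ.Socket[] push = new ZMQ.Socket[numPushSockets];
        for (int i = 0; i < numPushSockets; i++) {
          push[i] = context.socket(ZMQ.PUSH);
          push[i].connect("tcp://destination-node:" + (10000 + i));
        }

        // A worker thread picks one of the push sockets, e.g. by its thread id.
        byte[] serializedEdge = new byte[0];  // placeholder payload
        int chosen = (int) (Thread.currentThread().getId() % numPushSockets);
        synchronized (push[chosen]) {         // ZeroMQ sockets are not thread safe
          push[chosen].send(serializedEdge, 0);
        }
      }
    }

The important points are simply that receiving work is spread over several threads and that sends to a given node are spread over several sockets.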

To find the ideal combination of push sockets and pull threads, we performed the following

experiment. We ran the temporal triangle query of Listing 2.3 with the following parameters:

• The table capacities for the ResultMap, CSR, CSC, and RequestMap were all set to be

2 million.

• The number of unique vertices, |V |, was set to 50,000.

• The rate at which edges were produced was 1000 per second.


• The width of the time window was 11 seconds. If an item is more than 11 seconds old, it

is deleted. Since the length of the query was 10 seconds, this gives us a little buffer if data

comes in late from other nodes.

• We allocated a 64 node cluster on Wisconsin. As mentioned earlier, these nodes have 20

cores.

With the above parameters, we then varied the number of pull threads from 1 to 24 and

the number of push sockets from 1 to 8. Figure 7.1 displays the average time spent in RequestMap.process(edge), the function that calls RequestCommunicator.send(message) to

send out edges to other nodes according to recorded requests. The dominating factor to reduce the

time spent in RequestMap.process(edge) was to increase the number of pull threads to around

16 threads. No sizeable benefit is garnered by adding more than 16 threads, likely due to the fact

that each node had 20 total cores. We did do some experiments with more than 24 threads, but

the constant thread context switching became overwhelming and SAM could not keep pace.

The other trend is that adding multiple push sockets provides some benefit. There is a large jump in

performance from 1 to 2 push sockets. The trend continues, and is especially clear when there is

just one pull thread. However, at 16 pull threads, the distinction becomes more muddled.

There are some oddities that we are currently at a loss to explain. For example, when the number of push sockets is 3, there is a large dip when the number of pull threads is 10. This behavior appears

to be repeatable but we have not investigated thoroughly. Regardless, using 16 pull threads and 4

push sockets appears to be a good parameter setting for this experiment, and that is what we used

for the Flink/SAM comparison in Section 7.3.

As a result of our technique, we were able to increase both the network and processor utiliza-

tion. Using one push socket per node in the cluster and one pull thread for all receiving sockets, we

sent about 20-30 thousand packets a second. With 16 pull threads and 4 push sockets, we pushed

that up to about 225 thousand packets a second. We also increased CPU usage: it went from about 2-3 cores to almost full utilization at 15-20 cores. With this increased


utilization we were able to obtain scaling on problems to 128 nodes, as we will see in Sections 7.3

and 7.4.

7.2 Apache Flink

Currently, SAM is the implementation target for SAL: SAL programs are converted into

SAM code using the Scala Parser Combinator [175]. However, we wanted to compare how another

framework would perform. It is certainly possible to map SAL into other languages and frameworks.

A prime candidate for evaluation is Apache Flink [12], which was custom designed and built with

streaming applications in mind. In this section, we outline the approach taken for implementing the

triangle query of Listing 2.3 using Apache Flink. Once the groundwork has been laid, the actual

specification of the algorithm is relatively succinct. The Java code using Flink 1.6 is presented in

Listing 7.2. The full code is provided in Appendix A.2.


Listing 7.2: Apache Flink Triangle Implementation

1   final StreamExecutionEnvironment env =
2       StreamExecutionEnvironment.getExecutionEnvironment();
3   env.setParallelism(numSources);
4   env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
5
6   NetflowSource netflowSource = new NetflowSource(numEvents, numIps, rate);
7   DataStreamSource<Netflow> netflows = env.addSource(netflowSource);
8
9   DataStream<Triad> triads = netflows
10      .keyBy(new DestKeySelector())
11      .intervalJoin(netflows.keyBy(new SourceKeySelector()))
12      .between(Time.milliseconds(0),
13               Time.milliseconds((long) queryWindow * 1000))
14      .process(new EdgeJoiner(queryWindow));
15
16  DataStream<Triangle> triangles = triads
17      .keyBy(new TriadKeySelector())
18      .intervalJoin(netflows.keyBy(new LastEdgeKeySelector()))
19      .between(Time.milliseconds(0),
20               Time.milliseconds((long) queryWindow * 1000))
21      .process(new TriadJoiner(queryWindow));
22
23  triangles.writeAsText(outputFile, FileSystem.WriteMode.OVERWRITE);

Lines 1 - 4 define the streaming environment, such as the number of parallel sources (generally

the number of nodes in the cluster), and specify that the event time of the netflow should be used as the

timestamp (as opposed to ingestion time or processing time). Lines 6 - 7 create a stream of

netflows. NetflowSource is a custom class that generates the netflows in the same manner as

was used for the SAL/SAM experiments. Lines 9 - 14 find triads, which are sets of two connected

edges that fulfill the temporal requirements. The variable queryWindow on line 13 was set to be

ten in our experiments, corresponding to a temporal window of 10 seconds.
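The NetflowSource class itself is not shown in Listing 7.2. Purely as an illustration of what such a source could look like, the sketch below generates edges with random endpoints at a fixed rate and assigns event-time timestamps; the field names, the Netflow constructor, and the rate control are our assumptions and not the actual code used in the experiments.

    import java.util.Random;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;
    import org.apache.flink.streaming.api.watermark.Watermark;

    // Hypothetical sketch of a NetflowSource-style generator. It draws source and
    // destination IPs uniformly from numIps vertices, attaches event-time
    // timestamps (required by TimeCharacteristic.EventTime), and paces itself to
    // roughly `rate` edges per second.
    public class RandomNetflowSource implements SourceFunction<Netflow> {
      private final long numEvents;
      private final int numIps;
      private final double rate;              // edges per second
      private volatile boolean running = true;

      public RandomNetflowSource(long numEvents, int numIps, double rate) {
        this.numEvents = numEvents;
        this.numIps = numIps;
        this.rate = rate;
      }

      @Override
      public void run(SourceContext<Netflow> ctx) throws Exception {
        Random rng = new Random();
        long sleepMillis = (long) (1000.0 / rate);
        for (long i = 0; i < numEvents && running; i++) {
          long eventTime = System.currentTimeMillis();
          Netflow flow = new Netflow(eventTime,
                                     "ip" + rng.nextInt(numIps),    // source IP
                                     "ip" + rng.nextInt(numIps));   // destination IP
          ctx.collectWithTimestamp(flow, eventTime);
          ctx.emitWatermark(new Watermark(eventTime));
          if (sleepMillis > 0) {
            Thread.sleep(sleepMillis);
          }
        }
      }

      @Override
      public void cancel() {
        running = false;
      }
    }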

Line 11 uses an interval join. Flink has different types of joining approaches. The one that

was most appropriate for our application is the interval join. It takes elements from two streams (in

this case the same stream netflows), and defines an interval of time over which the join can occur.

In our case, the interval starts with the timestamp of the first edge and it ends queryWindow

seconds later. This join is a self-join: it merges the netflows data stream keyed by the destination

IP with the netflows data stream keyed by the source IP. In the end we have a stream of pairs of

edges with three vertices of the form A → B → C.
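The key selectors used by this join are small classes. A sketch of what DestKeySelector might look like follows, assuming (our assumption, not the thesis code) that Netflow exposes its destination IP as a string field.

    import org.apache.flink.api.java.functions.KeySelector;

    // Keys the netflow stream by destination IP so the interval join can match
    // target(e1) against source(e2). SourceKeySelector would be identical except
    // that it returns the source IP.
    public class DestKeySelector implements KeySelector<Netflow, String> {
      @Override
      public String getKey(Netflow flow) {
        return flow.destIp;   // field name assumed for illustration
      }
    }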

Once we have triads that fulfill the temporal constraints, we can define the last piece: finding

another edge that completes the triangle, C → A. Lines 16 - 21 perform another interval join, this

time between the triads and the original netflows data streams. The join is defined to occur such

that the time difference between the starting time of the last edge and the starting time of the first

edge must be less than queryWindow seconds.
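A sketch of what an EdgeJoiner-style join function could look like is shown below. It is a ProcessJoinFunction that receives a pair of edges already matched on target(e1) = source(e2) by the keyed interval join, and emits a triad only when the temporal ordering and window constraints hold. The field names and the Triad constructor are assumptions; the actual code in Appendix A.2 may differ in detail.

    import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
    import org.apache.flink.util.Collector;

    // Hypothetical sketch: e1 arrives from the stream keyed by destination IP and
    // e2 from the stream keyed by source IP, so the join key already enforces
    // target(e1) = source(e2). The function checks start(e1) < start(e2) and that
    // the pair fits inside the query window before emitting a Triad.
    public class EdgeJoinerSketch extends ProcessJoinFunction<Netflow, Netflow, Triad> {
      private final double queryWindow;   // seconds

      public EdgeJoinerSketch(double queryWindow) {
        this.queryWindow = queryWindow;
      }

      @Override
      public void processElement(Netflow e1, Netflow e2, Context ctx, Collector<Triad> out) {
        boolean ordered = e1.startTime < e2.startTime;                          // start(e1) < start(e2)
        boolean inWindow = (e2.startTime - e1.startTime) <= queryWindow * 1000; // milliseconds
        if (ordered && inWindow) {
          out.collect(new Triad(e1, e2));
        }
      }
    }

TriadJoiner plays the analogous role for the closing edge, additionally checking that start(e3) - start(e1) stays within the query window.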

Overall, the complete code is 228 lines long, and can be found in Appendix A.2. This number

includes the code that would likely need to be generated were a mapping between SAL and Flink

completed. This includes classes to do key selection (SourceKeySelector, DestKeySelector,

LastEdgeKeySelector, TriadKeySelector), the Triad and Triangle classes, and classes to

perform joining (EdgeJoiner and TriadJoiner). However, the code for the Netflow and NetflowSource classes was not included, as those classes would likely be part of the library as

they are with SAM. For comparison, the SAL version of this query is 13 lines long.

7.3 Comparison Results: SAM vs Flink

In this section we compare the performance of SAM with the custom Apache Flink imple-

mentation. In both cases, we run the triangle query of Listing 2.3, which matches triangles that

occur within ten seconds of the start time of the first edge. We examine weak scaling, wherein the

number of elements per node is kept constant as we increase the number of nodes. In all runs, each


node is fed 2,500,000 edges. The edges are netflows, as that is our intended initial target domain.

There are many different formats for netflows. We use the format used by the VAST Challenge

2013: Mini-Challenge 3 dataset [201], where each netflow has 19 fields. However, for this experi-

ment we only change the source and destination IPs, which are randomly selected from a pool of

|V | vertices for each netflow generated. We find the highest rate achievable by each approach by

incrementally increasing the rate at which edges are produced until the throughput is too great for

the method to keep pace.

For the Apache Flink runs we set both the job manager and task manager heap sizes to be

32 GB. The number of taskmanager task slots was set to 1. As we only ever ran one Flink job

at a time, this seemed to be the most appropriate setting. We set the number of sources to be the

size of the cluster.

We can obtain an estimate of the number of triangles as a function of r, the rate of edges produced per second, |V|, the number of unique vertices, |E|, the number of edges, t, the length of

the query in seconds, and n, the number of nodes in the cluster.

Theorem 2. Let r be the rate at which edges are presented per node. Let |V | be the number of

vertices from which target and source vertices are randomly selected. Let t be the length of the

query in seconds for the triangle query from Listing 2.3. Let n be the number of nodes in the

cluster. Assuming a continuous stream of edges, if we select a contiguous set of |E| edges from

each node from the stream of edges, the total number of expected triangles fulfilling the query of

Listing 2.3 is $\frac{n|E|}{2|V|}\left(\frac{nrt}{|V|}\right)^2$.

Proof. For each edge $e_a$ that is presented in the stream, there is a set, $E_b$, of edges that occur after $e_a$ and fulfill the temporal constraints, where $|E_b| = nrt$. For each $e_b \in E_b$, the probability of $target(e_a)$ being equal to $source(e_b)$ is $\frac{1}{|V|}$. Thus the expected number of triads matching the first two edges of Listing 2.3 for each edge $e_a$ is $\frac{nrt}{|V|}$. For each matching triad, there is a set, $E_c$, of edges that occur after $e_b$ and fulfill the temporal constraints. However, since triads are expected to occur uniformly over the time period $t$, and the query must be completed within $t$ seconds, each triad does not have the same expected number of edges in $E_c$. For the first triad, the expected number of matching edges is $\frac{nrt}{|V|^2}$, because the first triad has the entire $t$ seconds worth of edges to match against, and each edge must fulfill both $target(e_b) = source(e_c)$ and $target(e_c) = source(e_a)$. The last triad occurs near the end of the time interval, and is expected to have no edges to match against, so it contributes 0 expected triangles. The series of expected triangles from the first triad to the last forms an arithmetic sequence of length $\frac{nrt}{|V|}$, so the expected number of triangles per edge is $\frac{1}{2|V|}\left(\frac{nrt}{|V|}\right)^2$. Since there are $n$ nodes, each producing $|E|$ edges, the total number of expected triangles over all the edges is $\frac{n|E|}{2|V|}\left(\frac{nrt}{|V|}\right)^2$.

For both implementations, we kept increasing the rate until the run time exceeded $|V|/r + 15$

seconds, where r is the rate of edges produced per second. Latencies longer than 15 seconds usually

indicated that the implementation was unable to keep pace. Also, we stopped increasing the rate

when the number of triangles was less than 85% of the expected value. Generally the solutions stay

above this value unless they are struggling.
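To make Theorem 2 concrete, consider an illustrative parameter setting chosen only for the arithmetic (these are not values taken from the experiments below): $n = 16$ nodes, $r = 100$ edges per second per node, $|V| = 10{,}000$ vertices, $t = 10$ seconds, and $|E| = 2{,}500{,}000$ edges per node. Then the expected number of triangles is

$$\frac{n|E|}{2|V|}\left(\frac{nrt}{|V|}\right)^2 = \frac{16 \cdot 2{,}500{,}000}{2 \cdot 10{,}000}\left(\frac{16 \cdot 100 \cdot 10}{10{,}000}\right)^2 = 2{,}000 \cdot (1.6)^2 = 5{,}120.$$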

Figure 7.2 shows the overall max aggregate throughput achieved by SAM and Flink. We

varied the number of vertices, |V|, to be 5,000, 10,000, 25,000, 50,000, 100,000, and 150,000. For |V| = 5,000 or 10,000, SAM obtained the best overall rate, continuing to garner increasing

aggregate throughput to 120 nodes (we could only obtain 120 nodes for this experiment, the next

section reports 128 node performance). Flink struggled with more than 32 nodes. While Flink

continued to report results, the percentage of expected triangles fell dramatically to about 60-70%

for 64 nodes, and around 50% for 120 nodes. While Flink was not able to scale past 32 nodes,

it showed the best overall throughput for |V| = 50,000 and 150,000. |V| = 25,000 was near the

middle ground, with Flink performing slightly better, with 32 nodes obtaining an aggregate rate of

60,800 edges per second while SAM obtained 60,000 edges per second on 120 nodes.

Figure 7.3 provides another view of the data. We show the max aggregate throughput versus

a measure of the complexity of the problem: the total number of triangles. The graph emphasizes

the fact that SAM performs better with more complex problems, i.e. with more total triangles to


Figure 7.2: Shows the maximum throughput for SAM and Flink. We vary the number of nodes with values of 16, 32, 64, and 120. We vary the number of vertices, |V|, between 25,000 and 150,000. Flink has the best throughput with 32 nodes, after which adding additional nodes hurts the overall aggregate rate and the number of found triangles continues to decrease. SAM is able to scale to 120 nodes.

Figure 7.3: Another view of the data where we plot the aggregate throughput versus the total number of triangles to find. This view emphasizes that SAM does better with more complex problems, i.e. more triangles.


Figure 7.4: Another look at the maximum throughput for SAM and Flink. Here we relax the restriction that the number of reported triangles must be higher than 85% of expected, but keep the latency restriction that the computation must complete within 15 seconds of termination of input. Flink struggles increasing to 64 nodes, and then drops consistently with 120 nodes. SAM continues to increase throughput through 120 nodes.

Figure 7.5: This graph shows the ratio of found triangles divided by the number of expected triangles given by Theorem 2. The rate for each method was the same as from Figure 7.4. SAM stays fairly consistent throughout the runs, with an average of 0.937. For Flink, the ratio decreases as the number of nodes increases. The average ratio starts at 0.873 with 16 nodes, and then decreases to an average of 0.521 with 120 nodes.


find.

Figure 7.4 displays the max aggregate throughput again, but this time without the restriction

that the number of reported triangles must be higher than 85% of expected. This figure emphasizes

the fact that Flink struggles after 32 nodes. Figure 7.5 examines the ratio of found triangles divided

by expected triangles for each approach. SAM stays fairly consistent, with an average of 93.7% of

expected. Flink decreases as the number of nodes increases, starting with 87.3% of expected on

average with 16 nodes, then decreasing to 52.1% of expected with 120 nodes.

To summarize this section, SAM and Flink have competing strengths. SAM excels when the

triangle formation rate is greater than 5 per second per node, and SAM scales to 120 nodes. Flink

does better when triangles are relatively rare, but cannot scale past 32 nodes. Both approaches

show some loss due to network latencies. If some data arrives too late, it is not included in the

calculations for either SAM or Flink. However, SAM has an advantage in this regard, showing a consistent found-triangle ratio throughout the processor count range, one that is higher than Flink's.

7.4 Simulating other Constraints

We envision SAL being used to not only select queries based on the subgraph structure, but

also on vertex-based constraints. The Watering Hole query of Listing 2.1 is an example. There are

constraints defined on both the bait and controller vertices. In this section we simulate vertex

constraints by randomly dropping the first edge of the triangle query, i.e. we process the netflow

through SAM, but we force the edge to not match the first edge of the triangle query. This will

give us an idea of SAM’s performance when the subgraph queries are more selective. The triangle

query of Listing 2.3 as specified creates an intermediate result for every consumed netflow since

there are no vertex constraints.
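A minimal sketch of this simulated constraint is shown below. It only illustrates the idea; SAM is not written in Java, and the helper names are hypothetical.

    import java.util.Random;

    // Sketch of the simulated vertex constraint: every netflow is still processed
    // and may extend existing partial results, but it is only allowed to start a
    // new partial result (i.e. match as the first edge e1 of the triangle) with
    // probability keepProbability.
    public class FirstEdgeFilterSketch {
      private final double keepProbability;   // e.g. 1.0, 0.01, or 0.001
      private final Random rng = new Random();

      public FirstEdgeFilterSketch(double keepProbability) {
        this.keepProbability = keepProbability;
      }

      public void process(Netflow flow) {
        matchAgainstExistingPartialResults(flow);   // the edge always acts as a potential e2 or e3
        if (rng.nextDouble() < keepProbability) {
          registerAsCandidateFirstEdge(flow);       // it only rarely acts as e1
        }
      }

      private void matchAgainstExistingPartialResults(Netflow flow) { /* elided */ }
      private void registerAsCandidateFirstEdge(Netflow flow) { /* elided */ }
    }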

We compare SAM’s performance over three settings: keeping the first edge all of the time,

1% of the time, and 0.1% of the time. We vary |V| to be 25,000, 50,000, 100,000, and 150,000.

Figure 7.6 presents the results. The general trends are as expected. As we increase |V |, the number

of triangles created decreases, and the rate goes up. Also, increasing the rate at which edges are kept decreases the maximum achievable throughput, again because more triangles are being

produced, which causes a greater computational burden. The maximum achievable rate is 1,062,400

netflows per second, when we keep 0.01% of the first edges and |V| = 150,000.

Figure 7.6: We keep the first edges of queries (kq) with probabilities {0.01, 0.1, 1}. We also vary |V| to range from 25,000 to 150,000. This figure is a plot of edges/second versus the number of nodes.

As the number of triangles is likely the predominant factor in determining maximum through-

put, another way of looking at the data is to plot the rate versus triangles/second, which we do in

Figure 7.7. We only examine the runs with 128 nodes, and vary the number of kept queries. Also,

in this figure we calculate the rate as edges/day. As expected, the rate at which triangles occur is

inversely proportional to maximum throughput. The maximum rate achieved is 91.8 billion edges

per day.
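As a quick check on the conversion between the two figures (the only assumption being 86,400 seconds in a day), $1{,}062{,}400 \times 86{,}400 \approx 9.18 \times 10^{10}$, so the 91.8 billion edges per day reported here is simply the per-second maximum from the previous experiment expressed per day.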

7.5 Summary

In this chapter we presented performance of SAM on subgraph queries, namely a temporal

triangle query, and compared SAM to a custom solution written for Apache Flink. SAM excels

when triangles occur frequently (greater than 5 per second per node) while Flink is best when

triangles are rare. For all parameter settings, SAM showed improved throughput to 128 nodes

or 2560 cores, while Flink’s rate started to decrease after 32 nodes. For more selective queries,


Figure 7.7: We vary the rate at which the first edges of queries are kept (kq = keep queries) and plot the edges per day versus the rate at which triangles occur.

SAM is able to obtain an aggregate rate of 1 million netflows per second. While we have presented

SAM as the implementation for SAL, Apache Flink shows promise as another potential underlying

backend. Future work understanding the performance differences between both platforms will likely

be beneficial for both approaches.

We also have further evidence that programming in SAL is an efficient way to express stream-

ing subgraph queries. SAL requires only 13 lines of code, while programming directly with SAM is

238 lines and the Flink approach is 228 lines.

Chapter 8

Conclusions

In this dissertation we have presented a new domain specific language called the Streaming

Analytics Language or SAL. SAL ingests streaming data in the form of tuples, and allows analysis

that combines graph theory, streaming, temporal information, and machine learning. In particular

we examined streaming cyber data and showed how SAL can be used to express a wide range

of queries relevant to cyber security and detecting malicious patterns. We used Verizon’s Data

Breach Investigation Report as a backdrop for discussing categories of attacks involving computer

networks, and showed how SAL can be utilized to detect malicious patterns. We also showed that

SAL programs are succinct, requiring 10-25 times fewer lines of code than writing directly with our library, the

Streaming Analytics Machine or SAM, or with another framework such as Apache Flink.

As another contribution, we have presented SAM, which is one possible implementation target

of SAL. We examined the scaling performance of SAM on two different problems: a vertex-centric

computation to detect botnet activity and a temporal subgraph query. We showed scaling to 61

nodes or 976 cores for the vertex-centric computations, a scaling result not seen for related netflow

analysis. We also garnered an average area under the curve of the receiver operating characteristic

of 0.87 in detecting botnet activity. For subgraph matching, we showed scaling to 128 nodes or

2560 cores for temporal triangle finding. An alternative approach using custom written Apache

Flink code could not scale past 32 nodes, and SAM had an overall higher throughput when the

triangles occurred at a rate greater than 5 per second per node.

For future work in terms of the underlying implementation, a fecund avenue could be to


combine the strengths of our approach, SAM, with that of streaming frameworks such as Apache

Flink. We showed that SAM excels at subgraph matching when the pattern is frequent; however,

Flink was much more efficient with rare patterns. If we can tune the approach based on the

distribution present in the data, SAL would be able to handle high throughput streaming data

with greater robustness.

In terms of the expressibility of SAL, we primarily focused on homogeneous data - all data

sources were of the same type, namely netflows. However, cyber has many different levels and

layers that could be combined in the same workflow. For example, host-based intrusion detection

systems can generate alerts that can then be combined with network-level analysis. In terms of

Point of Sale breaches, unusual memory activity by a process on a Point of Sale machine might be

a weak indicator on its own, but combining it with corresponding network traffic that indicates exfiltration from the same machine could turn these separate alerts into one strong signal. SAL

could be used to succinctly express this combination of patterns, allowing for fewer false positives

and overall stronger indicators of malicious activity.

Finally, network telemetry is another avenue of research where SAL could be extended. Our

work so far has assumed that netflow generation has already been taken care of at a variety of sensors.

We then ingest the netflows and perform our analysis with SAL. However, switches today allow for

configurable programmable packet forwarding, opening opportunities for SAL to be compiled as a

combination of switch computing resources and an analytics plane.

Bibliography

[1] Resource description framework. W3C Recommendation, February 2014.

[2] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uur etintemel, Mitch Cherniack,Jeong Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina,Nesime Tatbul, Ying Xing, and Stan Zdonik. The design of the borealis stream processingengine. In 2nd Biennial Conference on Innovative Data Systems Research, CIDR 2005, pages277–289, 2005.

[3] Daniel J. Abadi, Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, SangdonLee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: A new model andarchitecture for data stream management. The VLDB Journal, 12(2):120–139, August 2003.

[4] Martın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Good-fellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, LukaszKaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mane, Rajat Monga, Sherry Moore,Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever,Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, OriolVinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software availablefrom tensorflow.org.

[5] A. Ahmed, A. Lisitsa, and C. Dixon. A misuse-based network intrusion detection systemusing temporal logic and stream processing. In Network and System Security (NSS), 20115th International Conference on, pages 1–8, Sept 2011.

[6] Faruk Akgul. ZeroMQ. Packt Publishing, 2013.

[7] Muhammad Intizar Ali, Feng Gao, and Alessandra Mileo. Citybench: A configurable bench-mark to evaluate rsp engines using smart city datasets. In International Semantic WebConference, 2015.

[8] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, andBurton Smith. The tera computer system. SIGARCH Comput. Archit. News, 18(3b):1–6,June 1990.

[9] Darko Anicic, Paul Fodor, Sebastian Rudolph, and Nenad Stojanovic. Ep-sparql: A unifiedlanguage for event processing and stream reasoning. In Proceedings of the 20th International


Conference on World Wide Web, WWW ’11, pages 635–644, New York, NY, USA, 2011.ACM.

[10] Apache. Apache apex. apex.apache.org, 2019.

[11] Apache. Apache beam. https://beam.apache.org, 2019.

[12] Apache. Apache flink. flink.apache.org, 2019.

[13] Apache. Apache gearpump. gearpump.apache.org, 2019.

[14] Arvind Arasu and Gurmeet Singh Manku. Approximate counts and quantiles over slidingwindows. In Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposiumon Principles of Database Systems, PODS ’04, pages 286–296, New York, NY, USA, 2004.ACM.

[15] Aurelius. Titan. http://thinkaurelius.github.io/titan/, 2015.

[16] Brain Babcock, Mayur Datar, Rajeev Motwani, and Liadan O’Callaghan. Maintainingvariance and k-medians over data stream windows. In Proceedings of the Twenty-secondACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’03, pages 234–243, New York, NY, USA, 2003. ACM.

[17] D. A. Bader, J. Berry, A. Amos-Binks, D. Chavarra-Miranda, C. Hastings, K. Madduri, andS. C. Poulous. Stinger: Spatio-temporal interaction networks and graphs (sting) extensiblerepresentation. Technical report, Georgia Institute of Technology, 2009.

[18] David A. Bader and Kamesh Madduri. Snap, small-world network analysis and partitioning:An open-source parallel graph framework for the exploration of large-scale networks. InIPDPS, pages 1–12. IEEE, 2008.

[19] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. Groute: An asynchronousmulti-gpu programming model for irregular computations. In Proceedings of the 22Nd ACMSIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’17, pages235–248, New York, NY, USA, 2017. ACM.

[20] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. Groute: An asynchronousmulti-gpu programming model for irregular computations. SIGPLAN Not., 52(8):235–248,January 2017.

[21] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American,284(5):34–43, May 2001.

[22] Jonathan W. Berry, Bruce Hendrickson, Simon Kahan, and Petr Konecny. Software and algo-rithms for graph queries on multithreaded architectures. Parallel and Distributed ProcessingSymposium, International, 0:495, 2007.

[23] Leyla Bilge, Davide Balzarotti, William Robertson, Engin Kirda, and Christopher Kruegel.Disclosure: Detecting botnet command and control servers through large-scale netflow analy-sis. In Proceedings of the 28th Annual Computer Security Applications Conference, ACSAC’12, pages 129–138, New York, NY, USA, 2012. ACM.


[24] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

[25] Andre Bolles, Marco Grawunder, and Jonas Jacobi. Streaming sparql extending sparql toprocess data streams. In Proceedings of the 5th European Semantic Web Conference on TheSemantic Web: Research and Applications, ESWC’08, pages 448–462, Berlin, Heidelberg,2008. Springer-Verlag.

[26] Dario Bonfiglio, Marco Mellia, Michela Meo, Dario Rossi, and Paolo Tofanelli. Revealingskype traffic: When randomness plays with you. SIGCOMM Comput. Commun. Rev.,37(4):37–48, August 2007.

[27] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, ColeSchlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. P4: Pro-gramming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev.,44(3):87–95, July 2014.

[28] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.

[29] Yingyi Bu, Vinayak Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. Pregelix:Big(ger) graph analytics on a dataflow engine. Proc. VLDB Endow., 8(2):161–172, October2014.

[30] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods forcyber security intrusion detection. IEEE Communications Surveys Tutorials, 18(2):1153–1176, Secondquarter 2016.

[31] Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of struc-tural and functional systems. Nature Reviews Neuroscience, 10:186–198, 2009.

[32] Aydın Buluc and John R Gilbert. The combinatorial blas: Design, implementation, andapplications. Int. J. High Perform. Comput. Appl., 25(4):496–509, November 2011.

[33] Vernon K.C. Bumgardner and Victor W. Marek. Scalable hybrid stream and hadoop net-work analysis system. In Proceedings of the 5th ACM/SPEC International Conference onPerformance Engineering, ICPE ’14, pages 219–224, New York, NY, USA, 2014. ACM.

[34] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, and John C. Mitchell. The end is nigh:Generic solving of text-based captchas. In Proceedings of the 8th USENIX Conference onOffensive Technologies, WOOT’14, pages 3–3, Berkeley, CA, USA, 2014. USENIX Associa-tion.

[35] Jean-Paul Calbimonte, Oscar Corcho, and Alasdair J. G. Gray. Enabling ontology-basedaccess to streaming data sources. In Proceedings of the 9th International Semantic WebConference on The Semantic Web - Volume Part I, ISWC’10, pages 96–111, Berlin, Heidel-berg, 2010. Springer-Verlag.

[36] Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Sei-dman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Monitoring streams: A newclass of data management applications. In Proceedings of the 28th International Conferenceon Very Large Data Bases, VLDB ’02, pages 215–226. VLDB Endowment, 2002.


[37] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. Powerlyra: Differentiated graphcomputation and partitioning on skewed graphs. In Proceedings of the Tenth EuropeanConference on Computer Systems, EuroSys ’15, pages 1:1–1:15, New York, NY, USA, 2015.ACM.

[38] J. Cheng, Q. Liu, Z. Li, W. Fan, J. C. S. Lui, and C. He. Venus: Vertex-centric streamlinedgraph computation on a single pc. In 2015 IEEE 31st International Conference on DataEngineering, pages 1131–1142, April 2015.

[39] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Donald Carney, UgurCetintemel, Ying Xing, and Stan Zdonik. Scalable Distributed Stream Processing. In CIDR2003 - First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, Jan-uary 2003.

[40] Francois Chollet et al. Keras. https://keras.io, 2015.

[41] James R. Clapper. Statement for the record, worldwide threat assessment of the us intelligencecommunity, January 2014.

[42] P. Clifford and I. A. Cosma. A statistical analysis of probabilistic counting algorithms. ArXive-prints, January 2008.

[43] Cisco Corporation. Cisco visual networking index: Forcast and methodology, 2014-2019,2015.

[44] Matthew Crosston. World gone cyber mad: How mutually assured debilitation is the besthope for cyber deterrence. Strategic Studies Quarterly, 5(1), Spring 2011.

[45] Damballa. The command structure of the aurora botnet. Technical report, Damballa, 2010.

[46] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statis-tics over sliding windows: (extended abstract). In Proceedings of the Thirteenth AnnualACM-SIAM Symposium on Discrete Algorithms, SODA ’02, pages 635–644, Philadelphia,PA, USA, 2002. Society for Industrial and Applied Mathematics.

[47] Mayur Datar and Rajeev Motwani. The sliding-window computational model and results. InCharu C. Aggarwal, editor, Data Streams: Models and Algorithms (Advances in DatabaseSystems). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[48] Mayur Datar and S. Muthukrishnan. Estimating Rarity and Similarity over Data StreamWindows, pages 323–335. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

[49] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden,Marc Snir, and Keshav Pingali. Gluon: A communication-optimizing substrate for distributedheterogeneous graph analytics. In Proceedings of the 39th ACM SIGPLAN Conference onProgramming Language Design and Implementation, PLDI 2018, pages 752–768, New York,NY, USA, 2018. ACM.

[50] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden,Marc Snir, and Keshav Pingali. Gluon: A communication-optimizing substrate for distributedheterogeneous graph analytics. SIGPLAN Not., 53(4):752–768, June 2018.


[51] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters.Commun. ACM, 51(1):107–113, January 2008.

[52] Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The second-generation onionrouter. In Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13,SSYM’04, pages 21–21, Berkeley, CA, USA, 2004. USENIX Association.

[53] Hristo Djidjev, Gary Sandine, Curtis Storlie, and Scott Vander Wiel. Graph Based StatisticalAnalysis of Network Traffic. In Proceedings of the Ninth Workshop on Mining and Learningwith Graphs, August 2011.

[54] D. Ediger, K. Jiang, J. Riedy, D. A. Bader, C. Corley, R. Farber, and W. N. Reynolds.Massive social network analysis: Mining twitter for social good. In 2010 39th InternationalConference on Parallel Processing, pages 583–593, Sep. 2010.

[55] David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. Massive streaming dataanalytics: A case study with clustering coefficients. In IPDPS Workshops, pages 1–8. IEEE,2010.

[56] Github Engineering. February 28th ddos incident report.https://githubengineering.com/ddos-incident-report/, 3 2018.

[57] J. Erman, A. Mahanti, and M. Arlitt. Qrp05-4: Internet traffic identification using machinelearning. In IEEE Globecom 2006, pages 1–6, Nov 2006.

[58] FBI. The insider threat.

[59] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. Ongraph problems in a semi-streaming model. Theor. Comput. Sci., 348(2):207–216, December2005.

[60] James Wood Forsyth Jr. and Billy E. Pope. Structural causes and cyber effects why interna-tional order is inevitable in cyberspace. Strategic Studies Quarterly, 8(4), Winter 2014.

[61] James Wood Forsyth Jr. and Billy E. Pope. Structural causes and cyber effects: A responseto our critics. Strategic Studies Quarterly, 9(2), Summer 2015.

[62] Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle, and MichaelGrossniklaus. C-sparql: A continuous query language for rdf data streams. InternationalJournal of Semantic Computing, 4:487, 03 2010.

[63] S. Garca, M. Grill, J. Stiborek, and A. Zunino. An empirical comparison of botnet detectionmethods. Computers & Security, 45(Supplement C):100 – 123, 2014.

[64] Nishant Garg. Apache Kafka. Packt Publishing, 2013.

[65] Nishant Garg. Apache Kafka. Packt Publishing, 2013.

[66] Steve H. Garlik, Andy Seaborne, and Eric Prud’hommeaux. SPARQL 1.1 Query Language.http://www.w3.org/TR/sparql11-query/.


[67] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, and Myungcheol Doo. Spade:The system s declarative stream processing engine. In Proceedings of the 2008 ACM SIGMODInternational Conference on Management of Data, SIGMOD ’08, pages 1123–1134, New York,NY, USA, 2008. ACM.

[68] Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrao Costa, and Matei Ripeanu. Efficientlarge-scale graph processing on hybrid CPU and GPU systems. CoRR, abs/1312.3018, 2013.

[69] Iffat A. Gheyas and Ali E. Abdallah. Detection and prediction of insider threats to cybersecurity: a systematic literature review and meta-analysis. Big Data Analytics, 1(1):6, 2016.

[70] J.R. Gilbert, V.B. Shah, and S. Reinhardt. A unified framework for numerical and combina-torial computing. Computing in Science Engineering, 10(2):20–25, March 2008.

[71] Jan Goebel and Thorsten Holz. Rishi: Identify bot contaminated hosts by irc nicknameevaluation. In Proceedings of the First Conference on First Workshop on Hot Topics inUnderstanding Botnets, HotBots’07, pages 8–8, Berkeley, CA, USA, 2007. USENIX Associ-ation.

[72] Lukasz Golab, David DeHaan, Erik D. Demaine, Alejandro Lopez-Ortiz, and J. Ian Munro.Identifying frequent items in sliding windows over on-line packet streams. In Proceedingsof the 3rd ACM SIGCOMM Conference on Internet Measurement, IMC ’03, pages 173–178,New York, NY, USA, 2003. ACM.

[73] Lukasz Golab, David DeHaan, Erik D. Demaine, Alejandro Lopez-Ortiz, and J. Ian Munro.Identifying frequent items in sliding windows over on-line packet streams. In Proceedingsof the 3rd ACM SIGCOMM Conference on Internet Measurement, IMC ’03, pages 173–178,New York, NY, USA, 2003. ACM.

[74] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Pow-ergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pages17–30, Berkeley, CA, USA, 2012. USENIX Association.

[75] Eric Goodman, M. Nicole Lemaster, and Edward Jimenez. Scalable hashing for shared mem-ory supercomputers. In Proceedings of 2011 International Conference for High PerformanceComputing, Networking, Storage and Analysis, SC ’11, pages 41:1–41:11, New York, NY,USA, 2011. ACM.

[76] Eric L. Goodman and Dirk Grunwald. Using vertex-centric programming platforms to im-plement sparql queries on large graphs. In Proceedings of the 4th Workshop on IrregularApplications: Architectures and Algorithms, IA3 ’14, pages 25–32, Piscataway, NJ, USA,2014. IEEE Press.

[77] Eric L. Goodman and Edward Jimenez. Triangle finding: How graph theory can help thesemantic web. In Joint Workshop on Scalable and High-Performance Semantic Web Systems,2012.

[78] Google. Cloud dataflow. https://cloud.google.com/dataflow, 2019.


[79] I. Gorton, Z. Huang, Y. Chen, B. Kalahar, S. Jin, D. Chavarra-Miranda, D. Baxter, andJ. Feo. A high-performance hybrid computing approach to massive contingency analysis inthe power grid. In 2009 Fifth IEEE International Conference on e-Science, pages 277–283,Dec 2009.

[80] Oded Green, Robert McColl, and David A. Bader. A fast algorithm for streaming between-ness centrality. In Proceedings of the 2012 ASE/IEEE International Conference on SocialComputing and 2012 ASE/IEEE International Conference on Privacy, Security, Risk andTrust, SOCIALCOM-PASSAT ’12, pages 11–20, Washington, DC, USA, 2012. IEEE Com-puter Society.

[81] Douglas Gregor and Andrew Lumsdaine. The parallel bgl: A generic library for distributedgraph computations. In In Parallel Object-Oriented Scientific Computing, 2005.

[82] Gremlin. https://github.com/tinkerpop/gremlin/wiki, 2015.

[83] Samuel Grossman, Heiner Litz, and Christos Kozyrakis. Making pull-based graph process-ing performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles andPractice of Parallel Programming, PPoPP ’18, pages 246–260, New York, NY, USA, 2018.ACM.

[84] Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. Botminer: Clustering analysisof network traffic for protocol- and structure-independent botnet detection. In Proceedingsof the 17th Conference on Security Symposium, SS’08, pages 139–154, Berkeley, CA, USA,2008. USENIX Association.

[85] Guofei Gu, Junjie Zhang, and Wenke Lee. Botsniffer: Detecting botnet command and controlchannels in network traffic. In NDSS, 2008.

[86] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. Lubm: A benchmark for owl knowledge basesystems. J. Web Sem., 3(2-3):158–182, 2005.

[87] Arpit Gupta, Rob Harrison, Marco Canini, Nick Feamster, Jennifer Rexford, and WalterWillinger. Sonata: Query-driven streaming network telemetry. In Proceedings of the 2018Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18,pages 357–371, New York, NY, USA, 2018. ACM.

[88] Claudio Gutierrez, Carlos Hurtado, and Ro Vaisman. Temporal rdf. In In EuropeanConference on the Semantic Web, pages 93–107, 2005.

[89] Claudio Gutierrez, Carlos A. Hurtado, and Alejandro Vaisman. Introducing time into rdf.IEEE Trans. on Knowl. and Data Eng., 19(2):207–218, February 2007.

[90] Hadoop. hadoop.apache.org, 2019.

[91] A. Hagberg, N. Lemons, A. Kent, and J. Neil. Connected components and credential hoppingin authentication graphs. In 2014 Tenth International Conference on Signal-Image Technologyand Internet-Based Systems, pages 416–423, Nov 2014.

[92] Diana Haidar, Mohamed Medhat Gaber, and Yevgeniya Kovalchuk. Anythreat: An oppor-tunistic knowledge discovery approach to insider threat detection. CoRR, abs/1812.00257,2018.


[93] Minyang Han and Khuzaima Daudjee. Giraph unchained: Barrierless asynchronous parallelexecution in pregel-like graph processing systems. Proc. VLDB Endow., 8(9):950–961, May2015.

[94] Richard Harang. Learning and semantics. In Alexander Kott, Cliff Wang, and Robert F.Erbacher, editors, Cyber Defense and Situational Awareness, volume 62 of Advances inInformation Security, pages 201–217. Springer International Publishing, 2014.

[95] Heron. https://github.com/apache/incubator-heron, 2019.

[96] Martin Hirzel, Scott Schneider, and Bugra Gedik. Spl: An extensible language for distributedstream processing. ACM Trans. Program. Lang. Syst., 39(1):5:1–5:39, March 2017.

[97] Emilie Hogan, John R. Johnson, and Mahantesh Halappanavar. Graph coarsening for pathfinding in cybersecurity graphs. In Proceedings of the Eighth Annual Cyber Security andInformation Intelligence Research Workshop, CSIIRW ’13, pages 7:1–7:4, New York, NY,USA, 2013. ACM.

[98] Sungpack Hong, Hassan Chafi, Edic Sedlar, and Kunle Olukotun. Green-marl: A dsl for easyand efficient graph analysis. In Proceedings of the Seventeenth International Conference onArchitectural Support for Programming Languages and Operating Systems, ASPLOS XVII,pages 349–362, New York, NY, USA, 2012. ACM.

[99] Sungpack Hong, Semih Salihoglu, Jennifer Widom, and Kunle Olukotun. Simplifying scalablegraph processing with a domain-specific language. In Proceedings of Annual IEEE/ACMInternational Symposium on Code Generation and Optimization, CGO ’14, pages 208:208–208:218, New York, NY, USA, 2014. ACM.

[100] HP. Cyber risk report 2015, 2015.

[101] Y. Hu, D. Chiu, and J. C. S. Lui. Application identification based on network behavioralprofiles. In 2008 16th Interntional Workshop on Quality of Service, pages 219–228, June 2008.

[102] Danny Yuxing Huang, Hitesh Dharmdasani, Sarah Meiklejohn, Vacha Dave, Chris Grier,Damon McCoy, Stefan Savage, Nicholas Weaver, Alex C. Snoeren, and Kirill Levchenko.Botcoin: Monetizing stolen cycles. In 21st Annual Network and Distributed System SecuritySymposium, NDSS 2014, San Diego, California, USA, February 23-26, 2014, 2014.

[103] T. Indulekha, G. Aswathy, and P. Sudhakaran. A graph based algorithm for clustering andranking proteins for identifying disease causing genes. In 2018 International Conference onAdvances in Computing, Communications and Informatics (ICACCI), pages 1022–1026, Sep.2018.

[104] SANS Institute. Confirmation of a coordinated attack on the ukrainian power grid, January2016.

[105] Z. Jadidi, V. Muthukkumarasamy, E. Sithirasenan, and M. Sheikhan. Flow-based anomalydetection using neural network optimized with gsa algorithm. In 2013 IEEE 33rd InternationalConference on Distributed Computing Systems Workshops, pages 76–81, July 2013.


[106] Z. Jadidi, V. Muthukkumarasamy, E. Sithirasenan, and M. Sheikhan. Flow-based anomalydetection using neural network optimized with gsa algorithm. In 2013 IEEE 33rd InternationalConference on Distributed Computing Systems Workshops, pages 76–81, July 2013.

[107] Zahra Jadidi, Vallipuram Muthukkumarasamy, Elankayer Sithirasenan, and Kalvinder Singh.Performance of flow-based anomaly detection in sampled traffic. JNW, 10(9):512–520, 2015.

[108] Mattijs Jonker, Alistair King, Johannes Krupp, Christian Rossow, Anna Sperotto, and Al-berto Dainotti. Millions of targets under attack: A macroscopic characterization of the dosecosystem. In Proceedings of the 2017 Internet Measurement Conference, IMC ’17, pages100–113, New York, NY, USA, 2017. ACM.

[109] Cliff Joslyn, Sutanay Choudhury, David Haglin, Bill Howe, Bill Nickless, and Bryan Olsen.Massive scale cyber traffic analysis: A driver for graph database research. In FirstInternational Workshop on Graph Data Management Experiences and Systems, GRADES’13, pages 3:1–3:6, New York, NY, USA, 2013. ACM.

[110] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. Pegasus: Mining peta-scalegraphs. Knowl. Inf. Syst., 27(2):303–325, May 2011.

[111] A. M. Karimi, Q. Niyaz, Weiqing Sun, A. Y. Javaid, and V. K. Devabhaktuni. Distributednetwork traffic feature extraction for a real-time ids. In 2016 IEEE International Conferenceon Electro Information Technology (EIT), pages 0522–0526, May 2016.

[112] David Kennedy, Jim O’Gorman, Devon Kearns, and Mati Aharoni. Metasploit: ThePenetration Tester’s Guide. No Starch Press, San Francisco, CA, USA, 1st edition, 2011.

[113] Alexander D. Kent, Lorie M. Liebrock, and Joshua C. Neil. Authentication graphs: Analyzinguser behavior within an enterprise network. Computers & Security, 48:150 – 166, 2015.

[114] J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Society forIndustrial and Applied Mathematics, 2011.

[115] S. Khattak, N.R. Ramay, K.R. Khan, A.A. Syed, and S.A. Khayam. A taxonomy of botnetbehavior, detection, and defense. Communications Surveys Tutorials, IEEE, 16(2):898–924,Second 2014.

[116] Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan. Cusha: Vertex-centric graph processing on gpus. In Proceedings of the 23rd International Symposium onHigh-performance Parallel and Distributed Computing, HPDC ’14, pages 239–252, New York,NY, USA, 2014. ACM.

[117] Knoesis. Linkedsensordata, 2010.

[118] P. Kumar and H. H. Huang. G-store: High-performance graph store for trillion-edge process-ing. In SC ’16: Proceedings of the International Conference for High Performance Computing,Networking, Storage and Analysis, pages 830–841, Nov 2016.

[119] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. Graphchi: Large-scale graph computationon just a pc. In Proceedings of the 10th USENIX Symposium on Operating Systems Designand Implementation (OSDI ’12), Hollywood, October 2012.


[120] Danh Le-Phuoc, Minh Dao-Tran, Josiane Xavier Parreira, and Manfred Hauswirth. A nativeand adaptive approach for unified processing of linked streams and linked data. In Proceedingsof the 10th International Conference on The Semantic Web - Volume Part I, ISWC’11, pages370–388, Berlin, Heidelberg, 2011. Springer-Verlag.

[121] Danh Le-Phuoc, Minh Dao-Tran, Minh-Duc Pham, Peter Boncz, Thomas Eiter, and MichaelFink. Linked stream data processing engines: Facts and figures. In Proceedings of the 11thInternational Conference on The Semantic Web - Volume Part II, ISWC’12, pages 300–312,Berlin, Heidelberg, 2012. Springer-Verlag.

[122] Jinsoo Lee, Wook-Shin Han, Romans Kasperovics, and Jeong-Hoon Lee. An in-depth com-parison of subgraph isomorphism algorithms in graph databases. In Proceedings of the 39thinternational conference on Very Large Data Bases, PVLDB’13, pages 133–144. VLDB En-dowment, 2013.

[123] Yeonhee Lee and Youngseok Lee. Toward scalable internet traffic measurement and analysiswith hadoop. SIGCOMM Comput. Commun. Rev., 43(1):5–13, January 2012.

[124] Bingdong Li, Jeff Springer, George Bebis, and Mehmet Hadi Gunes. A survey of networkflow applications. Journal of Network and Computer Applications, 36(2):567 – 581, 2013.

[125] Yuliang Li, Rui Miao, Changhoon Kim, and Minlan Yu. Flowradar: A better netflow for datacenters. In 13th USENIX Symposium on Networked Systems Design and Implementation(NSDI 16), pages 311–324, Santa Clara, CA, 2016. USENIX Association.

[126] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, andJoseph M. Hellerstein. Distributed graphlab: A framework for machine learning and datamining in the cloud. Proc. VLDB Endow., 5(8):716–727, April 2012.

[127] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, andJoseph M. Hellerstein. Graphlab: A new framework for parallel machine learning. CoRR,abs/1006.4990, 2010.

[128] Adam Lugowski, David Alber, Aydın Buluc, John R. Gilbert, Steve Reinhardt, Yun Teng,and Andrew Waranis. A flexible open-source toolbox for scalable complex graph analysis. InProceedings of the Twelfth SIAM International Conference on Data Mining (SDM12), pages930–941, April 2012.

[129] Dmitriy Lyubimov and Andrew Palumbo. Apache Mahout: Beyond MapReduce. CreateSpaceIndependent Publishing Platform, USA, 1st edition, 2016.

[130] P. Raghavan M. Henzinger and S. Rajagopalan. Computing on data stream, May 1998.

[131] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, NatyLeiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. InProceedings of the 2010 ACM SIGMOD International Conference on Management of Data,SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM.

[132] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over datastreams. In Proceedings of the 28th International Conference on Very Large Data Bases,VLDB ’02, pages 346–357. VLDB Endowment, 2002.


[133] S. Marchal, X. Jiang, R. State, and T. Engel. A big data architecture for large scale securitymonitoring. In 2014 IEEE International Congress on Big Data, pages 56–63, June 2014.

[134] M. Mayhew, M. Atighetchi, A. Adler, and R. Greenstadt. Use of machine learning inbig data analytics for insider threat detection. In MILCOM 2015 - 2015 IEEE MilitaryCommunications Conference, pages 915–922, Oct 2015.

[135] Brian M. Mazanec. Why international order in cyberspace is not inevitable. Strategic StudiesQuarterly, 9(2), Summer 2015.

[136] R. McColl, O. Green, and D.A. Bader. A new parallel algorithm for connected compo-nents in dynamic graphs. In High Performance Computing (HiPC), 2013 20th InternationalConference on, pages 246–255, Dec 2013.

[137] John McCrae. The linked open data cloud, 2019.

[138] Christopher D. McDermott, Farzan Majdani, and Andrei Petrovski. Botnet detection in theinternet of things using deep learning approaches. In 2018 International Joint Conference onNeural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, pages 1–8, 2018.

[139] Frank McSherry, Michael Isard, and Derek G. Murray. Scalability! but at what cost? InProceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HO-TOS’15, pages 14–14, Berkeley, CA, USA, 2015. USENIX Association.

[140] Natarajan Meghanathan, editor. Graph Theoretic Approaches for Analyzing Large-ScaleSocial Networks. IGI Global, Hershey, PA, 2018.

[141] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Why go logarithmic if we cango linear?: Towards effective distinct counting of search traffic. In Proceedings of the11th International Conference on Extending Database Technology: Advances in DatabaseTechnology, EDBT ’08, pages 618–629, New York, NY, USA, 2008. ACM.

[142] Mark S. Miller and Eric K. Drexler. Markets and computation: Agoric open sys-tems. In Bernardo Huberman, editor, The Ecology of Computation. Elsevier SciencePublishers/North-Holland, 1988.

[143] Information Warfare Monitor. Tracking ghostnet: Investigating a cyber espionage network.Technical report, Information Warfare Monitor, 2009.

[144] S. Muthukrishnan. Data streams: Algorithms and applications. Found. Trends Theor.Comput. Sci., 1(2):117–236, August 2005.

[145] Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, and Nikita Borisov. Bot-grep: Finding p2p bots with structured graph analysis. In Proceedings of the 19th USENIXConference on Security, USENIX Security’10, pages 7–7, Berkeley, CA, USA, 2010. USENIXAssociation.

[146] Akihiro NAKAO and Ping DU. Toward in-network deep machine learning for identifyingmobile applications and enabling application specific network slicing. IEICE Transactions onCommunications, E101.B, 01 2018.


[147] Srinivas Narayana, Anirudh Sivaraman, Vikram Nathan, Prateesh Goyal, Venkat Arun, Mo-hammad Alizadeh, Vimalkumar Jeyakumar, and Changhoon Kim. Language-directed hard-ware design for network performance monitoring. In Proceedings of the Conference of theACM Special Interest Group on Data Communication, SIGCOMM ’17, pages 85–98, NewYork, NY, USA, 2017. ACM.

[148] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan,and Mark Oskin. Latency-tolerant software distributed shared memory. In 2015 USENIXAnnual Technical Conference (USENIX ATC 15), pages 291–305, Santa Clara, CA, July 2015.USENIX Association.

[149] Jacob Nelson, Brandon Myers, A. H. Hunter, Preston Briggs, Luis Ceze, Carl Ebeling, DanGrossman, Simon Kahan, and Mark Oskin. Crunching large graphs with commodity proces-sors. In Proceedings of the 3rd USENIX Conference on Hot Topic in Parallelism, HotPar’11,pages 10–10, Berkeley, CA, USA, 2011. USENIX Association.

[150] Neo4j. Neo4j - the worlds leading graph database, 2012.

[151] Barefoot Networks. Barefoot tofino. https://barefootnetworks.com/products/brief-tofino/.

[152] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graphanalytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating SystemsPrinciples, SOSP ’13, pages 456–471, New York, NY, USA, 2013. ACM.

[153] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graphanalytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating SystemsPrinciples, SOSP ’13, pages 456–471, New York, NY, USA, 2013. ACM.

[154] S. Noel, E. Harley, K.H. Tam, M. Limiero, and M. Share. Chapter 4 - cygraph: Graph-basedanalytics and visualization for cybersecurity. In Venkat N. Gudivada, Vijay V. Raghavan,Venu Govindaraju, and C.R. Rao, editors, Cognitive Computing: Theory and Applications,volume 35 of Handbook of Statistics, pages 117 – 167. Elsevier, 2016.

[155] Steven Noel, Eric Harley, Kam Him Tam, and Greg Gyor. Big-data architecture for cyberattack graphs representing security relationships in nosql graph databases. 2016.

[156] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martnez, andC. Guestrin. Graphgen: An fpga framework for vertex-centric graph computation. In 2014IEEE 22nd Annual International Symposium on Field-Programmable Custom ComputingMachines, pages 25–28, May 2014.

[157] Joseph S. Nye. Cyber power. Belfer Center for Science and International Affairs, HarvardKennedy School, May 2010.

[158] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins.Pig latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACMSIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110,New York, NY, USA, 2008. ACM.

[159] Sreepathi Pai and Keshav Pingali. A compiler for throughput optimization of graph al-gorithms on gpus. In Proceedings of the 2016 ACM SIGPLAN International Conference on


Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2016, pages1–19, New York, NY, USA, 2016. ACM.

[160] Sreepathi Pai and Keshav Pingali. A compiler for throughput optimization of graph algorithms on gpus. SIGPLAN Not., 51(10):1–19, October 2016.

[161] Y. Pan, Y. Wang, Y. Wu, C. Yang, and J. D. Owens. Multi-gpu graph analytics. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 479–490, May 2017.

[162] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[163] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[164] G. Pellegrino, Q. Lin, C. Hammerschmidt, and S. Verwer. Learning behavioral fingerprints from netflows using timed automata. In 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pages 308–316, May 2017.

[165] Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Lidong Zhou, and Maya Haridasan. Managing large graphs on multi-cores with graph awareness. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC’12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.

[166] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C Recommendation, January 2008. http://www.w3.org/TR/rdf-sparql-query/.

[167] Paulo Angelo Alves Resende and Andre Costa Drummond. A survey of random forest based methods for intrusion detection systems. ACM Comput. Surv., 51(3):48:1–48:36, May 2018.

[168] Robert Ricci, Eric Eide, and The CloudLab Team. Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications. USENIX ;login:, 39(6), December 2014.

[169] Amitabha Roy, Laurent Bindschaedler, Jasmina Malicevic, and Willy Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. 2015.

[170] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 472–488, New York, NY, USA, 2013. ACM.

[171] Semih Salihoglu and Jennifer Widom. Gps: A graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management, SSDBM, pages 22:1–22:12, New York, NY, USA, 2013. ACM.

[172] Samza. samza.apache.org, 2019.


[173] J. J. Santanna, R. van Rijswijk-Deij, R. Hofstede, A. Sperotto, M. Wierbosch, L. Z. Granville, and A. Pras. Booters - an analysis of ddos-as-a-service attacks. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), pages 243–251, May 2015.

[174] Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 979–990, New York, NY, USA, 2014. ACM.

[175] Scala parser combinator. https://github.com/scala/scala-parser-combinators, 2017. [Online; accessed October-2017].

[176] Hyunseok Seo, Jinwook Kim, and Min-Soo Kim. Gstream: A graph streaming processing method for large-scale graphs on gpus. SIGPLAN Not., 50(8):253–254, January 2015.

[177] Hyunseok Seo, Jinwook Kim, and Min-Soo Kim. Gstream: A graph streaming processing method for large-scale graphs on gpus. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 253–254, New York, NY, USA, 2015. ACM.

[178] M. Shao, J. Li, F. Chen, and X. Chen. An efficient framework for detecting evolving anomalous subgraphs in dynamic networks. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pages 2258–2266, April 2018.

[179] Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen, and Feifei Li. Fast and concurrent rdf queries with rdma-based distributed graph exploration. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 317–332, Berkeley, CA, USA, 2016. USENIX Association.

[180] Xiaokui Shu, Ke Tian, Andrew Ciambrone, and Danfeng Yao. Breaking the target: An analysis of target data breach and lessons learned. CoRR, abs/1701.04940, 2017.

[181] Julian Shun and Guy E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. SIGPLAN Not., 48(8):135–146, February 2013.

[182] Jeremy G. Siek, Lie-Quan Lee, and Andrew Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley Professional, 2001.

[183] Sérgio S.C. Silva, Rodrigo M.P. Silva, Raquel C.G. Pinto, and Ronaldo M. Salles. Botnets: A survey. Computer Networks, 57(2):378–403, 2013. Botnet Activity: Analysis, Detection and Shutdown.

[184] Kamaldeep Singh, Sharath Chandra Guntuku, Abhishek Thakur, and Chittaranjan Hota. Big data analytics framework for peer-to-peer botnet detection using random forests. Information Sciences, 278:488–497, 2014.

[185] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In 2010 IEEE Symposium on Security and Privacy, pages 305–316, May 2010.


[186] John Sonchack, Adam J. Aviv, Eric Keller, and Jonathan M. Smith. Turboflow: Information rich flow record generation on commodity switches. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, pages 11:1–11:16, New York, NY, USA, 2018. ACM.

[187] John Sonchack, Oliver Michel, Adam J. Aviv, Eric Keller, and Jonathan M. Smith. Scaling hardware accelerated network monitoring to concurrent and dynamic queries with *flow. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 823–835, Boston, MA, 2018. USENIX Association.

[188] Spark. spark.apache.org, 2019.

[189] Anna Sperotto, Ramin Sadre, Frank van Vliet, and Aiko Pras. A Labeled Data Set for Flow-Based Intrusion Detection, pages 39–50. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[190] Storm. storm.apache.org, 2019.

[191] X. Sun, Y. Tan, Q. Wu, and J. Wang. Hasse diagram based algorithm for continuous temporal subgraph query in graph stream. In 2017 6th International Conference on Computer Science and Network Technology (ICCSNT), pages 241–246, Oct 2017.

[192] Narayanan Sundaram, Nadathur Rajagopalan Satish, Md. Mostofa Ali Patwary, Subramanya Dulloor, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. Graphmat: High performance graph analytics made productive. CoRR, abs/1503.07241, 2015.

[193] J. Susymary and R. Lawrance. Graph theory analysis of protein-protein interaction network and graph based clustering of proteins linked with zika virus using mcl algorithm. In 2017 International Conference on Circuit, Power and Computing Technologies (ICCPCT), pages 1–7, April 2017.

[194] Symantec. Messagelabs intelligence report, March 2011.

[195] Symantec. Internet security threat report, 2018.

[196] R. Talpade, G. Kim, and S. Khurana. Nomad: traffic-based network monitoring framework for anomaly detection. In Proceedings IEEE International Symposium on Computers and Communications (Cat. No.PR00250), pages 442–451, July 1999.

[197] Jonas Tappolet and Abraham Bernstein. Applied temporal rdf: Efficient temporal querying of rdf data with sparql. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 308–322, Berlin, Heidelberg, 2009. Springer-Verlag.

[198] Florian Tegeler, Xiaoming Fu, Giovanni Vigna, and Christopher Kruegel. Botfinder: Finding bots in network traffic without deep packet inspection. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT ’12, pages 349–360, New York, NY, USA, 2012. ACM.

[199] Florian Tegeler, Xiaoming Fu, Giovanni Vigna, and Christopher Kruegel. Botfinder: Finding bots in network traffic without deep packet inspection. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT ’12, pages 349–360, New York, NY, USA, 2012. ACM.


[200] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.

[201] VAST. Vast challenge 2013: Mini-challenge 3. http://vacommunity.org/VAST+Challenge+2013%3A+Mini-Challenge+3, 2013. [Online; accessed October-2017].

[202] Alexei Vazquez, Joao Gama Oliveira, Zoltan Dezso, Kwang-Il Goh, Imre Kondor, and Albert-Laszlo Barabasi. Modeling bursts and heavy tails in human dynamics. Phys. Rev. E, 73:036127, Mar 2006.

[203] Verizon. 2018 data breach investigations report, 2018.

[204] W3C. Rdf reification, 2019.

[205] David Y. Wang, Stefan Savage, and Geoffrey M. Voelker. Juice: A longitudinal study of an seo botnet. In NDSS. The Internet Society, 2013.

[206] P. Wang, F. Wang, F. Lin, and Z. Cao. Identifying peer-to-peer botnets through periodicity behavior analysis. In 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pages 283–288, Aug 2018.

[207] Gregory C. Wilshusen. Cybersecurity: A better defined and implemented national strategy is needed to address persistent challenges, 2013.

[208] Wireshark. Wireshark.

[209] James Wyke. The zeroaccess botnet: Mining and fraud for massive financial gain. In Sophos Technical Paper, 2012.

[210] Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. Graphx: A resilient distributed graph system on spark. In First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pages 2:1–2:6, New York, NY, USA, 2013. ACM.

[211] D. Yan, Y. Huang, M. Liu, H. Chen, J. Cheng, H. Wu, and C. Zhang. Graphd: Distributed vertex-centric graph processing beyond the memory limit. IEEE Transactions on Parallel and Distributed Systems, 29(1):99–114, Jan 2018.

[212] ChengGuo Yin, ShuangQing Li, and Qi Li. Network traffic classification via hmm under the guidance of syntactic structure. Computer Networks, 56(6):1814–1825, 2012.

[213] Danny Yuxing Huang, Hitesh Dharmdasani, Sarah Meiklejohn, Vacha Dave, Chris Grier, Damon McCoy, Stefan Savage, Nicholas Weaver, Alex Snoeren, and Kirill Levchenko. Botcoin: Monetizing stolen cycles. In Proceedings of the Network and Distributed System Security Symposium, NDSS ’14, 2014.

[214] Sebastian Zander, Thuy Nguyen, and Grenville Armitage. Self-learning ip traffic classification based on statistical flow characteristics. In Constantinos Dovrolis, editor, Passive and Active Network Measurement, pages 325–328, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.


[215] Ying Zhang, Pham Minh Duc, Oscar Corcho, and Jean-Paul Calbimonte. Srbench: A streaming rdf/sparql benchmark. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, ISWC’12, pages 641–657, Berlin, Heidelberg, 2012. Springer-Verlag.

[216] Yunhao Zhang, Rong Chen, and Haibo Chen. Sub-millisecond stateful stream querying over fast-evolving linked data. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, pages 614–630, New York, NY, USA, 2017. ACM.

[217] Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman P. Amarasinghe. Graphit - A high-performance DSL for graph analytics. CoRR, abs/1805.00923, 2018.

[218] Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum. Botgraph: Large scale spamming botnet detection. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI’09, pages 321–334, Berkeley, CA, USA, 2009. USENIX Association.

[219] Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S. Szalay. Flashgraph: Processing billion-node graphs on an array of commodity ssds. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 45–58, Santa Clara, CA, 2015. USENIX Association.

[220] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 301–316, Savannah, GA, 2016. USENIX Association.

[221] Yunyue Zhu and Dennis Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 358–369. VLDB Endowment, 2002.

[222] M. Zolotukhin, T. Hämäläinen, T. Kokkonen, and J. Siltanen. Online detection of anomalous network flows with soft clustering. In 2015 7th International Conference on New Technologies, Mobility and Security (NTMS), pages 1–5, July 2015.

[223] Javier Álvarez Cid-Fuentes, Claudia Szabo, and Katrina Falkner. An adaptive framework for the detection of novel botnets. Computers & Security, 79:148–161, 2018.

Appendix A

SAL and Flink Programs

This Appendix contains complete code samples of SAL and Apache Flink programs used in this work.

A.1 Machine Learning Example

This SAL query is used on the CTU-13 dataset [63] as discussed in Chapter 6.

Listing A.1: Machine Learning Program

Edges = VastStream("localhost", 9999);
PARTITION Edges BY SourceIp, DestIp;
HASH SourceIp WITH IpHashFunction;
HASH DestIp WITH IpHashFunction;
VerticesByDest = STREAM Edges BY DestIp;
VerticesBySrc = STREAM Edges BY SourceIp;
feature0 = FOREACH VerticesBySrc GENERATE ave(SrcTotalBytes, 10000, 2);
feature1 = FOREACH VerticesBySrc GENERATE var(SrcTotalBytes, 10000, 2);
feature2 = FOREACH VerticesBySrc GENERATE ave(DestTotalBytes, 10000, 2);
feature3 = FOREACH VerticesBySrc GENERATE var(DestTotalBytes, 10000, 2);
feature4 = FOREACH VerticesBySrc GENERATE ave(DurationSeconds, 10000, 2);
feature5 = FOREACH VerticesBySrc GENERATE var(DurationSeconds, 10000, 2);
feature6 = FOREACH VerticesBySrc GENERATE ave(SrcPayloadBytes, 10000, 2);
feature7 = FOREACH VerticesBySrc GENERATE var(SrcPayloadBytes, 10000, 2);
feature8 = FOREACH VerticesBySrc GENERATE ave(DestPayloadBytes, 10000, 2);
feature9 = FOREACH VerticesBySrc GENERATE var(DestPayloadBytes, 10000, 2);
feature10 = FOREACH VerticesBySrc GENERATE ave(SrcPacketCount, 10000, 2);
feature11 = FOREACH VerticesBySrc GENERATE var(SrcPacketCount, 10000, 2);
feature12 = FOREACH VerticesBySrc GENERATE ave(DestPacketCount, 10000, 2);
feature13 = FOREACH VerticesBySrc GENERATE var(DestPacketCount, 10000, 2);
feature14 = FOREACH VerticesByDest GENERATE ave(SrcTotalBytes, 10000, 2);
feature15 = FOREACH VerticesByDest GENERATE var(SrcTotalBytes, 10000, 2);
feature16 = FOREACH VerticesByDest GENERATE ave(DestTotalBytes, 10000, 2);
feature17 = FOREACH VerticesByDest GENERATE var(DestTotalBytes, 10000, 2);
feature18 = FOREACH VerticesByDest GENERATE ave(DurationSeconds, 10000, 2);
feature19 = FOREACH VerticesByDest GENERATE var(DurationSeconds, 10000, 2);
feature20 = FOREACH VerticesByDest GENERATE ave(SrcPayloadBytes, 10000, 2);
feature21 = FOREACH VerticesByDest GENERATE var(SrcPayloadBytes, 10000, 2);
feature22 = FOREACH VerticesByDest GENERATE ave(DestPayloadBytes, 10000, 2);
feature23 = FOREACH VerticesByDest GENERATE var(DestPayloadBytes, 10000, 2);
feature24 = FOREACH VerticesByDest GENERATE ave(SrcPacketCount, 10000, 2);
feature25 = FOREACH VerticesByDest GENERATE var(SrcPacketCount, 10000, 2);
feature26 = FOREACH VerticesByDest GENERATE ave(DestPacketCount, 10000, 2);
feature27 = FOREACH VerticesByDest GENERATE var(DestPacketCount, 10000, 2);
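Each statement above defines one feature per vertex as an aggregate over a sliding window of netflows keyed by source or destination IP. Purely as an illustration, the sketch below computes an exact per-key windowed average in Java; it assumes the first numeric argument of ave is the window length in edges (the remaining argument is ignored here), whereas SAM itself maintains approximate sliding-window summaries rather than storing every value.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Exact per-key sliding-window average, for illustration only.
public class SlidingAverage {

    private final int windowSize;
    private final Map<String, Deque<Double>> windows = new HashMap<>();
    private final Map<String, Double> sums = new HashMap<>();

    public SlidingAverage(int windowSize) {
        this.windowSize = windowSize;
    }

    // Add one observation for a key (e.g., a source IP) and return the
    // average over at most the last windowSize observations for that key.
    public double update(String key, double value) {
        Deque<Double> window = windows.computeIfAbsent(key, k -> new ArrayDeque<>());
        double sum = sums.getOrDefault(key, 0.0) + value;
        window.addLast(value);
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // evict the oldest observation
        }
        sums.put(key, sum);
        return sum / window.size();
    }
}

A var feature is analogous, additionally maintaining a windowed sum of squares.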

A.2 Apache Flink Triangle Implementation

This is the full Apache Flink program used to compare against the SAL triangle query as discussed in Section 7.2. We present the Netflow class, the NetflowSource class, and the actual benchmark code. The Netflow class represents the VAST netflow format [201]. NetflowSource extends the RichParallelSourceFunction class, which allows netflow data to be generated in parallel across the Flink cluster. Only the benchmark code was counted in the total line count for the Apache Flink implementation.


Listing A.2: Apache Flink Netflow Class

package edu.cu.flink.benchmarks;

/**
 * A single netflow record in the VAST format [201].
 */
public class Netflow {
  public int samGeneratedId;
  public int label;
  public double timeSeconds;
  public String parseDate;
  public String dateTimeString;
  public String protocol;
  public String protocolCode;
  public String sourceIp;
  public String destIp;
  public int sourcePort;
  public int destPort;
  public String moreFragments;
  public int countFragments;
  public double durationSeconds;
  public long srcPayloadBytes;
  public long destPayloadBytes;
  public long srcTotalBytes;
  public long destTotalBytes;
  public long firstSeenSrcPacketCount;
  public long firstSeenDestPacketCount;
  public int recordForceOut;

  public Netflow(int samGeneratedId,
                 int label,
                 double timeSeconds,
                 String parseDate,
                 String dateTimeString,
                 String protocol,
                 String protocolCode,
                 String sourceIp,
                 String destIp,
                 int sourcePort,
                 int destPort,
                 String moreFragments,
                 int countFragments,
                 double durationSeconds,
                 long srcPayloadBytes,
                 long destPayloadBytes,
                 long srcTotalBytes,
                 long destTotalBytes,
                 long firstSeenSrcPacketCount,
                 long firstSeenDestPacketCount,
                 int recordForceOut)
  {
    this.samGeneratedId = samGeneratedId;
    this.label = label;
    this.timeSeconds = timeSeconds;
    this.parseDate = parseDate;
    this.dateTimeString = dateTimeString;
    this.protocol = protocol;
    this.protocolCode = protocolCode;
    this.sourceIp = sourceIp;
    this.destIp = destIp;
    this.sourcePort = sourcePort;
    this.destPort = destPort;
    this.moreFragments = moreFragments;
    this.countFragments = countFragments;
    this.durationSeconds = durationSeconds;
    this.srcPayloadBytes = srcPayloadBytes;
    this.destPayloadBytes = destPayloadBytes;
    this.srcTotalBytes = srcTotalBytes;
    this.destTotalBytes = destTotalBytes;
    this.firstSeenSrcPacketCount = firstSeenSrcPacketCount;
    this.firstSeenDestPacketCount = firstSeenDestPacketCount;
    this.recordForceOut = recordForceOut;
  }

  public String toString()
  {
    return timeSeconds + "," + sourceIp + "," + destIp;
  }
}

Listing A.3: Apache Flink NetflowSource Class

package edu.cu.flink.benchmarks;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.functions.source.*;
import org.apache.flink.streaming.api.watermark.Watermark;

import java.time.Instant;
import java.util.Random;

/**
 * This creates netflows from a pool of IPs.
 */
public class NetflowSource extends RichParallelSourceFunction<Netflow> {

  /// The number of netflows to create.
  private int numEvents;

  /// How many netflows created.
  private int currentEvent;

  /// How many unique IPs in the pool
  private int numIps;

  /// Used to create the netflows with randomly selected ips
  private Random random;

  /// Time from start of run
  private double time = 0;

  /// The time in seconds between netflows
  private double increment = 0.1;

  /// When the first netflow was created
  private double t1;

  public int samGeneratedId;
  public int label;
  public double timeSeconds;
  public String parseDate;
  public String dateTimeString;
  public String protocol;
  public String protocolCode;
  public String sourceIp;
  public String destIp;
  public int sourcePort;
  public int destPort;
  public String moreFragments;
  public int countFragments;
  public double durationSeconds;
  public long srcPayloadBytes;
  public long destPayloadBytes;
  public long srcTotalBytes;
  public long destTotalBytes;
  public long firstSeenSrcPacketCount;
  public long firstSeenDestPacketCount;
  public int recordForceOut;

  /**
   * @param numEvents The number of netflows to produce.
   * @param numIps The number of ips in the pool.
   * @param rate The rate of netflows produced in netflows/s.
   */
  public NetflowSource(int numEvents, int numIps, double rate)
  {
    this.numEvents = numEvents;
    currentEvent = 0;
    this.numIps = numIps;
    this.increment = 1 / rate;
    // this.random = new Random();

    label = 0;
    parseDate = "parseDate";
    dateTimeString = "dateTimeString";
    protocol = "protocol";
    protocolCode = "protocolCode";
    sourcePort = 80;
    destPort = 80;
    moreFragments = "moreFragments";
    countFragments = 0;
    durationSeconds = 0.1;
    srcPayloadBytes = 10;
    destPayloadBytes = 10;
    srcTotalBytes = 15;
    destTotalBytes = 15;
    firstSeenSrcPacketCount = 5;
    firstSeenDestPacketCount = 5;
    recordForceOut = 0;
  }

  @Override
  public void run(SourceContext<Netflow> out) throws Exception
  {
    RuntimeContext context = getRuntimeContext();
    int taskId = context.getIndexOfThisSubtask();

    if (currentEvent == 0) {
      t1 = ((double) Instant.now().toEpochMilli()) / 1000;
      this.random = new Random(taskId);
    }

    while (currentEvent < numEvents)
    {
      long t = Instant.now().toEpochMilli();
      double currentTime = ((double) t) / 1000;
      if (currentEvent % 1000 == 0) {
        double expectedTime = currentEvent * increment;
        double actualTime = currentTime - t1;
        System.out.println("Expected time: " + expectedTime +
          " Actual time: " + actualTime);
      }

      double diff = currentTime - t1;
      if (diff < currentEvent * increment)
      {
        long numMilliseconds = (long) (currentEvent *
          increment - diff) * 1000;
        Thread.sleep(numMilliseconds);
      }

      int source = random.nextInt(numIps);
      int dest = random.nextInt(numIps);
      String sourceIp = "node" + source;
      String destIp = "node" + dest;
      Netflow netflow = new Netflow(currentEvent, label, time,
        parseDate, dateTimeString,
        protocol, protocolCode,
        sourceIp, destIp, sourcePort,
        destPort, moreFragments,
        countFragments,
        durationSeconds, srcPayloadBytes,
        destPayloadBytes, srcTotalBytes,
        destTotalBytes, firstSeenSrcPacketCount,
        firstSeenDestPacketCount,
        recordForceOut);
      out.collectWithTimestamp(netflow, t);
      out.emitWatermark(new Watermark(t));
      currentEvent++;
      time += increment;
    }
  }

  @Override
  public void cancel()
  {
    this.currentEvent = numEvents;
  }
}

Listing A.4: Apache Flink Triangle Benchmark

package edu.cu.flink.benchmarks;

import org.apache.commons.cli.*;
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.*;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.assigners.*;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * Benchmark that finds temporal triangles in a stream of netflows using
 * two interval joins: netflows are first joined into triads (paths of two
 * edges), and triads are then joined with a closing edge.
 */
public class BenchmarkTrianglesNetflow {

  private static class SourceKeySelector
    implements KeySelector<Netflow, String>
  {
    @Override
    public String getKey(Netflow edge) {
      return edge.sourceIp;
    }
  }

  private static class DestKeySelector
    implements KeySelector<Netflow, String>
  {
    @Override
    public String getKey(Netflow edge) {
      return edge.destIp;
    }
  }

  private static class LastEdgeKeySelector
    implements KeySelector<Netflow, Tuple2<String, String>>
  {
    @Override
    public Tuple2<String, String> getKey(Netflow e1)
    {
      return new Tuple2<String, String>(e1.destIp, e1.sourceIp);
    }
  }

  private static class Triad
  {
    Netflow e1;
    Netflow e2;

    public Triad(Netflow e1, Netflow e2) {
      this.e1 = e1;
      this.e2 = e2;
    }

    public String toString()
    {
      String str = e1.toString() + " " + e2.toString();
      return str;
    }
  }

  /**
   * Key selector that returns a tuple with the source of the first
   * edge and the destination of the second edge.
   */
  private static class TriadKeySelector
    implements KeySelector<Triad, Tuple2<String, String>>
  {
    @Override
    public Tuple2<String, String> getKey(Triad triad)
    {
      return new Tuple2<String, String>(triad.e1.sourceIp, triad.e2.destIp);
    }
  }

  private static class Triangle
  {
    Netflow e1;
    Netflow e2;
    Netflow e3;

    public Triangle(Netflow e1, Netflow e2, Netflow e3)
    {
      this.e1 = e1;
      this.e2 = e2;
      this.e3 = e3;
    }

    public String toString()
    {
      String str = e1.toString() + " " + e2.toString() + " " +
        e3.toString();
      return str;
    }
  }

  private static class EdgeJoiner
    extends ProcessJoinFunction<Netflow, Netflow, Triad>
  {
    private double queryWindow;

    public EdgeJoiner(double queryWindow)
    {
      this.queryWindow = queryWindow;
    }

    @Override
    public void processElement(Netflow e1, Netflow e2,
                               Context ctx, Collector<Triad> out)
    {
      if (e1.timeSeconds < e2.timeSeconds) {
        if (e2.timeSeconds - e1.timeSeconds <= queryWindow) {
          out.collect(new Triad(e1, e2));
        }
      }
    }
  }

  private static class TriadJoiner
    extends ProcessJoinFunction<Triad, Netflow, Triangle>
  {
    private double queryWindow;

    public TriadJoiner(double queryWindow)
    {
      this.queryWindow = queryWindow;
    }

    @Override
    public void processElement(Triad triad, Netflow e3,
                               Context ctx, Collector<Triangle> out)
    {
      if (triad.e2.timeSeconds < e3.timeSeconds) {
        if (e3.timeSeconds - triad.e1.timeSeconds <= queryWindow) {
          out.collect(new Triangle(triad.e1, triad.e2, e3));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {

    Options options = new Options();
    Option numNetflowsOption = new Option("nn", "numNetflows", true,
      "Number of netflows to create per source.");
    Option numIpsOption = new Option("nip", "numIps", true,
      "Number of ips in the pool.");
    Option rateOption = new Option("r", "rate", true,
      "The rate that netflows are generated.");
    Option numSourcesOption = new Option("ns", "numSources", true,
      "The number of netflow sources.");
    Option queryWindowOption = new Option("qw", "queryWindow", true,
      "The length of the query in seconds.");
    Option outputFileOption = new Option("out", "outputFile", true,
      "Where the output should go.");
    Option outputNetflowOption = new Option("net", "outputNetflow", true,
      "Where the netflows should go (optional).");
    Option outputTriadOption = new Option("triad", "outputTriads", true,
      "Where the triads should go (optional).");

    numNetflowsOption.setRequired(true);
    numIpsOption.setRequired(true);
    rateOption.setRequired(true);
    numSourcesOption.setRequired(true);
    queryWindowOption.setRequired(true);
    outputFileOption.setRequired(true);

    options.addOption(numNetflowsOption);
    options.addOption(numIpsOption);
    options.addOption(rateOption);
    options.addOption(numSourcesOption);
    options.addOption(queryWindowOption);
    options.addOption(outputFileOption);
    options.addOption(outputNetflowOption);
    options.addOption(outputTriadOption);

    CommandLineParser parser = new DefaultParser();
    HelpFormatter formatter = new HelpFormatter();
    CommandLine cmd = null;

    try {
      cmd = parser.parse(options, args);
    } catch (ParseException e) {
      System.out.println(e.getMessage());
      formatter.printHelp("utility-name", options);

      System.exit(1);
    }

    int numEvents = Integer.parseInt(cmd.getOptionValue("numNetflows"));
    int numIps = Integer.parseInt(cmd.getOptionValue("numIps"));
    double rate = Double.parseDouble(cmd.getOptionValue("rate"));
    int numSources = Integer.parseInt(cmd.getOptionValue("numSources"));
    double queryWindow = Double.parseDouble(
      cmd.getOptionValue("queryWindow"));
    String outputFile = cmd.getOptionValue("outputFile");
    String outputNetflowFile = cmd.getOptionValue("outputNetflow");
    String outputTriadFile = cmd.getOptionValue("outputTriads");

    // get the execution environment
    final StreamExecutionEnvironment env =
      StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(numSources);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    NetflowSource netflowSource = new NetflowSource(numEvents, numIps,
      rate);
    DataStreamSource<Netflow> netflows = env.addSource(netflowSource);

    if (outputNetflowFile != null) {
      netflows.writeAsText(outputNetflowFile,
        FileSystem.WriteMode.OVERWRITE);
    }

    DataStream<Triad> triads = netflows
      .keyBy(new DestKeySelector())
      .intervalJoin(netflows.keyBy(new SourceKeySelector()))
      .between(Time.milliseconds(0),
               Time.milliseconds((long) queryWindow * 1000))
      .process(new EdgeJoiner(queryWindow));

    if (outputTriadFile != null) {
      triads.writeAsText(outputTriadFile,
        FileSystem.WriteMode.OVERWRITE);
    }

    DataStream<Triangle> triangles = triads
      .keyBy(new TriadKeySelector())
      .intervalJoin(netflows.keyBy(new LastEdgeKeySelector()))
      .between(Time.milliseconds(0),
               Time.milliseconds((long) queryWindow * 1000))
      .process(new TriadJoiner(queryWindow));

    triangles.writeAsText(outputFile, FileSystem.WriteMode.OVERWRITE);

    env.execute();
  }

}
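When checking the number of triangles reported by Flink against the expected count, as in the comparison of Section 7.2, it can be convenient to append a running counter to the pipeline rather than counting lines in the text output. The fragment below is a minimal sketch of that idea and is not part of the benchmark as run in this work; it reuses the MapFunction and Tuple2 imports already present in Listing A.4 and would sit in main() after the triangles stream is defined.

// Running count of emitted triangles, keyed on a constant field.
DataStream<Tuple2<String, Integer>> triangleCounts = triangles
    .map(new MapFunction<Triangle, Tuple2<String, Integer>>() {
      @Override
      public Tuple2<String, Integer> map(Triangle t) {
        return new Tuple2<String, Integer>("triangles", 1);
      }
    })
    .keyBy(0)   // every element shares the same key
    .sum(1);    // running total of the count field
triangleCounts.writeAsText(outputFile + ".count",
    FileSystem.WriteMode.OVERWRITE);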