
OPTIMIZING WEB APPLICATION FUZZING WITH GENETIC ALGORITHMS AND LANGUAGE THEORY

BY

SCOTT MICHAEL SEAL

A Thesis Submitted to the Graduate Faculty of

WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES

in Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Computer Science

May 2016

Winston-Salem, North Carolina

Approved By:

Errin Fulp, Ph.D., Advisor

William Turkett, Ph.D., Chair

David John, Ph.D.


Acknowledgments

There are many people who helped to make this thesis possible, and they deserve thanks and gratitude I could never fully provide. First and foremost, thanks to my family for being supportive over the past—what should have been two, but then became three—years. To Errin Fulp, whose patience in the beginning, middle, and end of my academic career kept me afloat, who sparked my interest in computer security, who set me up with career and academic opportunities I did not deserve, who bailed me out of moderately serious (albeit laughable, and ridiculous) trouble...thank you. This research would have never happened if your door were not always open. Thanks to the Wake Forest Computer Science Department, in (no) particular (order): Jennifer Burg, Daniel Canas, Sam Cho, Don Gage, David John, Paul Pauca, Stan Thomas, and William Turkett. Thanks to Todd Torgersen for patiently teaching me things I was already supposed to know, and for spending his free time helping me flesh out the ideas that took this research from "eh" to worthwhile. Finally, a special personal thank-you to Sarah Reehl for putting up with my complaining, reading numerous iterations of this document, and for not supporting my near-daily urge to leave this work unfinished.


Table of Contents

Acknowledgments
List of Figures
List of Tables
Abstract

Chapter 1  Introduction
1.1  Fuzz Testing for Vulnerability Discovery
1.2  Fuzzing with Evolutionary Algorithms
1.3  Contributions

Chapter 2  Fuzzing
2.1  History of Fuzzing and its Fundamental Components
2.2  Present Day Fuzzing
2.2.1  General Purpose Tools and Techniques
2.2.2  Modern Fuzzing
2.2.3  Web Application Fuzzing
2.3  Fuzzing and Genetic Algorithms
2.4  Grammar Fuzzing
2.5  Advantages and Limitations

Chapter 3  Evolutionary Algorithms
3.1  History
3.2  Genetic Algorithms
3.2.1  Genetic Algorithm Components
3.3  CHC
3.4  Problem Domain
3.5  Advantages and Limitations

Chapter 4  Evolutionary Algorithm Web Fuzzing Framework
4.1  Approach
4.1.1  Preprocessing
4.1.2  Attack Grammars
4.1.3  Fitness Evaluation
4.1.4  Niche-penalty Heuristics-based Genetic Algorithm
4.1.5  CHC
4.2  Advantages and Limitations

Chapter 5  Experimental Results
5.1  Testing Environment
5.2  Benchmark Simulation
5.2.1  Random
5.2.2  Markov Model Monte Carlo
5.3  Evaluation Metrics
5.4  Results and Analysis
5.4.1  Fitness and Diversity
5.4.2  Exploits Found

Chapter 6  Conclusion and Future Work

Bibliography

Curriculum Vitae


List of Figures

1.1  OWASP Top 10 web vulnerabilities shows the frequency of web-based injection attacks, and the importance of defending against them
1.2  The steps of the Microsoft Secure Development Life-cycle
2.1  The general steps involved in fuzzing campaigns
2.2  An example excerpt of a peach pit used for Generation-based fuzzing [11]
2.3  Boundary testing recommendations according to Sutton et al. [44]
2.4  A vulnerable input form that can be exploited using SQL injection
2.5  Fitness heuristic categories considered by Seagle [53]
2.6  An excerpt of a manually-written attack grammar for finding Cross-Site Scripting vulnerabilities [26]
3.1  Three traditional crossover methods for creating new chromosomes [3]
4.1  Flow graph of the preprocessing stage
4.2  Example extract of a parse tree derived from positive examples of SQL injection tokens
4.3  A flowchart of the heuristics-based Evolutionary Algorithm fuzzing framework proof-of-concept
5.1  The front page of the testbed used for measuring the effectiveness of niche-penalty GA-based web fuzzing [15]
5.2  Mean fitness per generation
5.3  Median diversity of value and symbol representations per generation for GA and CHC
5.4  Median diversity of value and symbol representations per generation for Random and Markov Model Monte Carlo
5.5  Total unique exploits per simulation and average number of exploits per trial
5.6  Average number of exploits found in each generation


List of Tables

4.1  An example preprocessing of a positive example
5.1  Total exploits per simulation
5.2  Number of trials per simulation type without an exploit
5.3  Highest number of exploit strings found throughout a single trial


Abstract

The widespread availability and use of computing and internet resources require software developers to implement secure development standards and rigorous testing to prevent vulnerabilities. Due to human fallibility, programming errors and logical inconsistencies abound—thus, conventions for testing software are required to ensure the Confidentiality, Integrity, and Availability of sensitive user data. A combination of manual inspection and automated analysis of programs is necessary to achieve this goal. Because of the massive size of many codebases, especially considering the incorporation of third-party software and infrastructure, thorough manual code review by security experts is not always an option. Therefore, effective automated methods for testing software systems are essential.

Fuzz testing is a popular technique for automating the discovery of bugs and security errors in software systems ranging from UNIX utilities to web applications. Although mutation- and generation-based fuzzing have been in use for many years, fuzzers that intelligently manage test case generation are actively being researched. In particular, optimally testing web applications with limited feedback remains elusive. This research presents a use of Evolutionary Algorithms to generate test cases which expose vulnerabilities in web applications. This thesis utilizes grammatically analyzed positive examples of injection strings related to a common web vulnerability in order to build a set of attack grammars that guide fitness metrics and test case generation. In lieu of a manually written, exhaustive attack grammar, the set of attack grammars is automatically derived from positive examples. The efficacy of this algorithm is compared to other methods of solution generation, such as Markov Model Monte Carlo. Finally, two types of Evolutionary Algorithms (a Genetic Algorithm with heuristic-based repopulation criteria and CHC) are implemented in the fuzzing framework and evaluated according to their ability to effectively narrow the search space. The results demonstrate that Evolutionary Algorithms with grammar-based heuristics are able to find solutions that are grammatically similar to, yet distinct from, a corpus of positive examples.


Chapter 1: Introduction

As computing technologies broaden their reach to individual consumers, communities, and corporate industries, providing security and privacy of those resources becomes an important priority. Because of the expansive reach of internet services, and the growing interconnectivity of our world, users expect software companies to provide Confidentiality, Integrity, and Availability of sensitive data. According to a US census report on computer and internet use in the United States, 2013 saw 83.8 percent of American households with ownership of a personal computer, and 73.5 percent with a high speed internet connection [23]. The International Telecommunications Union, an agency under the supervision of the United Nations, found that in 2015, 3.2 billion people were connected to the internet—a 600 percent increase since the year 2000 [28]. With the advent of cloud computing infrastructures and browser-based applications replacing traditional desktop applications, security in the Internet application sphere is a paramount concern of a corporation's security posture. Thus, the development of strategies for discovering errors and vulnerabilities in software is an open area of both academic and industry research.

Computer security concerns have long been a part of the equation in delivering useful and reliable software systems to end users. The first class of vulnerabilities to emerge in consumer software systems were related to memory corruption bugs in programs, such as Buffer/Heap Overflows and format-string vulnerabilities [22]. These vulnerabilities were the result of logical programming errors that allowed an attacker to arbitrarily write to or read from memory which should be unavailable, potentially leading to a full compromise of the system [14]. Despite the severity of such bugs, early on, the impact of these errors was not substantial to end users. However, as time progressed, and users began to trust software companies with private and commercial data, those vulnerabilities proved very damaging, and forced developers to adopt secure coding practices.

The modern landscape of cybersecurity involves many of the same vulnerability pitfalls of the past 30 years, as well as a litany of new attack surfaces due to the ubiquity of Internet activity and mobile devices. Remote code execution vulnerabilities in web services, such as SQL injection and Cross Site Scripting (XSS), have exploded in frequency in the past decade, and have dire implications for users and corporations. Figure 1.1 shows the top four most common threats to web applications as determined by the Open Web Application Security Project (OWASP) [41]. Because of the widespread availability of software services, and the integration of third-party tools and libraries, it becomes a nontrivial problem to manage the security posture of a given application. In order to maintain a tenable consumer market base, software companies must devote resources and manpower to develop standard procedures for securing software. One such standard, the Microsoft Secure Development Life-cycle, shown in Figure 1.2, demonstrates that security awareness is required in all stages of the development process—from training developers and technicians to the design, implementation, and maintenance of production software [8].

Software companies and service providers are responsible for rigorously uncovering and fixing bugs in software, and rely on a variety of tools and techniques to accomplish this task. Testing software for errors is vital to the software development life-cycle. Techniques such as manual code review and static analysis provide some insight into the behavior of an application. Unfortunately, they are insufficient when codebases become too large, or if visibility into source code or program internals is limited. Automated testing seeks to fill this void by quickly testing input vectors of applications and monitoring their handling of that input. This process faces some unique challenges—first, it must seek out all the possible branches of execution (coverage). Second, it should be equipped to determine if a program reaches an unsafe state (or crashes). Third, it should develop a process for crafting the input in an intelligent way to stress test input vectors without exhausting too much time.

Figure 1.1: OWASP Top 10 web vulnerabilities shows the frequency of web-based injection attacks, and the importance of defending against them

1.1 Fuzz Testing for Vulnerability Discovery

One of the most effective strategies for auditing software services for vulnerabilities is fuzz testing. Fuzzing is a technique for automatically generating crafted input, sending it to an application, and monitoring the behavior of that application in order to ascertain if a given input causes undefined (or nefarious) behavior [44]. The technique was developed by Barton Miller et al. at the University of Wisconsin. Miller's research team developed programs that crafted randomized input to test how command line utilities commonly found in UNIX-based operating systems would react. They were able to uncover errors in 25-33 percent of UNIX utilities [47], ushering in a new paradigm in software testing.

Figure 1.2: The steps of the Microsoft Secure Development Life-cycle
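The random-input technique just described is simple to reproduce. The sketch below is a minimal illustration, not the original study's harness: it streams random bytes into a command-line utility's stdin and flags crashes (death by signal) or hangs. The target `cat` and the byte counts are arbitrary stand-ins.

```python
import random
import subprocess

def random_bytes(n):
    """Generate n random bytes, mimicking line noise on a modem link."""
    return bytes(random.randrange(256) for _ in range(n))

def fuzz_once(cmd, payload, timeout=5):
    """Feed payload to a command's stdin; report whether it misbehaved.

    On POSIX, a negative return code means the process died from a
    signal (e.g., SIGSEGV), the classic crash symptom that Miller-style
    fuzzing watches for.
    """
    try:
        proc = subprocess.run(cmd, input=payload,
                              capture_output=True, timeout=timeout)
        return proc.returncode < 0
    except subprocess.TimeoutExpired:
        return True  # a hang also counts as undefined behavior

if __name__ == "__main__":
    # 'cat' is only a stand-in target; the original study fuzzed
    # dozens of UNIX utilities this way.
    for trial in range(10):
        if fuzz_once(["cat"], random_bytes(1024)):
            print(f"trial {trial}: target crashed or hung")
```

In a real campaign this loop would run for millions of trials, saving any payload that triggers a crash for later reproduction.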

From its inception in the late 1980s as a technique for testing UNIX utilities for defects [47], fuzz testing has become an established technique for discovering bugs and security vulnerabilities in software. Fuzzers have been able to uncover serious vulnerabilities in everything from file parsers and language interpreters to network protocols and binaries [44]. Different fuzzing strategies have advantages based on the target in question—mutation-based fuzzing, in which various portions of a valid test case are mutated (such as bit flipping for binary data), can uncover different sorts of software errors than generation-based fuzzing, which creates new test cases based on a model of expected input. Although simple fuzzing strategies still uncover software bugs today, the advent of modern security measures for various vulnerability classes has forced researchers to abandon simplistic fuzzing strategies in favor of intelligent systems that optimally traverse the input search space. Modern fuzzing strategies involve taking steps to reverse engineer the target in order to discover possible execution paths [42]. This situation is not always tenable—because of the feedback limitations of black-box fuzzing for targets such as web applications, alternative instrumentation techniques are required. Because the search space involved in fuzz testing is vast and often difficult to define, guided search using heuristics is a reasonable alternative.
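A minimal bit-flipping mutator of the kind mentioned above might look like the following sketch. The seed input is an arbitrary stand-in, not a format this thesis targets.

```python
import random

def bit_flip_mutate(data: bytes, flips: int = 8) -> bytes:
    """Return a copy of a valid test case with a few random bits flipped.

    Mutation-based fuzzing perturbs known-good input rather than
    generating input from a model, so most of the structure survives
    while edge cases in the target's parser are still exercised.
    """
    buf = bytearray(data)
    for _ in range(flips):
        pos = random.randrange(len(buf))      # pick a random byte
        buf[pos] ^= 1 << random.randrange(8)  # flip one bit in it
    return bytes(buf)

# Example: mutate a small, well-formed seed and hand the result to
# the target; the seed here is just an illustrative byte string.
seed = b"GIF89a\x10\x00\x10\x00"
mutant = bit_flip_mutate(seed, flips=4)
```

Generation-based fuzzers would instead build the input from a model of the format (as a peach pit does), trading setup cost for deeper reach into the parser.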

1.2 Fuzzing with Evolutionary Algorithms

This thesis explores a fuzzing strategy that addresses this problem using Evolutionary Algorithms, guided by fitness metrics based on the lexical and semantic structures represented in a corpus of positive examples. Classical fuzzing methods seek to generate fuzz data that targets boundary and format assumptions of input parsed by a target program. For this problem, the fundamental structures of known-bad attack strings are analyzed, and their components are used for the creation of new test cases. Evolutionary Algorithms are a good fit for optimizing fuzzing, because they can be easily incorporated in the process of input generation, and aid in the reduction of the search space. Scoring candidate solutions based on semantic structure gives an Evolutionary Algorithm enough structure to avoid perpetuating nonsensical candidate solutions while allowing for freedom to discover new payloads that uncover software vulnerabilities. The goals and advantages of the framework developed by this thesis include:

1. An intelligent reduction of search space and combinatoric complexity. Enumerating all possible permutations representing positive examples for even reasonably sized EA candidates quickly becomes infeasible. Grouping n-tuples of symbols into production rules of a grammar that accept candidates of a given type allows for more intuitive evaluation of evolved payloads.

2. The generation of unique exploit candidate solutions. This system uses an Evolutionary Algorithm to generate and select candidates through previously discovered solutions. Using the searching conventions at the core of Evolutionary Algorithms, coupled with the language-based fitness metrics derived from a corpus, this system attempts to generate payloads that are unique, but semantically similar to positive examples of known bad payloads.

3. Development of a language-theoretic basis for guiding an EA fuzzing framework. The fitness function at the core of the proof-of-concept fuzzing framework uses grammars representative of curated positive examples. These productions approximately describe the entire corpus of known-bad injection strings. The utilization of these grammars, combined with traditional application response monitoring, provides a formal manner by which to verify the fitness of a given solution, and can be extended to other languages and frameworks capable of semantic analysis.

The research described in this thesis is significant for several reasons. First, the generation of semantic-structure groups learned from positive examples will guide the Evolutionary Algorithm based on lexical tendencies of known exploitative input. Second, the approach can use those semantic structure groups to both measure fitness and generate payloads, creating more intuitive search guidance for the Evolutionary Algorithm fuzzing framework. Lastly, this research represents a step towards automatically building formally expressible grammars that can generate good candidate solutions, and can be extended to test other software systems and protocols.

Consider a large-scale end-user web application service that utilizes both front-end browser-based technologies (JavaScript) and back-end data stores (e.g., MySQL). As the size of the code base increases, manual code review may fail to identify even commonplace errors. Although static analysis and manual code review can remediate the existence of some vulnerabilities, the source code of an application is not always available. Black-box fuzzing—the term used to describe a testing scenario in which the source code of an application is unavailable—requires less overhead than grey/white-box methods (which have some or complete visibility into application internals, respectively). The key limitation of black-box fuzzing is its lack of source code knowledge, and limited visibility into application internals. This research intends to optimize black-box fuzzing by using the learned grammar structures of positive examples to promote semantically intuitive solutions and suppress non-conforming input generated by the Evolutionary Algorithm.
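As a rough illustration of grammar-guided scoring, the sketch below tokenizes injection strings into symbol classes and scores a candidate by how many of its symbol bigrams occur in a corpus of positive examples. The token classes, corpus, and bigram scoring are all invented for illustration: the thesis derives its attack grammars automatically from positive examples rather than using hand-picked classes or a fixed bigram model.

```python
import re

# Hypothetical token classes for SQL-injection strings (illustrative
# only; order matters, since earlier patterns win).
TOKEN_CLASSES = [
    ("QUOTE",   r"'"),
    ("BOOL",    r"(?i)\b(or|and)\b"),
    ("NUM",     r"\d+"),
    ("EQ",      r"="),
    ("COMMENT", r"--|#"),
    ("WORD",    r"[A-Za-z_]+"),
    ("WS",      r"\s+"),
]

def tokenize(payload):
    """Map an injection string to its sequence of symbol classes."""
    symbols, pos = [], 0
    while pos < len(payload):
        for name, pattern in TOKEN_CLASSES:
            m = re.match(pattern, payload[pos:])
            if m:
                if name != "WS":        # whitespace carries no structure
                    symbols.append(name)
                pos += m.end()
                break
        else:
            symbols.append("OTHER")
            pos += 1
    return symbols

def fitness(candidate, corpus_bigrams):
    """Fraction of the candidate's symbol bigrams seen in the corpus.

    Candidates whose local structure resembles known-bad payloads score
    higher; unseen bigrams are penalized, shrinking the search space
    toward grammatical injection strings.
    """
    syms = tokenize(candidate)
    bigrams = list(zip(syms, syms[1:]))
    if not bigrams:
        return 0.0
    return sum(1 for b in bigrams if b in corpus_bigrams) / len(bigrams)

# Build the symbol-pair model from positive examples.
corpus = ["' OR 1=1 --", "' OR 'a'='a"]
corpus_bigrams = set()
for example in corpus:
    s = tokenize(example)
    corpus_bigrams.update(zip(s, s[1:]))
```

Under this toy model, `' OR 2=2 --` scores 1.0 because its symbol sequence matches a corpus example exactly, while unrelated text scores near zero; the thesis's grammar-based metric serves the same discriminating role with richer structure.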

The fuzz testing campaign is managed by an Evolutionary Algorithm (EA), a searching strategy inspired by principles of biological evolution. Evolutionary Algorithms search for better solutions by discovering new candidates through the recombination of "fit" solutions in a population. Each chromosome (a possible solution) consists of either a sequence of grammar symbols representing an injection string, or grammar transitions that generate a given injection string. The Evolutionary Algorithm utilizes selection, crossover, and mutation to perpetuate new generations of chromosomes. The central idea that makes evolutionary algorithms effective is that chromosomes with higher fitness scores will be more likely to produce offspring (i.e., fitter chromosomes are probabilistically more likely to be selected for creating the next generation). Crossover involves combining two chromosomes in a manner to produce offspring for the next generation. In the context of our system, chromosomes will recombine grammar symbol groups or transitions in order to guide searching the input space based on the semantic information the chromosomes encode. Mutation of chromosomes is modeled by randomly changing symbols or grammar productions in a given chromosome. Mutation is a necessary convention for maintaining diversity within a population, to avoid stagnation or convergence on a suboptimal plateau in the search space.
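The selection, crossover, and mutation steps described above can be sketched with a toy Genetic Algorithm over symbol sequences. The alphabet, target pattern, and positional fitness function here are illustrative stand-ins for the grammar-based fitness the thesis actually uses.

```python
import random

# Chromosomes are sequences of grammar symbols; the alphabet and the
# "known-bad" target pattern below are toy stand-ins.
ALPHABET = ["QUOTE", "BOOL", "NUM", "EQ", "COMMENT", "WORD"]
TARGET = ["QUOTE", "BOOL", "NUM", "EQ", "NUM", "COMMENT"]

def fitness(chrom):
    """Toy fitness: positional agreement with a known-bad pattern."""
    return sum(a == b for a, b in zip(chrom, TARGET)) / len(TARGET)

def select(pop):
    """Fitness-proportionate (roulette) selection of one parent."""
    weights = [fitness(c) + 1e-6 for c in pop]  # avoid all-zero weights
    return random.choices(pop, weights=weights, k=1)[0]

def crossover(a, b):
    """Single-point crossover of two parent chromosomes."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chrom, rate=0.1):
    """Randomly swap symbols to maintain population diversity."""
    return [random.choice(ALPHABET) if random.random() < rate else s
            for s in chrom]

def evolve(pop_size=30, generations=40):
    """Run the generational loop and return the fittest survivor."""
    pop = [[random.choice(ALPHABET) for _ in TARGET]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop = [mutate(crossover(select(pop), select(pop)))
               for _ in range(pop_size)]
    return max(pop, key=fitness)

best = evolve()
```

The mutation rate is the knob that trades exploration against convergence: too low and the population stagnates on a plateau, too high and selection pressure is washed out.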

1.3 Contributions

The research outlined in this document contributes in the following ways:

1. Produces a framework for fuzzing at the application level using Evolutionary Algorithms (EA) to evaluate and create input.

2. Explores the utilization of grammar-based fitness evaluation for guiding an Evolutionary Algorithm-based fuzzer.

3. Analyzes the effectiveness of the proof-of-concept solution compared to other fuzzing methods, such as brute force, hand-selected payloads, and Markov Model Monte Carlo input generation.

The thesis proceeds as follows: the second chapter discusses the history and components of fuzz testing, various techniques for uncovering vulnerabilities with automated testing, and the advantages and limitations of those methods. The third chapter describes Evolutionary Algorithms through discussion of their various manifestations and use cases. The fourth chapter outlines the incorporation of EAs into fuzzing frameworks, and the strategies that represent the proof-of-concept approach to EA-based application fuzzing. Chapter five explains the testing environment by which our methods are evaluated, including an analysis of the results. Finally, chapter six draws conclusions based on the results of our testing, and discusses avenues of future research.

Chapter 2: Fuzzing

2.1 History of Fuzzing and its Fundamental Components

The process of testing software for logical errors and security bugs has long been a part of the recommended software development life-cycle [38]. Most nontrivial systems depend on a vast number of moving parts. From the design and development stages all the way to production deployment and maintenance, many different people write many lines of code, creating a situation in which errors are unavoidable. In order to alleviate the impact of serious bugs in actively developed software, rigorous testing of software is vital. For example, regression analysis ensures that new features and changes made to an existing codebase do not introduce new (or previously observed) software errors. Static code analysis performs source code checking to detect common errors that introduce security-related vulnerabilities. Both of these approaches are necessary weapons in the testing arsenal, but sometimes fail to account for unexpected input, or to detect incorrect implementations of correct processes. Automated input testing fills this void by sending crafted input to a running example of the program, monitoring the program's response for undefined behavior or traversal into an unsafe or inoperable state. Although dynamic analysis of software can be time consuming and resource intensive, it allows developers to cover the spectrum of necessary tests for ensuring software security when used in concert with static and manual analysis. One such methodology of this testing ilk is referred to as fuzzing.

Fuzz testing was officially formalized in an academic setting by Barton Miller et al. at the University of Wisconsin-Madison [47]. The idea of sending random input to popular UNIX utilities was first explored when a thunderstorm tampered with Miller's remote connection, sending random character input to his terminal and crashing the UNIX programs in use. Because the thunderstorm interfered with the modem used for the remote session, Miller called the idea "fuzzing" [50]. Inspired by the phenomenon, Dr. Miller designed a lab for the graduate students in his operating systems course, instructing them to write programs which send random input to common UNIX utilities and monitor the results. One student's submission uncovered parsing vulnerabilities in many command line utilities that were at the time considered stable. Thereafter, a software testing group formally explored fuzzing and demonstrated the widespread input handling errors that plagued many UNIX programs [47]. Today, fuzzing is an indispensable tool for security researchers and software development teams responsible for testing and quality assurance. In order to successfully deliver a fuzz testing campaign, fundamental components must be in place for the creation of input data and the monitoring of program behavior. The general steps involved in fuzz testing methods are shown in Figure 2.1. Sutton et al., in their comprehensive text on fuzzing methodologies, stipulate that although fuzzing methodologies vary widely, following certain guidelines makes results more likely [44].

The first step involved in the fuzzing process is to identify a target for the testing

campaign. In the literature, the target in question is often referred to as a “System

Under Test” (SUT) [39]. Although it is fair to say that any software system that

receives and processes user input is fair game, fuzzing is most e↵ective when input

directly a↵ects the program’s state in a measurable way. For example, fuzzing is

less e↵ective for identifying problems in cryptographic methods, because monitoring

vulnerabilities is di�cult, as opposed to fuzzing file format parsers, where the e↵ects

of the input (a crafted file) on the application are easy to monitor [44]. Fuzzers

are especially good at identifying implementation problems in parsers—programmers

10

Page 18: OPTIMIZING WEB APPLICATION FUZZING WITH GENETIC … · Chapter 1 Introduction..... 1 1.1 Fuzz Testing for Vulnerability Discovery . . . . . . . . . . . . . . . . 3 ... With the advent

Figure 2.1: The general steps involved in fuzzing campaigns

often make assumptions that expose applications to risks in edge cases, leading to problems ranging from memory corruption to remote code execution in web apps [43,51]. The best targets for fuzzing require knowledge of the protocol, of how input is interpreted, and a reliable way of determining an unsafe state (more on this later).

The second step in the fuzzing process requires one to identify input channels

of a target application [44]. Identifying the input channels of a target application is vital to reliably carrying out a fuzzing campaign; otherwise, there would be no reason to conduct fuzz testing in the first place. Although this point is obvious, the more subtle implication of this step is that when designing a fuzzer, it is vital to have knowledge of the application’s points of input (which are, in essence, its attack surface). Oftentimes, software errors are the result of incomplete or incorrect assumptions about input data, and fuzzing is an effective technique for exposing those errors. To this


point, model-based security testing is often combined with fuzzing, since modeling the flow of an application effectively reveals its attack surface [20]. Identifying the set of inputs for a given target also informs the manner in which data generation is conducted: knowing the protocols involved or the structure of input allows for intelligent design decisions (e.g., one would fuzz a file format parser in a completely different way than a login form on a web page). Furthermore, even with simple fuzzing techniques such as random mutation, it is vital to preserve the fields of input that an application interprets in order to be effective [44].

To make certain that a given testing campaign properly searches the input space within a reasonable amount of time and computing resources, the task of data generation must be at the heart of an effective fuzzing strategy. Enumerating all possible inputs for a given target, however thorough, would be untenable from the standpoint of time and computing power. Target information and knowledge of the attack surface (determined by the inputs available for a given target application) guide the components available for input data generation [44]. For example, for file format parsers and network protocols, specifications for the format of data will be agreed upon as standard and (hopefully) documented in an RFC or equivalent text. Even for proprietary protocols, it is possible to capture traffic or examples and make assumptions about the underlying structure [44]. It is important to generate negative test cases that the SUT will evaluate as structurally expected input. Holler et al., in their framework for fuzzing web browser Javascript interpreters, were able to uncover numerous bugs by requiring their fuzzer to generate test cases that were syntactically valid [36]. The efficacy of protocol fuzzers falls in the same category: preserving packet structure is vital if test cases are to exercise the underlying processing logic instead of being cast aside by basic error checking.

In order to achieve quality test results, the data generation stage must be sensitive


to the protocol at hand. For network protocols in particular, RFC information allows a developer to determine the structure of a packet (and of a stream of packets). The various structural components of the protocol are the places in which to insert generated data, so it is vital for the fuzzer to comply with the standard set for field widths and sequential order.
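To illustrate why field widths and ordering matter, the following sketch fuzzes only the payload of a toy protocol while rebuilding the header to specification with `struct`. The protocol layout here is invented purely for illustration:

```python
import random
import struct

# Toy protocol layout (hypothetical): 2-byte magic, 1-byte version,
# 2-byte payload length (big-endian), then the payload itself.
MAGIC = 0xC0DE

def fuzz_payload(sample: bytes, n_flips: int = 4) -> bytes:
    """Bit-flip only the payload; header fields are rebuilt to spec afterwards."""
    data = bytearray(sample)
    for _ in range(n_flips):
        pos = random.randrange(len(data) * 8)
        data[pos // 8] ^= 1 << (pos % 8)
    return bytes(data)

def build_packet(payload: bytes, version: int = 1) -> bytes:
    """Pack header fields at their specified widths and in their specified order."""
    return struct.pack(">HBH", MAGIC, version, len(payload)) + payload

random.seed(7)
packet = build_packet(fuzz_payload(b"user=admin;action=login"))
```

Because the magic, version, and length fields are regenerated to match the standard, the SUT's basic validation passes and the fuzzed payload actually reaches the deeper parsing logic.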

After those steps are completed, the crafted data is sent to the application. Sutton et al. explain in great detail the variety of input channels where fuzz-generated data should be injected for a given target [44]. For local applications such as desktop binaries, command-line arguments and environment variables are prime channels by which to submit fuzzed data. Remote fuzzing campaigns, such as fuzzing an FTP session or a web application, involve slightly more setup: fuzzer-generated input is often sent over the network by a tool such as scapy, or by a combination of a browser emulator and an HTML parsing library [1,12,13]. Effective fuzzing campaigns require information gathered from target application analysis to ensure the application does not ignore the crafted input for failing to meet the required data specifications. In fact, when the feedback loop is limited, as in remote fuzz testing, it is important to carefully format test cases. In this research, instead of relying on application feedback to develop properly structured test cases, semantic structures from positive examples are evaluated and used to intelligently guide test case generation. The efficiency of this approach is discussed in later chapters.

Once the fuzzer’s crafted input has been sent to the application, it is necessary to monitor the SUT’s response and ensure that the results are thoroughly recorded for further analysis. In the ideal environment, a monitoring harness watches the execution of the target system and is alerted when exceptions, crashes, or interrupts are invoked. In the event that a fuzzer sends a negative test case that causes an error, for the sake of reproducibility, metadata regarding the program’s state as well


as the crafted input itself should be logged for analysis [44]. The human element of fuzzing at this stage involves analyzing these events to identify whether or not the root cause of an error has security implications. Although utilities such as !exploitable are useful for triaging exceptions and crash states according to their potential exploitability, they are still limited [52]. These methods are adept at discovering exploit potential in memory corruption errors, but there is a fundamental limit to what computer programs can determine regarding the security posture of another system. At this stage, the results of the fuzzing campaign are best left to human judgment and inspection.
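A minimal monitoring harness in this spirit might wrap the SUT in a subprocess, treat death-by-signal or a timeout as a crash, and log the exact input for reproduction. This is a sketch under the assumption of a POSIX target, not a production harness:

```python
import subprocess
import sys

def run_case(argv, test_case: bytes, timeout: float = 5.0) -> dict:
    """Execute the SUT on one test case and record enough metadata to reproduce it."""
    try:
        proc = subprocess.run(argv, input=test_case,
                              capture_output=True, timeout=timeout)
        # On POSIX, a process killed by signal N exits with returncode -N.
        record = {"returncode": proc.returncode,
                  "crashed": proc.returncode < 0,
                  "stderr": proc.stderr.decode(errors="replace")[:200]}
    except subprocess.TimeoutExpired:
        record = {"returncode": None, "crashed": True, "stderr": "timeout"}
    record["input_hex"] = test_case.hex()   # log the exact input for reproducibility
    return record
```

Logging the input in hex alongside the exit metadata is the minimum needed to replay a finding later; a fuller harness would also capture core dumps or debugger state.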

The efficiency of a fuzzer is typically measured by how well it finds bugs, and this is directly correlated with a fuzzer’s ability to explore the execution paths of a target application. This is referred to as “code coverage” and is a common metric for measuring the success of a fuzzer. Sutton et al. remark that this is a measurement of the “amount of process state a fuzzer induces a target’s process to reach and execute” [44]. Fuzzers attempt to find vulnerabilities and crash states arising from the mishandling of user input, and the best way to gauge whether a fuzzer is likely to uncover such a bug is to measure the percentage of execution paths that are covered. Another important piece of measuring the efficacy of a fuzz testing campaign or technique lies at the monitoring stage: error detection is vital to determining, via a debugger or other heuristics, whether a given input causes a crash. Finally, resource constraints are ever-present and require the developers of fuzzers to design and implement efficient code. Together, these factors serve as a measuring stick for fuzzing tools and techniques.


2.2 Present Day Fuzzing

2.2.1 General Purpose Tools and Techniques

Miller et al.’s first paper on testing UNIX utility reliability with random input demonstrated the simplicity of software testing at that time. Before then, software systems were primarily tested for accuracy and efficiency down “happy paths”, that is, under execution environments that were expected according to the specifications of the software itself. Although these testing methodologies can determine whether or not a piece of code performs calculations correctly and efficiently with expected input, they do nothing to examine the manner in which a program handles malformed (or malicious) data. The arrival of fuzz testing brought with it a new mindset on testing approaches, and on the responsibility of programs to reliably and safely handle input.

In concert with Miller et al.’s approach, the first fuzzing tools were concerned with injecting random data as input to applications and monitoring for exceptions and crashes [44]. In the early stages, this method was very effective: most software was not written defensively, or with any sort of security mindset. Fuzz testing was therefore adept at triggering errors such as memory access violations, because it placed programmer assumptions under stress through unexpected input. Although today these methods would be considered elementary, they set the stage for academic and industry research into intelligent, informed software testing. Mutation-based fuzzing is the process of taking a valid input sample (one which would be correctly parsed by the program in question), mutating it semi-randomly, and using it as a negative test case against a system under test (SUT). The primary advantage of this method is speed: minimal setup is required, and because little or no time is spent modeling the structure of the data, fuzzing campaigns are executed relatively


quickly. File formats and plaintext protocols with easily identifiable field values are prime targets for mutation fuzzing. One such manifestation of Mutation-based fuzzing tools is zzuf: this fuzzer intercepts network traffic and file input and performs “bit flipping” on program input, which is literally the act of randomly changing a bit from “0” to “1” (or vice versa) [18]. The disadvantage of this system is that its success is contingent upon the quality of the available samples [6]. Because of the chaotic nature of randomly flipping bits of input, the fuzzer runs the risk of mangling the test case past the point at which the target program will even accept it. The deficiencies of pure Mutation-based fuzzing prompted researchers to develop fuzzers that make use of data modeling and other analytical approaches.
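That trade-off can be illustrated with a small experiment: a zzuf-style bit flipper applied to a sample whose (invented) format begins with a 4-byte magic header. A gentle mutation ratio usually survives the parser's first check, while an aggressive one mangles the sample past acceptance:

```python
import random

def bit_flip(sample: bytes, ratio: float) -> bytes:
    """Flip roughly `ratio` of the bits in a valid sample, zzuf-style."""
    data = bytearray(sample)
    n_bits = len(data) * 8
    for _ in range(max(1, int(n_bits * ratio))):
        pos = random.randrange(n_bits)
        data[pos // 8] ^= 1 << (pos % 8)
    return bytes(data)

def parser_accepts(blob: bytes) -> bool:
    """Stand-in for the target's parser: rejects a damaged 4-byte magic header."""
    return blob[:4] == b"RIFF"

random.seed(3)
sample = b"RIFF" + bytes(64)
gentle = sum(parser_accepts(bit_flip(sample, 0.004)) for _ in range(200))
harsh = sum(parser_accepts(bit_flip(sample, 0.25)) for _ in range(200))
```

With a low ratio most mutated samples still pass the header check and exercise the deeper parsing logic; with a high ratio nearly all are rejected at the front door, which is precisely the mangling risk described above.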

The other main category of fuzz testing strategies is Generation-based fuzzing. Generation-based fuzzers seek to model the data accepted by a target application and to comply with its specification (or violate it, but intelligently) when injecting crafted input. Sutton et al. refer to Mutation-based fuzzers as “dumb brute force” and Generation-based fuzzers as “intelligent brute force”: although the input crafting mechanisms of both methods rely on randomly changing input, Generation-based fuzzers go to great lengths to ensure a test case follows the specification of a data model [44]. Peach, a popular Generation-based fuzzing tool, requires a user to create data models for the framework to use as guides for its fuzzing engines [11]. These “peach pits” are the basis for the input generation phase of the fuzzing campaign.

An example is shown in Figure 2.2. The Peach fuzzing framework requires an XML-structured description of the data model, as well as type-specific information useful for fuzz testing campaigns (e.g., field length, expected content type, delimiters, etc.). Although Generation-based fuzzing clearly requires more work up front to model the data in question, its efforts are not without results. Miller and Peterson demonstrated that for modern targets, Generation-based fuzzing was much more efficient, able to


Figure 2.2: An example excerpt of a peach pit used for Generation-based fuzzing [11]

achieve coverage 76 percent better than mutation-based fuzzing of the same targets

[48].

One of the most effective methods of boosting the effectiveness of these algorithms involves boundary checking. Boundary checking is the act of testing values at the edges of expressible ranges in order to trigger vulnerabilities such as integer overflows, which cause memory access violations and can lead to a full compromise of the target system [44]. When these tests are combined with random mutations inside the fields specified by a data model in Generation-based fuzzing, the likelihood of triggering vulnerabilities increases substantially. A figure from Sutton et al. demonstrating common boundaries to test is shown in Figure 2.3. This example shows a group of interesting values for testing the boundaries of certain data type widths (MAX32, for example, referring to the maximum value for a 32-bit integer). Test case generation for classical fuzzing seeks to create values with high potential to cause problems or expose improper programmer assumptions, and boundary testing with the values shown in Figure 2.3 is an effective means to this end.
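Boundary-value generation in this spirit can be sketched as follows; the exact values produced here are illustrative, not Sutton et al.'s published list:

```python
# Maximum values for common unsigned integer widths.
MAX8, MAX16, MAX32 = 2**8 - 1, 2**16 - 1, 2**32 - 1

def boundary_cases(width_max: int, delta: int = 2) -> list:
    """Integers at and around the edges of a type's expressible range."""
    cases = set()
    for d in range(delta + 1):
        cases.update({d,                    # near zero
                      width_max - d,        # just under the maximum
                      width_max + d,        # just past it (overflow territory)
                      width_max // 2 + d})  # around the signed/unsigned split
    return sorted(cases)
```

Feeding values like `MAX32` and `MAX32 + 1` into a length or count field is a cheap way to probe for integer overflow in the arithmetic that consumes it.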


Figure 2.3: Boundary testing recommendations according to Sutton et al. [44]

2.2.2 Modern Fuzzing

Modern fuzzing techniques still follow the same basic steps described above, but with more refined metrics for intelligently searching execution paths and generating data. Reverse engineering binaries and file format parsers to enumerate their execution paths has recently become a standard method for optimizing fuzzing. For local binaries, it is possible to attach debuggers to running processes not only to monitor changes to the target application, but also to make sense of its execution paths [44].

Seagle, in his framework for file format fuzzing, made use of a reverse engineering

framework which found the execution paths of his target and fed that information

back to the fitness function of his Genetic Algorithm [53]. The class of fuzz testing

techniques that attach debuggers to running processes and use that channel to in-

ject test cases is called in-memory fuzzing. In-memory fuzzers are very e↵ective for

their visibility into application internals, and the fact that the possibility exists to

set a breakpoint, save the machine state, execute crafted input, and return to the

previous place, allowing for fine-grained control of executing the program with sent


data and monitoring its response [44]. Vincenzo Iozzo demonstrates that expensive reverse engineering operations for preprocessing can be avoided by applying function hooking during the fuzzing process, and by pruning the execution paths explored through a combination of real-time debugging and heuristics. By measuring the cyclomatic complexity of a given function and performing loop detection, fuzzers can use these heuristics to search program execution paths more efficiently, thereby achieving more effective code coverage [40]. Other approaches to fuzzing involve measuring the influence of injected data on program state, known as taint analysis. This method is simply the process of marking process states where untrusted data has been injected or evaluated, in order to help fuzzing heuristics make better decisions regarding data generation and code coverage. Bekrar et al. demonstrate that traditional fuzzing frameworks informed by metaheuristics and taint analysis allow for more efficient determination of exploitable bugs [19]. Iozzo also relies on this method in his framework, which allows for measuring data’s propagation through the execution of a program [40].

2.2.3 Web Application Fuzzing

Application-level fuzzing of web applications has uncovered bugs ranging from memory corruption vulnerabilities in underlying system software to data exfiltration and session hijacking via SQL injection and Cross-Site Scripting (XSS) [44]. For the purposes of this thesis, a web application or web service (used interchangeably in this document) is a computer program executed on a remote server that responds to clients connecting via the HTTP protocol. The web application targets of interest in this research are those which receive and process input from a client. Fuzzing web applications for injection vulnerabilities is intuitive, because the process of identifying a target’s attack surface is trivially easy. Furthermore, although application


internals are not usually available, much work has been done in curating sets of effective injection strings for a wide range of web vulnerabilities [49]. Despite its intuitive nature from the standpoint of target identification and input generation, application-level fuzzing of web services suffers from a fatal flaw: most of the time, visibility into a target application’s internals is limited or nonexistent. This means that measuring a fuzzer by code coverage requires monitoring capabilities that are not available. Even then, the amount of work required to set up such a system should force testers and researchers to question whether other testing methods (or manual review) are better suited to the task at hand. Thus, the monitoring requirement of fuzzing must be modified in order to determine whether or not a vulnerability has been uncovered. At the application level, this could simply involve parsing the resulting HTML for a desired response, as in Duchene et al. [26].
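Such response parsing can be sketched as a crude oracle that classifies how an injected probe string came back; the probe string and helper names below are hypothetical, not taken from any of the cited tools:

```python
import html

def classify_reflection(response_html: str, payload: str) -> str:
    """Crude response oracle: how did the injected payload come back?"""
    if payload in response_html:
        return "unescaped"   # likely executable; flag for human review
    if html.escape(payload) in response_html:
        return "escaped"     # reflected but neutralized by output encoding
    return "absent"          # filtered out or not reflected at all

XSS_PROBE = '<script>alert("fuzz-1337")</script>'   # hypothetical probe string
```

An "unescaped" classification replaces the crash signal that binary fuzzers get from a debugger: it marks a candidate vulnerability without any visibility into the server's internals.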

Most web application fuzzing frameworks contain a crawler that finds webpages and potentially vulnerable input forms, and a set of known-bad payloads that encompass typical exploit vectors such as directory traversal, SQL injection, and Cross-Site Scripting (XSS) [44]. Tools such as Burp Proxy have the ability to spider an application and apply a set of test cases to a given input form [2]. Another widely popular web application security tool is w3af, a web service scanner that uses a curated set of well-known injections to test for common web application vulnerabilities [17]. In the same vein, the now-inactive JBroFuzz provided a general-purpose GUI-based fuzzing framework covering a range of scanning techniques for discovering web application vulnerabilities [9]. Most of these tools are adept at covering low-hanging fruit, and do not provide any exploitation information: they can only determine whether a given entry point is potentially exploitable.

A large majority of research questions explored regarding fuzzing and web appli-

cation targets involve testing a web service for client-side Javascript execution bugs.


Tripp et al., in their research on optimizing Cross-Site Scripting vulnerability testing, parsed the output following the execution of a negative test case and used that information to prune their set of test cases, culling a large set of positive examples based on the web application’s response [57]. Their research shows that despite the previously significant problem of a limited feedback loop when fuzzing web applications, heuristics and response analysis can guide input generation. By characterizing their large corpus of Cross-Site Scripting (XSS) examples according to their tokens, injected payloads that were filtered or otherwise rejected inform the next test cases attempted. Tripp et al. demonstrate the ability to uncover vulnerabilities by efficiently pruning the space of test cases from their original corpus [57].

The technique of using heuristics to generate test cases is very similar to taint analysis, and is a popular method of information gathering in the monitoring stage of the fuzzing cycle. Model inference testing in the fuzz testing space attempts to determine the impact of a negative test case upon a target application. This information can be utilized as feedback for the intelligent generation of new test cases. Duchene et al. used taint analysis along with a grammar-based genetic algorithm to uncover Cross-Site Scripting (XSS) vulnerabilities [26]. Wang et al. used a hidden Markov model based on Bayesian probability distributions to generate test cases for uncovering Cross-Site Scripting (XSS) vulnerabilities. Their work theorizes that injections are the combination of attack vector and payload, making the primary goal to determine the attack vector necessary for injecting a payload. Similar to Tripp et al.’s research focus of tree pruning based on application response to injection, Wang et al. attempt to learn from the target application’s response to crafted input and probabilistically generate new injection payloads according to a Bayesian probability distribution [59]. Although the results of their methods contained numerous false positives, their approach has merit: building a probabilistic model for generating


tokens allows for search-space flexibility not offered by Tripp et al.’s approach [59]. This addresses one of the main disadvantages of common fuzzing frameworks, which historically were not designed to intelligently guide the manner in which test cases are generated.

Figure 2.4: A vulnerable input form that can be exploited using SQL injection

An example of a PHP script vulnerable to SQL injection is listed in Figure 2.4, courtesy of a purposely vulnerable web application made for testing and training called DVWA [5]. In this case, a PHP script receives the id parameter from a GET request. The user-controlled data in the $id parameter is evaluated on the server as raw SQL, allowing a malicious user to execute arbitrary commands. This can lead to data exfiltration, or even complete compromise of the backend machine. In particular, this case demonstrates a system oblivious to security concerns. Most modern web applications contain some measure of security, in the form of blacklists or regular expressions that attempt to filter out and/or detect malicious input. Hansen and Patterson show the ineffectiveness of using regular languages and pattern-based filters to defeat malicious input that is by nature context-free [34]. In the spirit of


their development of a language-theoretic basis for security, this research attempts to guide fuzzing based on the lexical structure of positive examples. By focusing on approximating the “attack language” for a given class of vulnerability, this research explores the use of language theory to optimize fuzz testing, and moves towards a verifiable language-based reasoning for the effectiveness of certain test cases.
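The flaw in Figure 2.4 can be mirrored in Python (a hypothetical analogue using sqlite3, not the thesis's or DVWA's code): user input concatenated into raw SQL dumps the whole table, while a parameterized query treats the same input as inert data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, first TEXT, last TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "admin", "admin"), (2, "alice", "liddell")])

def lookup_vulnerable(user_id: str):
    # User input spliced directly into raw SQL, mirroring the PHP example.
    query = f"SELECT first, last FROM users WHERE id = {user_id}"
    return conn.execute(query).fetchall()

def lookup_safe(user_id):
    # Parameterized query: the driver treats the input strictly as data.
    return conn.execute("SELECT first, last FROM users WHERE id = ?",
                        (user_id,)).fetchall()

leak = lookup_vulnerable("1 OR 1=1")   # classic injection: dumps every row
```

The injected condition `1 OR 1=1` rewrites the query's logic in the vulnerable path; in the parameterized path the same string simply fails to match any id.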

2.3 Fuzzing and Genetic Algorithms

Evolutionary Algorithms are a prime candidate for optimizing fuzz testing by influencing input creation in an intelligent way. Many researchers have successfully incorporated Genetic Algorithms into their fuzzing frameworks for targets ranging from file formats to network services and web applications [26,42,56]. Over the past decade, fuzz testing frameworks which utilize evolutionary algorithms to intelligently guide input generation have seen tremendous success in both academic and applied settings [32,53]. Thanks in no small part to the power of modern hardware and the availability of distributed systems, the complexity concerns that once rendered the use of genetic algorithms inefficient are no longer prohibitive [53].

Sherri Sparks et al. proved the value of Genetic Algorithms in optimizing solution-space searching for fuzzers in the development of their program SIDEWINDER [56]. After disassembling the System Under Test, execution paths of interest are enumerated based on whether or not they contain unsafe function calls [56]. Subgraphs containing these functions of interest (because of their propensity to be involved in unsafe operations) are then separated for further analysis. The next step involves their Genetic Algorithm: each chromosome encodes production rules of a Context-Free Grammar (CFG), which are then used in conjunction with probabilities of path traversal across known problematic subgraphs [56]. At every point of execution for a given negative test case, the probabilities of going to the next node in the graph


are calculated. Fitness is boosted if new edges are explored, and the Markov Model heuristic used at the core of their fitness function is updated [56]. This research demonstrates the efficacy of using Genetic Algorithms and Context-Free Grammars to create new test cases based on program feedback. Similarly, DeMott et al. performed grey-box evolutionary fuzzing on targets, but used a more traditional Genetic Algorithm search heuristic [25]. Grey-box fuzzing assumes that source code for a target application is unknown, but that binary internals (including assembly code) are available for analysis. In the same manner as Sparks et al., DeMott et al. first reverse engineer the target application to locate and categorize its execution paths [25]. Based on a valid sample of a test case, their system builds a population and performs traditional Genetic Algorithm operations on individual chromosomes. Fitness is scored based on how many branches of execution a given test case follows, and on the distance between the current node of execution and a desired target (determined during static analysis). Test cases that find new branches are especially promoted among the pool of candidate solutions [25]. The work of Sparks et al. and DeMott et al. represents an important step in using Genetic Algorithms and a heuristic feedback loop to optimize fuzzing strategies.
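A toy version of this evolutionary loop, with branch coverage stubbed out as a hand-written scoring function (real frameworks obtain it from binary instrumentation), might look like:

```python
import random

def coverage(case: bytes) -> int:
    """Stub fitness: a stand-in for instrumented branch coverage of the SUT."""
    score = 0
    if case[:1] == b"A":
        score += 1            # hypothetical branch: magic first byte
    if len(case) > 8:
        score += 1            # hypothetical branch: long-input path
    if b"\xff" in case:
        score += 2            # hypothetical branch: sentinel byte path
    return score

def crossover(a: bytes, b: bytes) -> bytes:
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(case: bytes) -> bytes:
    data = bytearray(case)
    data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

def evolve(seed: bytes, generations: int = 30, pop_size: int = 20) -> bytes:
    """Truncation selection with elitism: high-coverage cases breed the next wave."""
    pop = [mutate(seed) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=coverage, reverse=True)
        elite = pop[: pop_size // 4]
        pop = elite + [mutate(crossover(random.choice(elite),
                                        random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=coverage)
```

Swapping the stubbed `coverage` for real feedback from a debugger or instrumented binary is what separates this sketch from the systems of Sparks et al. and DeMott et al.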

Roger Seagle explored the efficacy of incorporating a nonstandard Genetic Algorithm called CHC to perform file format fuzzing [53]. His fitness heuristics combine execution graph heuristics with considerations regarding function characteristics (e.g., the number of arguments and local variables, the number of assembly instructions, etc.) [53]. A catalogue of the fitness function considerations is shown in Figure 2.5. The resulting work, a distributed fuzzing framework entitled “Mamba”, is a collection of Genetic Algorithm-based fuzzing strategies that were able to find more unique defects than comparable file format fuzzing tools [53].

Figure 2.5: Fitness heuristic categories considered by Seagle [53]

As mentioned in the previous section, Fabien Duchene et al. demonstrate the ability of Genetic Algorithms alongside model inference and taint analysis to produce fuzz data that

uncovered Cross-Site Scripting (XSS) attacks in a reliable manner. Cross-Site Scripting vulnerabilities emerge when front-end web code does not safely escape dynamic HTML and Javascript. Using a crafted input string, an attacker can execute code that can be used to perform session hijacking and remote code execution [4]. Duchene et al. used a Genetic Algorithm and taint-based heuristics to fuzz a variety of purposely vulnerable testing websites [26]. Similar to Sparks et al., this research encoded the chromosomes of a population as productions of a manually written “attack grammar” tailored to uncovering Cross-Site Scripting (XSS) vulnerabilities. In this document, the term “attack grammar” describes a grammar used for the creation of strings with a propensity towards uncovering a class of vulnerabilities related to input injection (e.g., SQL or LDAP injection, Cross-Site Scripting, etc.). Such a grammar can be developed manually by an expert, or inferred from positive examples. Although it is impossible to exhaustively account for all the strings of a given language which uncover injection vulnerabilities, human intuition (or automatic inference from positive examples) can approximate the fundamental grammatical components involved.

An excerpt of the attack grammar is shown in Figure 2.6 [26]. The open problem


of automatically deriving attack grammars raised in this work is the basis for the research presented in this thesis.

Figure 2.6: An excerpt of a manually-written attack grammar for finding Cross-Site Scripting vulnerabilities [26]
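The idea of an attack grammar can be sketched as a handful of production rules expanded by random derivation; the productions below are invented for illustration, not taken from Duchene et al.'s grammar:

```python
import random

# A tiny illustrative attack grammar for XSS-style injections.
GRAMMAR = {
    "<attack>": [["<break>", "<payload>"]],
    "<break>": [['"'], ["'"], ["'>"], []],        # escape the host context
    "<payload>": [["<tag>", "<js>", "</script>"]],
    "<tag>": [["<script>"], ["<script >"]],
    "<js>": [["alert(1)"], ["document.cookie"]],
}

def derive(symbol: str) -> str:
    """Expand a nonterminal by randomly choosing one of its productions."""
    if symbol not in GRAMMAR:
        return symbol                              # terminal: emit verbatim
    production = random.choice(GRAMMAR[symbol])
    return "".join(derive(s) for s in production)

random.seed(2)
candidates = [derive("<attack>") for _ in range(5)]
```

Every derivation is structurally a plausible injection, which is exactly the property that makes grammar-driven generation more efficient than blind mutation: no test case is wasted on strings the target would reject outright.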

2.4 Grammar Fuzzing

The term Grammar Fuzzing refers to the subset of fuzzers whose targets are language parsers, compilers, and runtimes. Grammar fuzzing has been successful in uncovering a large number of web browser bugs and vulnerabilities due to incorrect Javascript parser implementations. Fuzz testing against language interpreters has a long history of success at finding parser vulnerabilities in software. One of the most frequent targets of this type of fuzzing is the web browser. Zalewski’s mangleme browser fuzzer is one of many tools aimed at testing a browser’s parsing engine for implementation bugs [44]. The mangleme fuzzer was designed to cause crash states


as a result of improper handling of HTML input. Guo et al., developed a technique

for testing Javascript parser engines by taking fragments of valid Javascript code,

and reorganizing them to produce negative test cases [33]. Holler et al. developed

a similar tool for testing web browerser Javascript parsing engines in Firefox [36].

Their fuzzing framework was incorporated into a regression testing suite for Firefox.

The fuzzer exercised each new release by generating negative test cases, produced

by recombining fragments of Javascript code, with which to test a given interpreter [36].

Using this method, their team uncovered

160 bugs in the Mozilla Firefox browser’s Javascript parsing engine [36]. In parallel

but unrelated work, Yang et al. developed a language fuzzing tool called “blendfuzz”,

which used “grammar aware mutation” to take valid test cases and rearrange valid

subgraphs to produce test cases that test a language interpreter’s correctness [61].

2.5 Advantages and Limitations

Fuzz testing is an effective technique for uncovering errors in software systems that

process user-controlled input. This has special gravity in the context of a program’s

security posture: oftentimes, fuzzers find bugs that allow an attacker to

craft input that exfiltrates sensitive data, causes denial-of-service conditions, or leads to remote code

execution.

One of its most important advantages is the speed with which fuzzing frameworks

can find software bugs and vulnerabilities. Although fuzzing campaigns can have long

execution times, fuzzing is still much faster than employing humans to comb through huge

codebases in search of bugs. Furthermore, operating tests on a live manifestation of

an application uncovers bugs that are lost in manual code review and other testing

tasks subject to human fallibility. The modern day fuzzing landscape brings with it

a variety of frameworks and tools with specialized targets, allowing for developers to


easily incorporate fuzzing into their software testing methodologies. Fuzzers are adept

at finding the “low-hanging fruit” of vulnerability classes, and are well suited for general-purpose

vulnerability assessment. Fuzzers also add a dimension to software testing by creating

inputs that humans would not be likely to conjure up themselves. This research

demonstrates the usefulness of software testing that combines heuristics which encode

“intuition”, with the freedom of Genetic Algorithm solution searching.

Fuzzing is not without its limitations, however. For starters, fuzzing techniques

are only capable of signaling that an exception or crash state has been triggered, not

whether a vulnerability is actually present. This makes fuzzers ineffective at the

discovery of complicated, multi-step vulnerabilities [44]. Another limiting aspect of

fuzzing is the fact that general purpose crash analysis, for the most part, remains

a manual human endeavor. While humans are often able to intuit the exploitability

of a given software bug, there is a fundamental limit to the ability of computer

programs to determine whether or not a software error represents a vulnerability

that can be compromised. With respect to Generation-based fuzzers, or any fuzzers

dependent on a data model, complexity increases exponentially as a function of input

specification. In other words, the more complex the data model becomes, the more a

fuzzer will consume design and computing resources. Oftentimes, ensuring an efficient

fuzz testing campaign can be impossible because the search space is too large to

enumerate. Thus, search space optimizers such as Genetic Algorithms have recently

become in vogue.

Fuzz testing for software defects and vulnerabilities has been proven to work across

a variety of targets and protocol specifications since the late 1980s [44]. Although it

has limitations—not the least of which, the monitoring of applications for undefined

behavior and the analysis of negative test cases post-campaign—fuzzing has secured

itself as a mainstay technique for security researchers and software testers. Rudimentary

fuzzing techniques, such as random Mutation-based and Generation-based fuzzing,

are surprisingly effective at uncovering critical vulnerability classes. Modern

research focuses on informing fuzzing methods with heuristics to intelligently guide

test case generation, in order to combat dumb fuzzing’s inability to learn from the

SUT’s response to previous test cases.


Chapter 3: Evolutionary Algorithms

3.1 History

Evolutionary Algorithms describe the set of algorithms—typically, focused on func-

tion optimization and search space reduction—whose fundamental components are

inspired by phenomena observable in evolutionary processes in the natural world.

The ideas and inspiration that ushered in the emergence of evolutionary computing

go back to the mid-20th century. Alan Turing, in “Computing Machinery and Intelligence”,

the same text in which he famously proposes his “imitation game”, speculates

about a scenario in which a machine would be modeled after “the mind of a child”, with

the ability to receive sensory input, learn from stimuli, and use that prior information

to make inferences and conclusions regarding new encounters [58]. His description

of the learning capabilities of machines is steeped in the language of evolution, and

the fact that his analogy uses this language helps explain the rise of evolutionary

algorithms, and frames the way computer scientists conceived problem spaces in the

mid-20th century. Evolutionary Algorithms describe the subset of optimization algo-

rithms that attempt to solve search problems by modeling them after processes found

in natural evolution. Inspired by the works of Charles Darwin, computer scientists

began to research methods by which to mimic the evolutionary processes for solv-

ing mathematical problems. The evolutionary processes that

promote good qualities in species and suppress undesirable traits can be emulated

in computing, and can be used to efficiently optimize algorithms, especially those

concerning complex search spaces.

The first recorded examples of modeling evolutionary principles to solve com-

putational problems are found in work by Friedberg et al. in the late 1950s, which was


concerned with “finding a program that calculates a given input-output function” [24].

Bremermann’s work in 1962 showed early use of “simulated evolution” for the task

of numerical optimization functions [24]. In the mid-1960s, Lawrence Fogel and John

Holland published groundbreaking research in evolutionary programming

and genetic algorithms (respectively), setting the stage for the formalization of this

subfield of machine learning [24]. Since then, Evolutionary Algorithms have been ap-

plied to a wide array of optimization problems in numerical methods, engineering,

and computer security.

3.2 Genetic Algorithms

The term Genetic Algorithm describes the subset of Evolutionary Algorithms that

mimic evolutionary conventions to solve optimization problems. John H. Holland,

widely considered the “Father of Genetic Algorithms”, was inspired by the works of

Darwin, and the ability of natural evolution processes to find solutions to biolog-

ical problems. Genetic Algorithms attempt to solve problems by first establishing

a group of solutions (population), which in essence, represent the “gene pool” of a

given solution space. Each individual candidate solution (chromosome) is evaluated

by a fitness function, which measures how well it solves the problem at hand (or whether

a correct solution has been found) and assigns it a fitness value. The fitness scores

(which, calculated by the fitness function, are numerical representations of how well a

given chromosome solves the problem in question) determine selection, the process by

which a chromosome is chosen for the creation of the next generation’s chromosomes.

A new population is then created by the crossover operation, where the members

of the current population are selected and recombined with other chromosomes. In

typical scenarios, chromosomes with high fitness scores are more likely to be selected

for crossover (“survival of the fittest”), to ensure the genotype of a high-scoring

candidate will propagate to the next generation [45]. Pseudocode for the algorithm is

shown in Algorithm 1.

Algorithm 1 Genetic Algorithm

 1: procedure Genetic Algorithm(popsize, numgens)    ▷ GA run with popsize chromosomes and numgens generations
 2:     initialize population()
 3:     calculate fitness()
 4:     while n ≠ numgens do
 5:         selected parents ← select(population)
 6:         CPOP ← crossover(selected parents)
 7:         mutate operator(CPOP)
 8:         population ← CPOP
 9:         calculate fitness()
10:         n ← n + 1
11:     end while
12: end procedure
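The loop in Algorithm 1 can be sketched in Python (a minimal illustration, not the thesis framework's implementation; the bit-string encoding, the onemax fitness function, the mutation rate, and all helper names are placeholder assumptions):

```python
import random

def genetic_algorithm(popsize, numgens, chrom_len, fitness, mutation_rate=0.01):
    """Minimal GA loop mirroring Algorithm 1: evaluate, select, crossover, mutate."""
    population = [[random.randint(0, 1) for _ in range(chrom_len)]
                  for _ in range(popsize)]
    for _ in range(numgens):
        # Selection: rank by fitness and keep the upper half as parents.
        parents = sorted(population, key=fitness, reverse=True)[:popsize // 2]
        # Crossover: single-point recombination of random parent pairs.
        children = []
        while len(children) < popsize:
            p1, p2 = random.sample(parents, 2)
            cut = random.randint(1, chrom_len - 1)
            children.append(p1[:cut] + p2[cut:])
        # Mutation: flip each bit with a small probability.
        for child in children:
            for i in range(chrom_len):
                if random.random() < mutation_rate:
                    child[i] ^= 1
        population = children
    return max(population, key=fitness)

# Toy run: maximize the number of 1-bits in a 16-bit chromosome.
best = genetic_algorithm(popsize=20, numgens=50, chrom_len=16, fitness=sum)
```

In the fuzzing framework described later, the chromosome encoding and fitness function are considerably richer, but the control flow is the same.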

The success of Genetic Algorithms is established according to Holland’s Schema

Theorem, sometimes referred to as the “Fundamental Theorem of Genetic Algo-

rithms”. Mitchell remarks that its popular interpretation states that “short, low-

order schemas”, which are groups of characteristics found in chromosomes, “whose

average fitness remains above the mean will receive exponentially increasing numbers

of samples...over time” [45]. Schemas describe a set of strings with common values at

certain positions, and represent the presence (or absence) of sub-components within a

set of chromosomes. This theorem describes the power of the crossover and mutation

operators of Genetic Algorithms to propagate good information, and undergo enough

deviation via mutation (referred to as population diversity) to guide the search of the space in

an intelligent manner [45]. Although the Genetic Algorithm is designed to calculate

fitness for entire chromosomes, the implication is that the building blocks of those

chromosomes (schemas) are being evaluated as well, in a phenomenon referred to by

Holland as “implicit parallelism” [35,45]. Mitchell clarifies that the e↵ect of selection

based on fitness leads to a gradual preference towards instances of schemas with above

average implicit fitness scores [45]. This is the basic explanation for why Genetic

Algorithms excel at optimizing certain search space problems: by managing a pool of

solutions with enough schemata (i.e., building blocks) represented, the propagation

of new chromosomes via selection and crossover (based on fitness scores) will pass on

schemas with high fitness scores. When these high-performing schema are combined

with other high-performing schema, the likelihood of happening upon an optimal

solution is increased. Finally, mutation ensures that a gene pool properly satisfies

diversity requirements necessary to explore possible solutions. In this way, Genetic

Algorithms are useful for performing intelligent test case generation for fuzzers—by

implicitly recombining groups of substrings (schema), it is possible to generate unique

solutions, even in a multimodal search space.
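The notion of a schema as a set of strings sharing fixed positions can be made concrete with a short sketch (illustrative only; `*` marks a wildcard position):

```python
def matches(schema, chromosome):
    """True if the chromosome is an instance of the schema ('*' = wildcard)."""
    return len(schema) == len(chromosome) and all(
        s == '*' or s == c for s, c in zip(schema, chromosome))

def schema_count(schema, population):
    """How many instances of a schema the population contains."""
    return sum(matches(schema, c) for c in population)

pop = ['11010', '11111', '00110', '10011']
print(schema_count('1**1*', pop))  # → 3 ('11010', '11111', '10011')
```

The Schema Theorem is a statement about how such counts grow across generations for short, low-order, above-average schemas.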

3.2.1 Genetic Algorithm Components

The basic operators of Genetic Algorithms seek to mimic biological phenomena ob-

served in natural life processes. The selection operator determines the manner in

which chromosomes are selected for reproduction operations that create the following

generation of candidate solutions [45]. Selection processes are typically informed by

fitness scores of individual chromosomes—chromosomes that have high fitness scores

are more likely to reproduce, which follows the “survival of the fittest” motif in biolog-

ical evolution [45]. For some nonstandard Genetic Algorithms such as CHC, selection

is performed in a purely random fashion [29]. However, most algorithms let fitness in-

fluence the manner in which chromosomes are selected for reproduction—a common

method (and the one used in the proof of concept for this thesis) is called elitism,

which means that the strongest chromosomes are always selected for reproduction [45].

This method ensures that the schemata found in high-fitness chromosomes live on to

future generations.
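Elitist selection as described above reduces to a few lines (an illustrative sketch, not the proof-of-concept's exact code):

```python
def elitist_select(population, fitness, n_parents):
    """Elitism: the n fittest chromosomes are always chosen for reproduction."""
    return sorted(population, key=fitness, reverse=True)[:n_parents]

pop = ['0011', '1111', '0000', '1101']
print(elitist_select(pop, fitness=lambda c: c.count('1'), n_parents=2))
# → ['1111', '1101']
```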

Once selection chooses a pair of parents, the crossover operator is the method by


(a) Single-point crossover (b) Multi-point crossover (c) Uniform crossover

Figure 3.1: Three traditional crossover methods for creating new chromosomes [3]

which new children are created for the next population [45]. The efficiency of crossover

methods is largely contingent upon the task at hand: single-point crossover, for ex-

ample, merely selects a cut point between two chromosomes, and builds two children from

the combination of one parent’s first half and the other’s second [45]. Other methods

include multi-point crossover and uniform crossover, which simply involves taking half of

the differing bits between two parents. These methods are shown in pictographic form

in Figure 3.1. Holland’s Schema Theorem supposes the power of Genetic Algorithms

comes from crossover’s way of propagating building blocks of above-average fitness

schemata to future generations [35].
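The two crossover styles just described can be sketched for bit-list chromosomes (a minimal illustration):

```python
import random

def single_point(p1, p2):
    """Single-point crossover: cut both parents at one point and swap tails."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def uniform(p1, p2):
    """Uniform crossover: each position comes from either parent at random."""
    c1, c2 = [], []
    for a, b in zip(p1, p2):
        if random.random() < 0.5:
            a, b = b, a
        c1.append(a)
        c2.append(b)
    return c1, c2

# With all-0 and all-1 parents, each child position holds one parent's bit,
# so the pair of children always partitions the parents' genetic material.
kids = single_point([0, 0, 0, 0], [1, 1, 1, 1])
```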

Finally, mutation is the process by which a given value in a chromosome is ran-

domly changed, analogous to chromosome mutations found in biological processes [45].

Mutation can involve flipping the values at certain bit positions of a chromosome, or

altering values for other types of chromosome encodings. Mitchell describes mutation as an

“insurance policy” against particular chromosome values being fixed and never being

evaluated as a candidate for change [45]. Holland posits that mutation is required to

maintain diversity across positions in a given chromosome representation.
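For a bit-string encoding, the mutation operator reduces to (a minimal sketch; the 5% default rate is an arbitrary placeholder):

```python
import random

def mutate(chromosome, rate=0.05):
    """Bit-flip mutation: each position flips independently with probability rate."""
    return [bit ^ 1 if random.random() < rate else bit for bit in chromosome]

print(mutate([0, 1, 1, 0], rate=1.0))  # → [1, 0, 0, 1] (every bit flips)
```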


3.3 CHC

CHC is a nonstandard version of the pure Genetic Algorithm developed by Eshel-

man [29]. The method was developed to counteract the main disadvantage to which

Genetic Algorithms are disposed: in multimodal search spaces, GAs will often fixate

on a local optimum and cease to continue searching. The steps of the CHC algorithm

are slightly different from those of regular Genetic Algorithms: crossover only occurs when the

difference between two selected parents is high enough [29]. The CHC algorithm re-

quires the crossover technique to be Half-Uniform Crossover (HUX), which means

that half of the di↵ering bits of two parents will be swapped during crossover. New

generations are created from the highest n chromosomes between the parent popula-

tion and the children. Over time, the chromosomes will all begin to have the same

encoding, and no more children will be created; each such childless generation lowers the crossover

threshold, and when it drops below zero, a cataclysmic mutation operator is invoked [29]. This form of mutation takes

the chromosome with the highest fitness and, using it as a template, creates new

chromosomes by mutating 35 percent of the selected chromosome’s encoded bits [29].

Pseudocode for the algorithm is shown in Algorithm 2. Because CHC has a built-in

convention by which to escape plateaus in local minima or maxima, it tends to ex-

haust more search space than traditional Genetic Algorithms. This makes it a good

candidate for finding solutions in multimodal search spaces. A distinct disadvantage of

CHC, however, lies in its tendency to spend less time in a given search area than

traditional Genetic Algorithms, leading to potentially missed solutions.
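The distinctive CHC mechanics just described, incest prevention via a Hamming-distance threshold, HUX crossover, and cataclysmic mutation, can be sketched as follows (an illustrative sketch for bit-list chromosomes; the 35% flip rate follows Eshelman's description):

```python
import random

def hamming(p1, p2):
    """Number of positions at which two chromosomes differ."""
    return sum(a != b for a, b in zip(p1, p2))

def hux(p1, p2):
    """Half-Uniform Crossover: swap exactly half of the differing positions."""
    diff = [i for i in range(len(p1)) if p1[i] != p2[i]]
    swap = set(random.sample(diff, len(diff) // 2))
    c1 = [p2[i] if i in swap else p1[i] for i in range(len(p1))]
    c2 = [p1[i] if i in swap else p2[i] for i in range(len(p2))]
    return c1, c2

def cataclysmic_mutation(best, popsize, rate=0.35):
    """Rebuild the population from the best chromosome, flipping ~35% of bits."""
    return [best[:]] + [
        [bit ^ 1 if random.random() < rate else bit for bit in best]
        for _ in range(popsize - 1)]

# Parents differing in all 8 positions: HUX swaps exactly 4 of them.
c1, c2 = hux([0] * 8, [1] * 8)
```

Crossover would only be attempted on a pair when `hamming(p1, p2)` exceeds the current threshold, which starts at a quarter of the chromosome length.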

3.4 Problem Domain

Evolutionary algorithms have been applied as an optimization strategy for a wide

variety of computing tasks.

Algorithm 2 CHC Algorithm

 1: procedure CHC(population size, numgens)
 2:     initialize population()
 3:     threshold ← L/4    ▷ L is chromosome length
 4:     while n ≠ numgens do
 5:         for i in population size/2 do
 6:             select parents p1, p2 without replacement
 7:             if Hamming(p1, p2) > threshold then
 8:                 CPOP ← HUX(p1, p2)    ▷ Half-Uniform Crossover of p1, p2
 9:             end if
10:         end for
11:         if sizeof(CPOP) == 0 then
12:             threshold ← threshold − 1
13:         else
14:             calculate fitness(CPOP)
15:             population ← best N individuals from (population + CPOP)
16:         end if
17:         if threshold < 0 then
18:             population ← cataclysmic mutation(population)
19:             threshold ← L/4
20:         end if
21:         n ← n + 1
22:     end while
23: end procedure

Holland’s seminal work demonstrates the use of Genetic Algorithms to solve

the famous Prisoner’s Dilemma [35, 45]. In the Prisoner’s

Dilemma, two individuals are detained for colluding in criminal activity, and are

held in two separate cells with no means of communication [45]. The authorities offer

each prisoner the following deal: if a confession is given, and the prisoner agrees to testify

against his partner, then the punishment doled out for the crime is

lessened [45]. However, if both parties admit to their crime and testify, the leniency

previously offered is nullified. If neither testifies against the other, they will each

receive a moderately intense jail sentence [45]. Axelrod sought to determine whether

or not Genetic Algorithms can help decide the best strategy for each individual pris-

oner (which many tournaments showed was simply “TIT FOR TAT”, or a repetition

of the choice made by the other prisoner) [45]. Given the proper conditions, Axelrod

showed that Genetic Algorithms were able to find solutions which scored higher than

“TIT FOR TAT” [45]. This demonstrates the somewhat inexplicable ability of Genetic

Algorithms to propagate building blocks of good solutions to create new ones

which humans may not consider.

An example use of Genetic Algorithms to solve engineering problems is found in

Hornby et al.’s implementation of the algorithm to automatically perform antenna

design [37]. Previously, antenna design was done manually, and consumed a great deal of

human design resources—the design of antennae requires an expert because of the vast

amount of knowledge necessary to produce quality designs [37]. In response, Hornby

et al. implemented an Evolutionary Algorithm which found novel antenna designs

that outperformed human-generated solutions, according to the voltage standing wave

ratio and gain values of frequencies [37]. Evolutionary Algorithms have been applied

to problems as varied as financial portfolio optimization, game-theoretic problems as

described in the Prisoner’s Dilemma, and even the development of walking methods

for computer figures [21, 31].

3.5 Advantages and Limitations

Genetic Algorithms are useful for optimization across a wide variety of problems and

domains. One of the main advantages of these algorithms is described in the Schema

Theorem previously discussed [35,45]. Genetic Algorithms excel at search space opti-

mization with nondeterministic solutions. Genetic Algorithms are also easy to concep-

tualize and implement, so once design decisions are established, Genetic Algorithms

are simple to incorporate into a variety of optimization schemes. The vast set of

parameters involved in tuning Genetic Algorithms is both a blessing and a curse. De

Jong remarks that oftentimes, poorly tuned parameters do not create suboptimal

results [24]. This can lead to a great deal of frustration, however, when underperfor-

mance is observed, as parameter tuning does not necessarily map deterministically to

improved results.


The main disadvantage of Evolutionary Algorithms—and, really, any class of op-

timization algorithms—is based on the “No Free Lunch” (NFL) theorem [60]. Simply

put, the NFL theorem states that there “cannot exist any algorithm for solving all

(e.g. optimization) problems that is generally (on average) superior” to any other op-

timization algorithm [24,60]. Another disadvantage concerns the fact that stochastic

processes, which are at the center of many Evolutionary Algorithms, rely on random

number generation, and the bias associated with flawed pseudorandom

number generation can lead to problems. Furthermore, many search landscapes are

multimodal, meaning more than one optimal solution exists. Evolutionary algorithms

often have trouble in multimodal search spaces [24]. However, heuristics-based mea-

sures can be taken to ensure reasonably good search performance. The complexity of

chromosome representation and Genetic Algorithm operators can become unwieldy

and ineffectual without proper constraints [45]. This research demonstrates the effects

of that consideration, as CHC underperforms because of its inefficient implementation

for the constraints of variable-length chromosomes with complex representations.

Finally, accurate and precise fitness functions can prove difficult to formulate given

the nature of many real-world problems [45].

All told, Evolutionary Algorithms are a useful optimization strategy for certain

types of search space problems, and have direct, positive results when incorporated

with the data generation aspect of fuzz testing frameworks. The remainder of this

research will concern their application to fuzz testing of web applications, with specific

focus on evolving payloads to exploit SQL injections.


Chapter 4: Evolutionary Algorithm Web Fuzzing Framework

4.1 Approach

As previously discussed, the use of Evolutionary Algorithms to intelligently reduce

the search space for fuzzing campaigns has been proven effective across a wide range

of targets [25,42,56]. Genetic Algorithms are useful for guiding the manner in which

input should be crafted; because input crafting can be modeled well as a search problem,

it is a good candidate for optimization. Sparks et al. modeled their chromosomes as

productions of a grammar which created a series of opcodes, used to uncover vulnera-

bilities in an FTP program [56]. In the web application sphere, Duchene et al. found

success revealing Cross-Site Scripting (XSS) vulnerabilities through a combination of

taint analysis and an evolutionary algorithm whose chromosome representations were

grammar productions of an “attack grammar” for XSS [26].

One of the limitations of this approach, however, is that their strategy required an

expert to manually write the “attack grammar” used to generate payloads for their

Evolutionary Algorithm [26]. This research explores techniques by which to auto-

matically derive grammars for an attack language by analyzing the lexical structures

of positive examples, and curating a set of productions which represent every string

found in the corpus. The goal is to amalgamate a group of grammar production

rules—which are grouped together based on a “fingerprinting” [30] algorithm for iden-

tifying SQL Injection examples—and use those to score fitness and/or to represent

chromosomes according to production rules.


4.1.1 Preprocessing

The purpose of the preprocessing phase of the EA fuzzing framework is to build a set

of attack grammars which encompass the lexical structure of the positive examples,

record the frequency of n-tuple groups of SQL tokens in the corpus, and to find the

frequency of transitions between n-tuple groups of tokens in the positive examples.

Analysis of positive examples of SQL injections allows the set of attack grammars

available to our Genetic Algorithm to be constructed. First, positive examples are

procured: the sample corpus for this set of experiments comes from Søen’s “Forced

Evolution” database, and from Click Security’s “Data Hacking” repository [46, 55].

The elements of the corpus were chosen in order to cover a wide range of different

SQL injection attacks, including boolean-based and UNION-based [10]. Boolean-based

SQL injections attempt to insert a boolean statement into a SQL query that will

always evaluate to true, thereby returning (exfiltrating) data from SQL queries that

should not be returned. UNION-based SQL injections, on the other hand, attempt to

match the output structure of a given SQL query to exfiltrate data from other tables,

server-specific values, or other sensitive information. Galbreath’s training set of SQL

injections was used in early stages of the project, but was not used for the experi-

ments outlined in this document [30]. Although the set of examples from Galbreath’s

libinjection library were high quality, they favored UNION-based attacks too heavily

for our purposes, and represented more fingerprints for which the framework could

perform fitness calculations than could be used in a reasonable amount of time.
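The two attack shapes in the corpus can be illustrated with generic textbook payloads (these examples are mine, not drawn from the thesis corpus):

```python
# Boolean-based: the injected predicate always evaluates to true, so the
# query returns rows the application never intended to expose.
boolean_payload = "' OR 1=1 --"

# UNION-based: the injected SELECT must match the column count of the
# original query in order to exfiltrate data from another table.
union_payload = "' UNION SELECT username, password FROM users --"

# Both assume injection into something like:
#   SELECT * FROM items WHERE name = '<user input>'
```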

Once the corpus has been curated, the preprocessing stage kicks off by lexing each

positive example into its SQL token representation. This research uses sqlparse, a non-

validating SQL lexer/parser [16]. Instead of writing a parser for our purposes, sqlparse

was chosen because it does not require a valid SQL string, and has robust tokenization

capabilities. The first quality mentioned is especially important because our positive


examples are merely fragments of SQL statements that represent malicious intent.

Tokens are arranged into one-, two-, or three-tuple groups, and assigned a production

rule in the attack grammar according to their position within the original positive

example (a figure explaining this process is shown below). The frequencies of the n-

tuple groups are recorded, and used for fitness metrics. In addition, the frequency

of transitions between n-tuple groups is recorded for use by the fitness function

as well as the Markov Model Monte Carlo algorithm. The information used by the

Evolutionary Algorithms tested in this research is grouped into one-, two-, and three-

tuples of SQL tokens. A SQL token merely represents the symbolic value of a literal string

according to the SQL language specification. The reason three-tuples were chosen as

the maximum group of lexical tokens for a given terminal is based on research by

Mike Sconzo and Brian Wylie, whose work on data science for security was shown in

proceedings at Shmoocon in 2014 [46]. They demonstrate that 3-gram groupings of

SQL tokens carry enough information to determine whether or not a given string has

malicious intent [46]. Although they were approaching the problem of SQL injection

detection, the idea pertains to fuzzing as well: instead of relying on a human to write

an attack grammar based on known types of injections, the approach of this paper

requires tokenization, and then a grouping of these tokens based on their position.

The idea is that the Genetic Algorithm will be able to move these n-tuple groups

in different orders (via crossover) while still preserving attack grammar information.

A visual representation of an extract of the production tree is in Figure 4.2. After

the positive examples have been broken down into their semantic tokens, n-tuple

groupings, and n-tuple transition densities, the attack grammars that represent the

corpus are constructed. A visual manifestation of this process is shown in Figure 4.1.
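The n-tuple grouping and transition recording described above can be sketched as follows (a toy sketch; the token stream is assumed to come from a lexer such as sqlparse, but the token names shown are illustrative rather than sqlparse's exact ttype values):

```python
from collections import Counter

def ngram_groups(tokens, n):
    """All contiguous n-tuple groups of SQL tokens in one positive example."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def transition_counts(groups):
    """Frequencies of transitions between consecutive n-tuple groups, as used
    by the fitness function and the Markov model."""
    return Counter(zip(groups, groups[1:]))

# Token stream for a fragment like "' OR 1=1" (illustrative token names).
tokens = ["SINGLE", "KEYWORD", "INT", "COMPARISON", "INT"]
pairs = ngram_groups(tokens, 2)
# → [('SINGLE', 'KEYWORD'), ('KEYWORD', 'INT'),
#    ('INT', 'COMPARISON'), ('COMPARISON', 'INT')]
```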


[Flow graph: corpus of positive examples → lex and tokenize positive example → build n-tuples table → record Markov transitions → apply fingerprint → construct/update grammars]

Figure 4.1: Flow graph of preprocessing stage


4.1.2 Attack Grammars

The final phase of preprocessing involves creating a set of grammars whose produc-

tions end in terminals represented as n-tuple groups of SQL tokens. This forms

the basis of the proposed method’s chromosome representation and fitness function

calculations. The attack grammars are separated based on the “fingerprint” value

of the positive example, as calculated by Galbreath’s libinjection software [30]. A

fingerprint is calculated by approximating the type of SQL injection attack based on

the tokens that are present in a given string [30]. This ensures that exploit strings

that have similar structural components are grouped together, and their grammar

productions are grouped accordingly.
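A toy version of this fingerprint-based grouping is sketched below (libinjection's actual encoding is more involved; the token-to-character mapping here is invented for illustration):

```python
# Map each SQL token type to one character and concatenate, so injections
# with the same token structure land in the same grammar bucket.
TOKEN_CHAR = {"SINGLE": "s", "KEYWORD": "k", "INT": "1",
              "COMPARISON": "o", "NAME": "n", "ERROR": "E"}

def fingerprint(tokens):
    return "".join(TOKEN_CHAR.get(t, "?") for t in tokens)

grammars = {}
examples = [["SINGLE", "KEYWORD", "INT", "COMPARISON", "INT"],
            ["SINGLE", "KEYWORD", "INT", "COMPARISON", "INT"]]
for toks in examples:
    grammars.setdefault(fingerprint(toks), []).append(toks)
# Both examples share fingerprint "sk1o1", so their productions would be
# merged into the same attack grammar.
```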

More formally, the algorithm derives a set of grammars that represent our corpus

of positive examples:

Attack G = {G0, G1, . . . , Gn−1},    (4.1)

where each grammar G contains production rules that generate strings that have the

same fingerprint, which classifies them according to the semantic structure of one

or more positive examples. Formally, each grammar is a 4-tuple:

Gi = {V, Σ, R, S}    (4.2)

V is a finite set of non-terminals (variables). In this research’s proof of concept im-

plementation, the variables correspond to positional indices where groups of n-tuples

are represented. For a given index of a fingerprint, multiple n-tuples are potential

productions for the grammar. Σ is the set of terminals, which are the actual components

that comprise a valid string of the language described by the grammar. The ter-

minals in this implementation are one-, two-, or three-tuples of SQL tokens. S is the

start variable, and R is the set of production rules from S that derive terminals [54].

The production rules are purposefully crude in order to limit the time spent deriving


[Parse tree: a start node branches to fingerprint nonterminals fp0, fp1, . . . , fpn−1; each fingerprint expands to positional indices 0fp0, 1fp0, 2fp0, . . . , (m−1)fp0, which derive n-tuples of SQL tokens such as (SINGLE, DDL, PUNCT), (INT, COMPARISON, INT), and (DML, ERROR)]

Figure 4.2: Example extract of parse tree derived from positive examples of SQL injection tokens

strings of a given grammar, and for use in exploring the efficacy of using sets of simple

grammars to approximate an attack language. Each grammar can be described as

follows:

Gfp = { Gfp(s) | s ∈ L(Gfp) and Gfp accepts s }    (4.3)

The set of grammars does not seek to accurately encompass the SQL language specification—instead, the goal is to approximate the structures of positive examples well enough to codify the semantic components of an attack language—in the language of genetics, the phenotypical information available.

Value:                     ' x OR 1 = 1'
Fingerprint:               sn&10
SQL Token Representation:  Error Name Keyword Integer Comparison Integer Single
Grammar Productions:       S → 0 1 2;  0 → Error Name Keyword;
                           1 → Integer Comparison Integer;  2 → Single

Table 4.1: An example preprocessing of a positive example
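The preprocessing illustrated in Table 4.1 can be sketched in Python. This is a minimal illustration rather than the thesis's implementation: the toy fingerprint and the fixed three-token chunking stand in for libinjection's fingerprinting and the real production-extraction logic.

```python
from collections import defaultdict

def build_grammars(tokenized_examples):
    """Map fingerprint -> positional index -> set of token n-tuples.
    Each positional index of a fingerprint acts as a non-terminal whose
    productions are the n-tuples observed at that position."""
    grammars = defaultdict(lambda: defaultdict(set))
    for tokens in tokenized_examples:
        # Toy fingerprint: first letter of each token class (a stand-in
        # for libinjection's real fingerprinting).
        fingerprint = "".join(t[0].lower() for t in tokens)
        # Crude chunking into up-to-3-token productions, one per index.
        chunks = [tuple(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]
        for index, chunk in enumerate(chunks):
            grammars[fingerprint][index].add(chunk)
    return grammars

# The positive example from Table 4.1, already lexed into token classes.
example = ["Error", "Name", "Keyword", "Integer", "Comparison", "Integer", "Single"]
grammars = build_grammars([example])
fingerprint = next(iter(grammars))
# Index 0 -> (Error, Name, Keyword); 1 -> (Integer, Comparison, Integer); 2 -> (Single,)
```

Feeding more positive examples into `build_grammars` accumulates alternative productions at each index, which is exactly the "multiple n-tuples are potential productions" property described above.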

At the cost of more refined expression of a given “Attack Grammar”, such as those

found in Duchene et al., this technique aims to collect various permutations of lexical


symbols which represent “known bad” injection attempts (i.e., the semantic struc-

ture of examples of SQL injections) and place them in their positional context [27].

In use with the fuzzing framework, chromosomes are modeled as productions for a given fingerprint, producing a group of one, two, or three sequential tokens. This

design decision ensures that the Genetic Algorithm searches the input space with

genotypic components (tuples of SQL tokens derived from positive examples) that

are well enough preserved for focused searching. The heart of this research involves exploring whether a precise attack grammar is required, or whether it is sufficient to encode shallow productions of grammars, grouped by a common lexical structure, and allow the Algorithm to recombine productions of different fingerprints to produce new exploit strings. These new exploit strings will sometimes have no representative fingerprint, and other times will conform to ones available in libinjection. The results demonstrate that there is value in this approach, especially since it is completely

automated. In theory, provided a lexer for a given target language is available, and

a method for codifying similar examples into fingerprints, it is possible to use this

framework for any type of fuzz testing campaign. Future research will explore us-

ing this framework to find Cross-Site Scripting (XSS) vulnerabilities and memory

corruption vulnerabilities in local binaries.

4.1.3 Fitness Evaluation

For this technique, the fitness function is a combination of three characteristics of a

given candidate solution. First, if a given chromosome successfully achieves a SQL

injection, it is heavily promoted within the population. Relatedly, fitness scores of chromosomes which result in an invalid SQL statement are suppressed. This raises the question of the proof of concept's tenability against real-world systems.

While most web fuzzing campaigns are purely black box, it is not unreasonable to an-


alyze input forms and determine the type of SQL statement executed. Furthermore,

the current implementation scores this condition very weakly, to the point where it

could be removed without lasting effect. Third, the chromosome in question is scored based

on how well it conforms to the attack grammars built by positive examples. The

tokens of the chromosome in question are compared against the terminals of each

grammar at the corresponding positional indices. Instead of only denoting if a chro-

mosome is accepted or rejected by a grammar, if a given chromosome’s token groups

match a successive sequence of a grammar’s token groups, the fitness score is com-

pounded exponentially. In short, a chromosome that matches a contiguous grouping

of tokens fits a high portion of a grammar's potential terminals, and is exponentially

promoted within the population (i.e., a given chromosome does a good job of ap-

proximately representing an attack language). This idea can be summarized in the

formula:

\sum_{i=0}^{n-1} \sum_{j=0}^{m-1} (x_j == G_{i,j}) \cdot k^2 \quad (4.4)

where n represents the total number of fingerprints, m represents the number of positional token groups in the symbol representation of the chromosome, and k represents the number

of sequential matches found. k is reset to 0 in the event that a mismatch is found. This

formula, compounded with the other two metrics, comprises the fitness calculation

used for the experiments of this research.
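The grammar-conformance term of Equation 4.4 can be sketched as follows; since the interaction of the match indicator with k is not fully spelled out, the streak counter here (each consecutive match extends k and contributes k squared) is one plausible reading of "compounded exponentially".

```python
def grammar_fitness(chromosome, grammars):
    """Contiguous-match term of Eq. 4.4: for each grammar (one per
    fingerprint), walk the positional token groups; a match extends the
    streak counter k and adds k**2, a mismatch resets k to 0."""
    score = 0
    for grammar in grammars:                 # outer sum over n fingerprints
        k = 0
        for j, group in enumerate(grammar):  # inner sum over m positions
            if j < len(chromosome) and chromosome[j] == group:
                k += 1
                score += k ** 2              # reward contiguous conformance
            else:
                k = 0
    return score

grammar = [[("Error", "Name"), ("Integer", "Comparison"), ("Single",)]]
full = grammar_fitness([("Error", "Name"), ("Integer", "Comparison"), ("Single",)], grammar)
broken = grammar_fitness([("Error", "Name"), ("Keyword",), ("Single",)], grammar)
# A fully contiguous match scores 1 + 4 + 9 = 14; the broken one only 1 + 1 = 2.
```

The quadratic streak bonus is what promotes chromosomes that approximate a whole attack-language structure over ones with the same number of scattered matches.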

4.1.4 Niche-penalty Heuristics-based Genetic Algorithm

The preprocessing previously discussed is used as heuristic information for intelligently guiding input generation for a web application fuzzer. To that end, this thesis proposes a niche-penalty, heuristics-based Genetic Algorithm. The chromosomes of this approach are represented as sequences of 2-


tuples; each 2-tuple holds an attack grammar (fingerprint) and a positional index by which to select an n-tuple of SQL tokens. The selection method chosen for this

approach is pure elitism, which probabilistically selects parents based on probability

densities informed by fitness scores. Three crossover methods are evaluated (single-

point, two-point, and uniform), and compared against each other. The results show uniform crossover to be the most effective, which follows considering that the individual elements of a chromosome codify a great deal of information (up to three tokens). A mutation rate of 0.1 was chosen in order to provide the algorithm with

enough chaos to avoid reaching a plateau too quickly, while still remaining within

conventional limits. Therefore, the results of this algorithm using uniform crossover

and a 0.1 mutation probability serve as the representative candidate in the comparative analysis.

The tendency for Genetic Algorithms to plateau on local optima is especially

concerning for the search space in question—valid injection strings create a multi-

modal search landscape, so it is vital to encode the Genetic Algorithm with the tools

necessary to reinitialize its search direction. Based on trial and error, this proof of

concept measures the number of times in which the mean fitness for the population

demonstrates a downward trend, and after an accumulation of enough “strikes”, the

population is reinstantiated via the grammars created during preprocessing. This behavior is visible in Figure 5.2: the mean fitness per population takes a sudden dive at

distinct intervals. These events occur when the algorithm determines that a niche has

caused the Genetic Algorithm to plateau on a given value (or set of similar values).

Figure 4.3 shows a top-level flowchart of this method.
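The strike-counting restart heuristic can be sketched as follows; the strike threshold of three and the simple previous-generation comparison are assumptions, since the thesis tuned these heuristics by trial and error.

```python
def restart_points(mean_fitness_per_gen, max_strikes=3):
    """Generations at which the niche-penalty heuristic would reinstantiate
    the population from the attack grammars: a 'strike' is a generation
    whose mean fitness drops below the previous generation's; accumulating
    max_strikes strikes triggers a restart and resets the counter.
    (max_strikes=3 is an assumed setting.)"""
    restarts, strikes = [], 0
    for gen in range(1, len(mean_fitness_per_gen)):
        if mean_fitness_per_gen[gen] < mean_fitness_per_gen[gen - 1]:
            strikes += 1
        if strikes >= max_strikes:
            restarts.append(gen)  # population rebuilt from the grammars here
            strikes = 0
    return restarts

history = [5, 6, 4, 7, 3, 2, 1, 8]   # toy mean-fitness trajectory
restarts = restart_points(history)   # drops at generations 2, 4, 5 -> restart at 5
```

In the real framework the restart would rebuild each chromosome from random grammar productions, which is what produces the sudden dives in mean fitness visible in Figure 5.2.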

4.1.5 CHC

In theory, CHC is a perfect candidate for use in guiding test case generation for

fuzzing frameworks because it has the same structured operators as standard Ge-


netic Algorithms, but also includes a built-in mechanism by which to escape search

space plateaus. Seagle used CHC with tremendous success guiding test case gen-

eration for file format fuzzing [53]. The common implementation of CHC assumes

each chromosome uses binary encoding with fixed-length chromosomes [29, 53]. This

creates a natural situation in which a population will plateau on a given value, and

stop producing new children (referred to as “incest penalty” in the literature) [29].

The proof of concept for this research uses variable-length, multi-value encodings for chromosomes, which presents a problem when implementing CHC. The limitation is an encoding disconnect: two productions of different fingerprints can

produce the same n-tuple of SQL tokens. This occurs because the productions of a

grammar generate one, two, or three-tuples of SQL tokens, and there can be matches

of these tokens at certain positions across various grammars. Previous experiments

with CHC using conventional methods on chromosomes with grammar production

representation showed that cataclysmic mutation rarely occurred, despite a plateau

being reached.

The workaround for this involves the following steps. First, in order to accommodate Hamming distance metrics on variable-length chromosomes, the Hamming distance is calculated over the tokens of the minimum length (e.g., if parent one has 10 tokens and parent two has 15, only the first 10 tokens are compared). This is subtracted from the Hamming distance calculated between the production encodings of the two chromosomes. The production encodings are considered

the same if they represent the same positional index, and if the two fingerprints share

similar encoded components. This is measured by calculating the set intersection

between the individual characters of the fingerprints. Lastly, the threshold is given a

fixed value, and a “countdown” variable records the number of generations where no

children were generated. Once the countdown is less than zero, cataclysmic mutation


is executed. Although the modifications to CHC deviate from the algorithm’s con-

ventional conditions, the implementation supported by this research was able to find

valid exploit strings against the vulnerable service.
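The distance part of the workaround can be sketched as follows. The toy fingerprints and the 0.5 character-overlap threshold are assumptions; the thesis specifies only that fingerprint similarity is measured by the set intersection of their characters.

```python
def token_hamming(a, b):
    """Hamming distance over the first min(len(a), len(b)) tokens of two
    variable-length chromosomes."""
    return sum(x != y for x, y in zip(a, b))

def productions_match(p, q, overlap=0.5):
    """Two (fingerprint, index) production genes count as equal when they
    share a positional index and their fingerprints' character sets overlap
    enough (the 0.5 ratio is an assumed threshold)."""
    (fp_a, idx_a), (fp_b, idx_b) = p, q
    if idx_a != idx_b:
        return False
    union = set(fp_a) | set(fp_b)
    return len(set(fp_a) & set(fp_b)) / max(len(union), 1) >= overlap

def incest_distance(tokens_a, tokens_b, prods_a, prods_b):
    """Modified metric: the token-level Hamming distance is subtracted from
    the Hamming distance between production encodings."""
    prod_dist = sum(not productions_match(p, q)
                    for p, q in zip(prods_a, prods_b))
    return prod_dist - token_hamming(tokens_a, tokens_b)

# Identical token output from two unrelated production encodings: positive
# production distance, zero token distance.
d = incest_distance(["Single", "Name"], ["Single", "Name"],
                    [("sn&1", 0)], [("1&xo", 1)])
```

The framework would then compare such a distance against the fixed threshold and decrement the countdown in generations where no children survive, triggering cataclysmic mutation when the countdown runs out.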

Figure 4.3: A flowchart of the heuristics-based Evolutionary Algorithm fuzzing framework proof-of-concept

4.2 Advantages and Limitations

The chief advantage of using Evolutionary Algorithms with grammar-based heuristics for both chromosome representation and fitness calculation is that the approach provides

the fuzzer with guidance for input generation in a situation that has a very limited

feedback loop. Instead of relying on the search space covered by a manually created

corpus, this algorithm allows for intelligent searching of new exploit strings based on

the fundamental components of known malicious test cases. Another advantage not

explored by this research is the ease with which this approach can be parallelized. A


distributed system would remove resource constraints of the current implementation,

allowing for much larger population sizes, and more refined fitness calculations.

The most significant limitation of the proof of concept implementation is the run-

ning time. Because of the expensive fitness calculation, and the network-bound limits

imposed by the input execution process between a program and a web service, the

algorithm takes a significant amount of time to complete execution. While the latter

limit can be minimized by replicating the target in a virtual machine and testing

locally, the former is unavoidable—in order to assess how well a given chromosome

“fits” the approximated attack language, it must be compared against the produc-

tions represented by the grammars. A significant portion of this research involves determining the heuristics for recognizing a plateau state reached by CHC or

the Genetic Algorithm. The current manifestation of those heuristics could be con-

sidered too crude—an area of open research would involve refining those heuristics

based on each particular System Under Test (SUT).


Chapter 5: Experimental Results

5.1 Testing Environment

In order to assess the effectiveness of an EA-based web application fuzzer, whose

fitness metrics and chromosome representation center on positive examples of known

SQL injection attempts, an intentionally vulnerable web application was instantiated

for the purpose of testing. Trustwave’s SpiderLabs security research group created

a testbed of vulnerable web applications called the “Magical Code Injection Rain-

bow” [15]. For these experiments, the process of crawling an application seeking a

potentially vulnerable input vector is bypassed to focus on the efficacy of the algorithms for evolving payloads. That said, Duchene et al. proved the usefulness of

model-taint analysis for guiding fitness metrics, and it will be an actionable goal for

future work [26]. All the results were derived from tests run on a MacBook Air, in

a closed loop against MCIR’s “SQLol” vulnerable testbed, running as a service pro-

vided in OWASP’s “Broken Web Application” Virtual Machine [7]. This testbed was

chosen because of the simplicity of modeling different types of vulnerabilities related

to SQL injections, and for its highly configurable parameters [15]. The front page of

the vulnerable service in discussion is shown in Figure 5.1.

5.2 Benchmark Simulation

5.2.1 Random

The lower baseline used for comparing the efficiency of the niche-penalty heuristic-based Genetic Algorithm and CHC is a simulation where each token of a chromosome is randomly chosen. The chromosomes in pure random searching are modeled solely


Figure 5.1: The front page of the testbed used for measuring the effectiveness of niche-penalty GA-based web fuzzing [15]

on the SQL token representation. SQL tokens produce values by randomly selecting

from a choice of values for that token (based on preprocessing of positive examples).

Each chromosome is subject to the same fitness metrics as the other simulation types.

This benchmark is necessary to ensure that the intelligent search-space algorithms are not vastly underperforming, and that the grammatical information used in fitness calculation reasonably outperforms chaotic token selection.

5.2.2 Markov Model Monte Carlo

The final comparative method for web application SQL fuzzing is based on a Markov

Model Monte Carlo implementation of population building. Chromosomes of vari-

able length are instantiated according to a Markov-transition lookup table generated

during preprocessing: the n-tuple transitions of tokens in the positive examples are

weighted according to frequency. The steps of chromosome instantiation are as fol-

lows. First, a given transition from

SQL\ Tuple_A \rightarrow SQL\ Tuple_B \quad (5.1)


is selected in pure random fashion. The following transitions from:

SQL\ Tuple_B \rightarrow SQL\ Tuple_C, \ldots, SQL\ Tuple_{n-2} \rightarrow SQL\ Tuple_{n-1} \quad (5.2)

are selected according to Monte Carlo probabilistic selection based on densities of the

transition frequencies. Chromosome instantiation terminates when a given tuple has

no more transitions, or the maximum chromosome length for the trial run is reached.

The value representations of chromosomes are chosen the same way in which the GA

and CHC implementations derive values—each SQL token has a set of values with

frequencies. For each token, a weighted random choice selects a value. This represents

the approach to input generation that is chiefly concerned with generating chromo-

somes that represent popular transitions between n-tuples from the corpus. The idea

is that by conjoining a sequence of well-represented transitions, the probability of

finding a valid SQL injection is increased. The results demonstrate that this is a well-

reasoned observation, as it outperforms the proof of concept CHC implementation by

a wide margin.
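The instantiation steps above can be sketched in Python; the transition table shown is a toy stand-in for the frequency tables built during preprocessing.

```python
import random

def build_chromosome(transitions, max_len=10, rng=random):
    """Markov Model Monte Carlo instantiation: `transitions` maps a token
    n-tuple to a {next_tuple: frequency} table. The first tuple is chosen
    uniformly at random; each subsequent tuple is a frequency-weighted
    choice; building stops when a tuple has no outgoing transitions or
    max_len tuples have been chained."""
    current = rng.choice(sorted(transitions))   # step 1: pure random pick
    chromosome = [current]
    while len(chromosome) < max_len and transitions.get(current):
        nxt = sorted(transitions[current])
        weights = [transitions[current][t] for t in nxt]
        current = rng.choices(nxt, weights=weights)[0]  # step 2: Monte Carlo
        chromosome.append(current)
    return chromosome

# Toy transition table derived from tokenized positive examples.
table = {
    ("Single",): {("Name", "Keyword"): 2},
    ("Name", "Keyword"): {("Integer", "Comparison", "Integer"): 1},
}
chrom = build_chromosome(table, rng=random.Random(0))
# Every walk ends at the tuple with no outgoing transitions.
```

Each token in the resulting chromosome would then be realized as a literal value via the same weighted random choice over observed values used by the GA and CHC implementations.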

5.3 Evaluation Metrics

The heuristics-based GA and CHC are compared to Random simulation, Markov

Model Monte Carlo, and a corpus-based simulation that simply sends each positive

example to the test bed (a baseline measuring stick, of sorts). Each experiment

contains 20 runs of a given simulation and parameter setting, using the same 20 seed

values across each di↵erent experiment. The population for each simulation type

includes 20 chromosomes, and each trial run executes the corresponding simulation

type for 200 generations. The number of trials per simulation type is selected in

order to reduce the variability introduced by the proof of concept's use of random selection to generate literal values from SQL tokens. The heuristics-based GA, CHC, and


Markov Model Monte Carlo methods all operate with SQL tokens instead of raw

values, so variability regarding the manifestation of a given test case leads to some

inconsistencies. Because the notion of code coverage is untenable given the scenario

environment, each algorithm is evaluated according to two metrics:

1. the number of exploits found during the trial run

2. the average fitness per generation

Fitness is calculated for each chromosome the same way across each simulation type.

Different crossover types (single-point, two-point, and uniform) for the heuristics

based GA were evaluated, and Half Uniform Crossover was used for the CHC im-

plementation. It is determined that an exploit string is found when the contents of

the database table in question are dumped. This requires knowledge that would not

be available in a black box setting. However, it could be simulated—if a certain number of rows is returned, and it is significantly different from the contents of normal queries, one can ascertain that an injection string was found.
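The simulated black-box oracle suggested here could be sketched as follows; the row-count threshold factor is an assumption.

```python
def looks_like_dump(response_row_count, baseline_row_counts, factor=5):
    """Black-box approximation of the exploit oracle: flag a response as a
    probable injection when it returns far more rows than any normal query
    observed so far (factor=5 is an assumed threshold)."""
    baseline = max(baseline_row_counts, default=0)
    return response_row_count > factor * max(baseline, 1)

# Normal queries against the testbed return a handful of rows; a dump of
# the whole table returns hundreds.
dumped = looks_like_dump(500, [1, 3, 2])
normal = looks_like_dump(4, [1, 3, 2])
```

Such a heuristic would let the fitness function award the "successful injection" bonus without the white-box knowledge used in these experiments.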

5.4 Results and Analysis

5.4.1 Fitness and Diversity

Figure 5.2 shows the average fitness of each simulation type across the 20 trial runs

of 200 generations. Each line corresponds to the mean fitness per generation for each

simulation type. The CHC method had the highest overall fitness score, followed by

the heuristics-based Genetic Algorithm, Markov Model Monte Carlo, and Random.

The flat line in the middle of the graph is the fitness score of the corpus of positive

examples, which is the fitness score of sending the positive examples as input to the

fuzzing framework. The high mean fitness score of CHC can be attributed to the cat-

aclysmic mutation operator, which instantiates an entire population based on a copy


of the highest performing chromosome, whereas the heuristic-based GA reinstantiates

the population by building chromosomes based on attack grammar productions. The

fitness scores of the simulation types across the generations indicate that the EA-based methods have a higher chance of finding valid exploit strings. The observations

indicate that this is only partially true—Markov Model Monte Carlo proved to be

a very consistent method for finding valid exploit strings, yet does not demonstrate

very high mean fitness scores. This indicates that the fitness calculations need to

be more refined in order to accurately assess the true fitness of a chromosome. This

will be a primary subject of future work, as the efficacy of the EA-based algorithms is directly influenced by the fitness function's validity. The diversity of symbol

Figure 5.2: Mean fitness per generation

and value representations per population for both CHC and heuristics based GA are

shown in Figure 5.3, and individually for Random and Markov Model Monte Carlo


Figure 5.3: Median diversity of value and symbol representations per generation for GA and CHC


(a) Median Diversity of Value and Symbols per Population for Random Simulation

(b) Median Diversity of Value and Symbols per Population for Markov Model Monte Carlo

Figure 5.4: Median diversity of value and symbol representations per generation for Random and Markov Model Monte Carlo


(a) Total number of unique exploits found in 3 experimental trials (20 runs per trial)

(b) Average number of exploits per trial simulation

Figure 5.5: Total unique exploits per simulation and average number of exploits per trial

process in Figure 5.4. Diversity was measured for each generation using Python’s

difflib module, on both the value representation (actual payload generated) and sym-

bol representation (the SQL tokens that represent a payload). The smoothing trend

shown in Figure 5.3 demonstrates the phenomenon of the algorithm focusing in on

a particular solution for generating exploit strings. The patterns of fluctuation are caused by the restart heuristics for the GA and CHC methods—in order to escape

plateaus, a necessary condition for searching multimodal spaces, conventions for re-

instantiating populations are required. Figure 5.4 demonstrates the fluctuation in

diversity when search space guidance is not in place.
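The per-generation diversity measurement can be sketched with difflib, which the proof of concept uses; the median-of-pairwise-dissimilarities aggregation is an assumption, as the exact aggregation is not specified.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import median

def population_diversity(population):
    """Median pairwise dissimilarity of a population, measured with
    difflib's SequenceMatcher. Works on value representations (payload
    strings) and symbol representations (SQL token sequences) alike."""
    ratios = [1 - SequenceMatcher(None, a, b).ratio()
              for a, b in combinations(population, 2)]
    return median(ratios) if ratios else 0.0

converged = population_diversity(["' OR 1=1", "' OR 1=1", "' OR 1=1"])
diverse = population_diversity(["' OR 1=1", "admin'--", "1; DROP TABLE users"])
# A fully converged population has zero diversity; a varied one does not.
```

Tracking this value per generation yields the smoothing and fluctuation patterns shown in Figures 5.3 and 5.4.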

5.4.2 Exploits Found

The Genetic Algorithm with niche-penalty heuristics is shown to produce a high number of unique exploit strings for a given experiment of 20 runs. Across

each simulation, no trials generated an exploit string that was identical to one found in

the original corpus of positive examples. Figure 5.5 shows the total number of exploits

found per simulation type across the trials, as well as the average number of exploits

found per trial for each simulation. Although the GA method produced the most


Figure 5.6: Average number of exploits found in each generation

valid exploit strings for our testbed, the Markov Model Monte Carlo algorithm for

producing payloads was more consistent—every single trial run of the Markov Model

Monte Carlo simulation found a valid exploit string, whereas the other methods each

had at least one trial finish unsuccessfully. Figure 5.6 shows the average number of

exploit strings found per generation, which is an average of all the trials for a given

simulation. It is clear that both CHC and the heuristic-based GA, when happening upon a valid injection string, do a good job of zeroing in on the structure of that chromosome, producing a high number of exploit strings in successive generations. The dearth in

exploits found between pockets of high success in CHC demonstrates the lack of

consistency with which the algorithm determines cataclysmic mutation. This may be

insurmountable given the conditions of variable length chromosomes which encode

production rules. While CHC found more exploit payloads than the random method


trials, the CHC trials produced 20 total duplicate strings from previous generations,

likely due to the fact that our modified Hamming distance metric and threshold

countdown were too sensitive, allowing for high-performing strings to live to the next

generation.

Simulation Type    | Population Initializer | Chromosome Representation | Crossover Method | Total Unique Exploits
GA                 | Grammar                | Production                | uniform          | 272
MMMC               | NA                     | Symbol                    | NA               | 157
CHC                | Grammar                | Production                | NA               | 70
Positive Examples  | NA                     | NA                        | NA               | 39
Random             | Random                 | Symbol                    | NA               | 6

Table 5.1: Total exploits per simulation

Simulation Type | Population Initializer | Chromosome Representation | Crossover Method | Trials Without Exploit
GA              | Grammar                | Production                | uniform          | 2
MMMC            | NA                     | NA                        | NA               | 0
CHC             | Grammar                | Production                | NA               | 17
Random          | Random                 | Symbol                    | NA               | 16

Table 5.2: Number of trials per simulation type without an exploit

Simulation Type | Population Initializer | Chromosome Representation | Crossover Method | Max Exploits Found
GA              | Grammar                | Production                | uniform          | 184
MMMC            | NA                     | Symbol                    | NA               | 13
CHC             | Grammar                | Production                | NA               | 27
Random          | Random                 | Symbol                    | NA               | 2

Table 5.3: Highest number of exploit strings found throughout a singular trial

For the GA and CHC algorithms, the average number of exploits per trial is skewed—a small number of trial cases account for most of the exploit strings found

in a given experiment, and many of the runs did not find any at all. This has less


impact upon the results for the heuristics-based niche-penalty GA, as most of the

seeded trial runs found at least one unique exploit string. Markov Model Monte

Carlo was consistent, as every run found at least one valid injection string. Table 5.1

shows the total number of unique exploits found across all trials for a given simulation

type. As previously mentioned, the experimental simulations did not generate any

valid exploit strings identical to one found in the original corpus of positive examples.

The number of trials with zero exploits found per simulation type is summarized in Table 5.2. A reasonable objection regarding the efficacy of the niche-penalty GA method can be raised—such a high-performing outlier skews the assessment of the overall effectiveness of the algorithm. When the high-performing trial is removed,

the simulation performs no better than Markov Model Monte Carlo. While this is a

reasonable claim, the fact that the proposed GA method zeroes in so well upon a set

of semantically similar yet unique exploit strings demonstrates the potential of this

method for efficient application-level fuzzing, and points to the necessity of developing

this method further.


Chapter 6: Conclusion and Future Work

The research presented in this thesis demonstrates the effectiveness of fuzzing

frameworks guided by Evolutionary Algorithms, and the improvements to be gained

by using fitness metrics and chromosome representations modeled after the structure

of positive injection examples. As opposed to manually writing “attack grammars” for

a given language input class, generating a set of shallow grammatical representations

of known nefarious injection strings is shown to improve the Evolutionary Algorithm’s

search for strings, despite the multimodal search space of valid exploit strings.

Although the heuristics based, niche-penalty Genetic Algorithm found the most

valid SQL injections, it had some inconsistencies—for some of the trials, no exploit

strings were found despite a vulnerability being present. This indicates that fitness

function metrics should be revisited, or the framework itself should rely upon a more

informative feedback loop. It is possible that the reinitialization heuristics were too simplistic and/or too sensitive—this can create a situation in which the GA does not have enough time to explore a subspace that contains a valid injection. This point is further evidenced by the fact that, although CHC had the highest mean fitness scores per generation, it had the lower efficacy of the two evaluated evolutionary algorithms.

Another key limitation of the current proof of concept code is run time—the av-

erage execution time is long enough to raise questions of production-level quality.

A distributed system for calculating chromosome fitness, sending input, and monitoring results would provide a significant performance speedup.

Theoretically, this framework can be extended to any sort of search space with pos-

itive examples that can be tokenized into lexical groups. Therefore, an open area

of research extending outward from this work would involve ensuring extensibility to


web-based attacks such as Cross-Site Scripting (XSS). Other target scenarios, such

as file format parsers and language interpreters, are also fertile areas for research and

testing.

The most important next step is to adapt this framework for testing web applications

for Cross-Site Scripting (XSS) vulnerabilities—they are much more common today,

and the feedback loop allows for higher quality fuzzing heuristics and fitness function

metrics. Further research will also make use of modern in-memory fuzzing methods

and GA heuristics based on execution paths in conjunction with our grammar-based

function methods.

In summary, the use of grammar-related heuristics in Evolutionary Algorithms

to intelligently guide payload generation for application level fuzzers is shown to

produce unique exploit strings that are based on the lexical structure of positive

examples. These results confirm that using semantic level information for encoding

chromosomes has the desired effect of propagating injection information while still

searching multimodal spaces with refined information at hand. The further refinement

of fitness heuristics will be the next step in the maturation of this process, as well as

further testing with di↵erent target languages and systems.


Bibliography

[1] Beautiful soup html/xml parsing library for python. https://www.crummy.com/

software/BeautifulSoup/. Accessed: 2014/12/09.

[2] Burp suite web intercept proxy. https://portswigger.net/burp/. Accessed:

2016/04/21.

[3] Creationwiki genetic algorithm. http://creationwiki.org/Genetic_

algorithm. Accessed: 2016/03/15.

[4] Cross-site scripting (XSS). https://www.owasp.org/index.php/Cross-site_Scripting_(XSS). Accessed: 2016/04/23.

[5] Damn Vulnerable Web Application. http://www.dvwa.co.uk/. Accessed: 2013/10/23.

[6] Dan Guido. Fuzzing introduction, fall 2010. https://fuzzinginfo.files.wordpress.com/2012/05/fuzzingintro_fall2010.pdf. Accessed: 2015/12/06.

[7] Homepage for the OWASP Broken Web Applications Project. https://www.owasp.org/index.php/OWASP_Broken_Web_Applications_Project. Accessed: 2015/09/10.

[8] The microsoft sdl. https://blogs.msdn.com/blogfiles/publicsector/

WindowsLiveWriter/ReferenceMicrosoftSecurityDevelopmentLif_7279/

image_2.png. Accessed: 2016/04/17.

[9] OWASP JBroFuzz web application fuzzer. https://www.owasp.org/index.php/

JBroFuzz. Accessed: 2016/04/21.


[10] Owasp sql injection explanation. https://www.owasp.org/index.php/SQL_

Injection. Accessed: 2016/04/24.

[11] Peach fuzzer. http://www.peachfuzzer.com/. Accessed: 2016/02/08.

[12] Python network packet crafting library. http://www.secdev.org/projects/

scapy/. Accessed: 2014/09/22.

[13] Selenium browser automation. http://www.seleniumhq.org/. Accessed: 2015/01/10.

[14] Smashing the stack for fun and profit. http://insecure.org/stf/smashstack.

html. Accessed: 2016/02/04.

[15] Spiderlabs magical code injection rainbow testbed. https://github.com/

SpiderLabs/MCIR. Accessed: 2016/04/21.

[16] sqlparse non-validating sql parser module for python. https://github.com/

andialbrecht/sqlparse. Accessed: 2016/07/18.

[17] w3af web application security scanner. http://w3af.org/. Accessed:

2016/04/21.

[18] zzuf mutation-based fuzzer. http://caca.zoy.org/wiki/zzuf. Accessed:

2016/04/24.

[19] Sofia Bekrar, Chaouki Bekrar, Roland Groz, and Laurent Mounier. Finding

software vulnerabilities by smart fuzzing. In Software Testing, Verification and

Validation (ICST), 2011 IEEE Fourth International Conference on, pages 427–

430. IEEE, 2011.


[20] Josip Bozic and Franz Wotawa. Model-based testing-from safety to security. In

Proceedings of the 9th Workshop on Systems Testing and Validation (STV'12),

pages 9–16, 2012.

[21] Chi-Cheong. Genetic algorithms in portfolio optimization. Computing in Eco-

nomics and Finance 2001 204, Society for Computational Economics, 2001.

[22] Crispin Cowan, Perry Wagle, Calton Pu, Steve Beattie, and Jonathan Walpole.

Buffer overflows: Attacks and defenses for the vulnerability of the decade.

In DARPA Information Survivability Conference and Exposition, 2000. DIS-

CEX’00. Proceedings, volume 2, pages 119–129. IEEE, 2000.

[23] ICT Data and Statistics Division. ICT Facts and Figures 2015. 2015.

[24] Kenneth De Jong, D Fogel, and Hans-Paul Schwefel. Handbook of evolutionary

computation. IOP Publishing Ltd and Oxford University Press, 1997.

[25] Jared DeMott, Richard Enbody, and William F Punch. Revolutionizing the

field of grey-box attack surface testing with evolutionary fuzzing. BlackHat and

Defcon, 2007.

[26] Fabien Duchene. How I evolved your fuzzer: Techniques for black-box evolution-

ary fuzzing.

[27] Fabien Duchene, Sanjay Rawat, Jean-Luc Richier, and Roland Groz. Kameleon-

fuzz: evolutionary fuzzing for black-box xss detection. In Proceedings of the

4th ACM conference on Data and application security and privacy, pages 37–48.

ACM, 2014.

[28] Verizon Enterprise. Data breach investigations report. Technical report, Verizon

Communications, Inc., 2015.


[29] Larry J Eshelman. The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination. Foundations of Genetic Algorithms (FOGA 1), 1:265, 1991.

[30] Nick Galbreath. libinjection software. https://github.com/client9/

libinjection. Accessed: 2015/09/03.

[31] Thomas Geijtenbeek, Michiel van de Panne, and A Frank van der Stappen.

Flexible muscle-based locomotion for bipedal creatures. ACM Transactions on

Graphics (TOG), 32(6):206, 2013.

[32] Liu Guang-Hong, Wu Gang, Zheng Tao, Shuai Jian-Mei, and Tang Zhuo-Chun.

Vulnerability analysis for x86 executables using genetic algorithm and fuzzing.

In Convergence and Hybrid Information Technology, 2008. ICCIT’08. Third In-

ternational Conference on, volume 2, pages 491–497. IEEE, 2008.

[33] Tao Guo, Puhan Zhang, Xin Wang, and Qiang Wei. Gramfuzz: fuzzing testing of

web browsers based on grammar analysis and structural mutation. In Informatics

and Applications (ICIA), 2013 Second International Conference on, pages 212–

215. IEEE, 2013.

[34] R Hansen and M Patterson. Stopping injection attacks with computational the-

ory. In Black Hat Briefings Conference, 2005.

[35] John H Holland. Adaptation in natural and artificial systems: an introductory

analysis with applications to biology, control, and artificial intelligence. U Michi-

gan Press, 1975.

[36] Christian Holler, Kim Herzig, and Andreas Zeller. Fuzzing with code fragments.

In Presented as part of the 21st USENIX Security Symposium (USENIX Security

12), pages 445–458, 2012.


[37] Gregory S Hornby, Al Globus, Derek S Linden, and Jason D Lohn. Automated

antenna design with evolutionary algorithms. In AIAA Space, pages 19–21, 2006.

[38] Michael Howard and Steve Lipner. The security development lifecycle. O’Reilly

Media, Incorporated, 2009.

[39] Yating Hsu, Guoqiang Shu, and David Lee. A model-based approach to security

flaw detection of network protocol implementations. In Network Protocols, 2008.

ICNP 2008. IEEE International Conference on, pages 114–123. IEEE, 2008.

[40] Vincenzo Iozzo. 0-knowledge fuzzing. Black Hat DC, 2010.

[41] Jeff Williams and Dave Wichers. Owasp top 10 2013. http://www.owasp.org. Accessed: 2015/11/4.

[42] Michal Zalewski (lcamtuf). american fuzzy lop (2.10b). http://lcamtuf.

coredump.cx/afl/. Accessed: 2016/04/18.

[43] Li Li, Qiu Dong, Dan Liu, and Leilei Zhu. The application of fuzzing in web soft-

ware security vulnerabilities test. In Information Technology and Applications

(ITA), 2013 International Conference on, pages 130–133. IEEE, 2013.

[44] M. Sutton, A. Greene, and P. Amini. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley, Boston, MA, 2007.

[45] Melanie Mitchell. An introduction to genetic algorithms. MIT Press, Cambridge, Massachusetts; London, England, fifth printing, 1999.

[46] Mike Sconzo and Brian Wylie. Data hacking sql injection exercise. https://github.com/ClickSecurity/data_hacking. Accessed: 2014/8/22; video: https://www.youtube.com/watch?v=8lF5rBmKhWk.


[47] Barton P Miller, Louis Fredriksen, and Bryan So. An empirical study of the

reliability of unix utilities. Communications of the ACM, 33(12):32–44, 1990.

[48] Charlie Miller and Zachary N. J. Peterson. Mobile Systems IV. Technical report,

Independent Security Evaluators, 03 2007.

[49] Adam Muntner.

[50] John Neystadt. Automated penetration testing with white-box fuzzing. MSDN

Library, 2008.

[51] Sanjay Rawat and Laurent Mounier. An evolutionary computing approach for

hunting buffer overflow vulnerabilities: A case of aiming in dim light. In Com-

puter Network Defense (EC2ND), 2010 European Conference on, pages 37–45.

IEEE, 2010.

[52] Andy Renk. !exploitable crash analyzer - msec debugger extensions. https:

//msecdbg.codeplex.com/. Accessed: 2016/03/12.

[53] Roger Lee Seagle Jr. A framework for file format fuzzing with genetic algorithms.

2012.

[54] Michael Sipser. Introduction to the Theory of Computation, volume 2. Thomson

Course Technology Boston, 2006.

[55] Soen. Evolving exploits through genetic algorithms, 2013. DEFCON 21.

[56] Sherri Sparks, Shawn Embleton, Ryan Cunningham, and Cliff Zou. Automated

vulnerability analysis: Leveraging control flow for evolutionary input crafting. In

Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third

Annual, pages 477–486. IEEE, 2007.


[57] Omer Tripp, Omri Weisman, and Lotem Guy. Finding your way in the test-

ing jungle: a learning approach to web security testing. In Proceedings of the

2013 International Symposium on Software Testing and Analysis, pages 347–357.

ACM, 2013.

[58] Alan M Turing. Computing machinery and intelligence. Mind, 59(236):433–460,

1950.

[59] Yi-Hsun Wang, Ching-Hao Mao, and Hahn-Ming Lee. Structural learning of at-

tack vectors for generating mutated xss attacks. arXiv preprint arXiv:1009.3711,

2010.

[60] David H Wolpert and William G Macready. No free lunch theorems for opti-

mization. Evolutionary Computation, IEEE Transactions on, 1(1):67–82, 1997.

[61] Dingning Yang, Yuqing Zhang, and Qixu Liu. Blendfuzz: A model-based frame-

work for fuzz testing programs with grammatical inputs. In Trust, Security and

Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th In-

ternational Conference on, pages 1070–1076. IEEE, 2012.


SCOTT M. SEAL
[email protected] · github.com/sseal

EDUCATION

Wake Forest University, May 2016
Master of Science in Computer Science, Overall GPA: 3.166

Wake Forest University, May 2013
Bachelor of Arts in English & Computer Science, Overall GPA: 3.352

Technical Coursework: Network and Computer Security, Internet Protocols, Algorithms, Artificial Intelligence, Operating Systems, Linux Administration, Discrete Mathematics, Calculus, Linear Algebra

EXPERIENCE

Wake Forest University, Winston-Salem, NC
Research and Teaching Assistant, September 2013 - May 2016

· Supported research which implemented a "Moving Target" security configuration system for network hosts

· Conducted thesis research that explored the application of machine learning techniques, language theory, and evolutionary algorithms to optimize SQL-injection and XSS auditing approaches

· Organized and taught undergraduate lectures on introductory topics related to operating systems and computer security, involving attacker life-cycle, security vulnerability auditing, and secure software practices

Pacific Northwest National Laboratory, Richland, WA
Masters Intern, June 2014 - September 2014

· Developed auto-refresh functionality for a network traffic visualization application written in Java

· Provided operational assistance for a company-wide Capture the Flag competition, which involved instructing new participants on attack classes, exploitation techniques, and general secure development practices

· Developed Capture the Flag challenges, including a firewall rules testing application which utilized the Flask microframework, Scapy packet manipulation software, an nginx reverse proxy, and the gunicorn HTTP server

B/E Aerospace, Winston-Salem, NC
Operations Security Intern, June 2013 - August 2013

· Supported company-wide security operations and incident response handling
· Developed automated tools and workflow procedures that increased the efficiency of incident management and mitigation

Cisco Systems, Inc., Knoxville, TN
Software Engineering Intern: R&D, June 2012 - August 2012

· Developed an analytic web application using the Ruby on Rails framework
· Learned and developed software security analysis skills through independent study and participation in CTF challenges within penetration testing environments


· Studied and analyzed secure software development practices and related vulnerability classes/attack vectors

TECHNICAL SKILLS

Computer Languages and Technologies: Python, Ruby, Java, C/C++, R

Frameworks, Protocols & APIs: Rails and REST APIs, Flask, Scapy, Nginx, Apache HTTP Server, JSON, pandas, numpy, scikit-learn

Databases & Developer Tools: MySQL, PostgreSQL, SQLite, Git, SVN, Vim, NetBeans

Security Tools: Burp Suite, Metasploit Framework, BackTrack/Kali Linux, gdb, IDA Disassembler
