
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 34, NO. 1, FEBRUARY 2004 357

A Formal Framework for Positive and Negative Detection Schemes

Fernando Esponda, Stephanie Forrest, and Paul Helman

Abstract—In anomaly detection, the normal behavior of a process is characterized by a model, and deviations from the model are called anomalies. In behavior-based approaches to anomaly detection, the model of normal behavior is constructed from an observed sample of normally occurring patterns. Models of normal behavior can represent either the set of allowed patterns (positive detection) or the set of anomalous patterns (negative detection). A formal framework is given for analyzing the tradeoffs between positive and negative detection schemes in terms of the number of detectors needed to maximize coverage. For realistically sized problems, the universe of possible patterns is too large to represent exactly (in either the positive or negative scheme). Partial matching rules generalize the set of allowable (or unallowable) patterns, and the choice of matching rule affects the tradeoff between positive and negative detection. A new match rule is introduced, called r-chunks, and the generalizations induced by different partial matching rules are characterized in terms of the crossover closure. Permutations of the representation can be used to achieve more precise discrimination between normal and anomalous patterns. Quantitative results are given for the recognition ability of contiguous-bits matching together with permutations.

Index Terms—Anomaly detection, artificial immune systems, intrusion detection, negative detection.

NOMENCLATURE

ℓ: Length of a string, a pattern, or a packet (typically given in bits).
r: Size of the sliding window.
ω: Window, a specification of r adjacent symbol positions in a string. When the window length is understood, it suffices for ω to specify only a start position.
ℓ − r + 1: Number of windows in a string of length ℓ.
U: All possible strings of length ℓ for a given alphabet.
S: Subset of U used to denote a sample of "normal" strings.
real-self: Subset of U used to denote the complete set of "normal" strings.
String or packet generated at time t.
Legitimate packet-generating process.

Manuscript received July 18, 2002; revised March 24, 2003. This work was supported by the National Science Foundation under Grants ANIR-9986555 and DBI-0309147, the Office of Naval Research under Grant N00014-99-1-0417, the Defense Advanced Projects Agency under Grant AGR F30602-00-2-0584, the Intel Corporation, the Santa Fe Institute, and the Consejo Nacional de Ciencia y Tecnología (México) under Grant 116691/131686. This paper was recommended by Associate Editor D. Cook.

The authors are with the Computer Science Department, University of New Mexico, Albuquerque, NM 87131 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSMCB.2003.817026

Malicious packet-generating process.
Random variable indicating whether the legitimate or the malicious process generated the packet at time t.
Hidden process governing the generation of the packet at time t.
sample_t(S): Dynamic sample of S at instant t.
d: String detector.
D: Set of detectors.
CC(S): Crossover closure generation rule of S for a given window size r.
s|ω: For s ∈ U, s projected onto window ω.
Match predicate: d matches s under the match rule in use.
Scheme_{∨−}(D): Set of strings s in U such that, for all d ∈ D and all windows ω, d does not match s at ω.
Scheme_{∨+}(D): Set of strings s in U such that, for some d ∈ D and some window ω, d matches s at ω.
Scheme_{∧+}(D): Set of strings s in U such that, for every window ω, some d ∈ D matches s at ω.
Scheme_{∧−}(D): Set of strings s in U such that, for some window ω, no d ∈ D matches s at ω.

I. INTRODUCTION

In anomaly detection, the normal behavior of a process is characterized by a model, and deviations from the model are called anomalies. In behavior-based approaches to anomaly detection, the model of normal behavior is constructed from an observed sample of normally occurring patterns. Models of normal behavior can represent either the set of allowed instances (positive detection) or the set of all anomalous instances (negative detection).

For most real problems, the set of positive instances is much smaller than the set of complementary ones. Thus, it would seem wise to derive a representation of the smaller set, rather than its enormous complement. However, there are several reasons why negative detection merits further consideration. First, it has been used successfully both in engineering applications and by naturally occurring biological systems. Second, if we assume a closed world, then from an information-theory perspective, the normal and abnormal sets both contain the same amount of information, which suggests that there might be equally compact representations of the negative patterns [13]. A third point about negative detection is that it allows the detection process to be distributed across multiple locations with virtually no communication required among the distributed components. This property allows several forms of distributed processing: checking small sections of a large object independently, independent detector sets (e.g., each one running on a separate machine), or evaluating each individual detector in a detector set independently. As the scale of anomaly-detection problems increases, the need for distributed processing is likely to grow. Indeed, the original inspiration for the negative-detection approach came from the natural immune system, which uses negative detection in a massively distributed environment: the body.

Fig. 1. Self-nonself discrimination. A universe of data points is partitioned into two sets: self and nonself. Negative detectors cover subsets of nonself.

The negative-detection approach to anomaly detection, which was originally published as the negative-selection algorithm [1], [3], was modeled after a method used by the natural immune system to prevent autoimmunity [37]. In the immune system, certain cells known as T cells undergo a multistage maturation process in an isolated environment: an organ called the thymus. While in the thymus, T cells are censored against the normally occurring peptide patterns of the body, called self. T cells that react with self are deleted in the thymus before they can become active and cause autoimmunity. The only T cells allowed to mature and leave the thymus are those that survive this censoring operation.1 Such cells then circulate through the body freely and independently, eliminating any material that they can bind. Because of the censoring process, such material is implicitly assumed to be foreign and is known as nonself.

The translation of this mechanism into an algorithm for computers is straightforward. First, we assume that the anomaly-detection problem is posed as a set of strings (real-self), all of fixed length ℓ, of which we can access only a sample S at any given time. The universe of all ℓ-length strings is referred to as U, and the set of anomalous patterns to be detected is the complement of real-self in U. We consider all strings to be binary, although the analysis is generalizable to a larger alphabet. Candidate detectors are generated randomly and are censored against S; those that fail to match any of the strings in S are retained as active detectors. Such detectors are known as negative detectors, and if S is a good sample of real-self, each negative detector will cover (match) a subset of nonself. The idea is that by generating sufficient numbers of independent detectors, good coverage of the nonself set will be obtained. Fig. 1 shows the relationship of these sets pictorially.
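As a minimal sketch of this generate-and-censor loop (the function names and the exact-match censoring rule below are ours, standing in for whatever partial match rule a real implementation would use), the algorithm can be written as follows.

```python
import random

def matches(detector: str, s: str) -> bool:
    # Placeholder match rule: exact equality. A practical system would use a
    # partial match rule such as r-contiguous bits (Section IV).
    return detector == s

def negative_selection(self_sample: set[str], ell: int, n_detectors: int) -> set[str]:
    """Randomly generate candidate detectors and discard (censor) any that
    match a string in the observed self sample S."""
    detectors: set[str] = set()
    while len(detectors) < n_detectors:
        candidate = "".join(random.choice("01") for _ in range(ell))
        if not any(matches(candidate, s) for s in self_sample):
            detectors.add(candidate)  # survives censoring: an active negative detector
    return detectors

def is_anomalous(packet: str, detectors: set[str]) -> bool:
    # A packet matched by any negative detector is flagged as nonself.
    return any(matches(d, packet) for d in detectors)
```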

This algorithm learns to distinguish a set of normally occurring patterns (self) from its complement (nonself) when only positive instances of the class are available. In the machine learning literature, this has been studied as the problem of

1Mature T cells have survived at least two other censoring operations: one involving genetic rearrangements and one involving positive selection.

learning from examples of a single class or as concept learning from positive examples [25], [68]. Although in principle this learning problem is solvable for noise-free classes and certain formal domains, in many practical cases, the problem is known to be computationally intractable [27] or to lead to substantial overgeneralization [26]. The statistical community has examined the closely related problems of outlier detection and robust statistics, finding that the effectiveness of the learning system depends critically on domain-dependent assumptions about the distributions of self and nonself data. For example, in a computer security context, Lane employed time-series models of user behaviors and a prior bias against false alarms to differentiate intruders from authorized system users [28].

There are many well-known change-detection and checksum algorithms that solve a restricted form of the anomaly-detection problem, known as change detection. Here, self is known exactly, is small enough to be stored in a single location, remains constant over time, and can be unambiguously distinguished from nonself. However, for cases in which these assumptions do not hold, the discrimination task is more challenging, and in these situations, the negative-detection approach may be appropriate.

Since its introduction in [1], interest in negative detection has been growing, especially for applications in which noticing anomalous patterns is important, including computer virus detection [21], [22], intrusion detection [8], [15], [16], and industrial milling operations [4]. Recently, other categories of applications have been proposed, including color image classification [34] and collaborative filtering and evolutionary design [12]. Underlying all this work, however, is the question of negative versus positive detection: When is it advantageous to use a negative-detection scheme instead of the more straightforward positive-detection scheme?

To date, there has been little theoretical analysis of the relative merits of positive versus negative detection, and that is the primary goal of this paper. Such analysis is complicated by several factors, which we also address. Most important is the choice of representation and match rule. How is the set of normal patterns represented? How are anomaly detectors represented? What are the criteria for determining when a detector has detected a pattern? The analysis is further complicated if the self set is too large or too distributed to observe completely, if it is nonstationary (self sets that change over time), or if the detectors are changing over time.

In this paper, a formal framework is given for analyzing the tradeoffs between positive and negative detection schemes in terms of the number of detectors needed for maximum coverage. Earlier analyses focused on the cost of generating detectors (see Section II), rather than on the coverage provided by a complete set of detectors. We then discuss the need for partial matching rules and generalization, as well as what this implies for comparing positive and negative detection. We review the r-contiguous bits (rcb) matching rule, which is the most common match rule used with negative detection. We introduce a related match rule, called r-chunks, discuss its advantages, and characterize the generalizations that are induced by it in terms of a concept known as the crossover closure (see Fig. 3), giving the expected size of the crossover closure for r-chunks. We are


then in a position to compare quantitatively the positive and negative detection schemes under r-chunks. Finally, we consider a method for increasing the diversity of coverage, known as permutation masks [8], [14], show that permutation masks reduce the number of nondetectable strings in the nonself set for any fixed detection threshold, and give quantitative results on how many permutations are required to maximize this effect.

One consequence of using an approximate scheme for representing a class of explicit instances (e.g., the self set) is that the class will implicitly contain some instances that were not in the original sample. If the approximation is a good one, then this is an advantage, and the approximation is said to generalize from the set of observed patterns to a useful class. In the case of anomaly detection, new observations that are similar to the original sample (self) are classified as part of self, and false positives are reduced. If the generalization is a poor one, new observations will frequently be misclassified, leading to high false-positive and/or false-negative rates. In this paper, partial matching rules specify the approximation, and the parameter r controls how large the approximation is. We are interested in characterizing as precisely as possible the kinds of generalizations that are induced by our partial match rules. When we discuss the entire set covered by the approximation, we refer to it as the generalization. When we discuss individual strings in the generalization that were not in the original sample, we refer to them as holes [3]: strings not in self for which valid fixed-r detectors cannot be generated.

II. RELATED WORK

Outlier detection is a problem arising in many machine learning and data mining contexts. Abstractly, two families of approaches can be identified. Outlier detection with respect to probability distributions attempts to identify those events that are in the low-density regions of the probability distribution governing the generation of normal events [49], [50], [52], [58]. Under this approach, which we refer to as statistical anomaly detection, the degree of suspicion attached to an event is inversely proportional to the frequency with which the event has been observed historically.

A second family of approaches to outlier detection attempts to define a measure of distance on the event space [51], [54], [59], [60], [62], [64]. Common distance measures include the classic L2-norm and Hamming distances, Manhattan distance (also known as rectilinear distance, which generalizes Hamming distance to nonbinary coordinate systems), and the vector cosine measure of Salton [65]. Under this approach, the degree of suspicion attached to an event is directly proportional to the distance of the event from, for example, the nearest observed normal event or the center of mass of clusters of normal events.

The main difficulty in applying statistical anomaly detection is that the probability distribution is not known exactly and must be estimated from a sample. Often, this sample is sparse, making density estimation highly error prone. In many applications, the majority of events we must judge do not appear in the sample at all and, hence, must be classified as "most suspect." The main difficulty in applying distance criteria is in the construction of a measure that reflects a useful metric of similarity. A poor choice of measures results in meaningless classifications.

Often, either implicitly or explicitly, the approaches are combined [48], [61]. For example, data reduction transformations, such as feature selection and value aggregation, collapse distinct events into clusters of events that are to be treated as if they were indistinguishable [57], [66]. An event is assigned the composite density of the cluster into which it falls. In this way, many events are clustered together, all of which are classified as either suspect or not based on the cluster's composite density. The most suspect events will cluster with no previously encountered events, each forming a low-density cluster of its own.

Statistical anomaly detection is often used on the problem of intrusion detection in computer security, for example, [15], [23], [35], [41]–[45], [55]. Intrusion-detection systems vary widely, but they all seek to protect an information system (e.g., a single computer, a database, a server, or a network of computers) from violations of the system's security policy. In the case of anomaly intrusion-detection systems, protection is provided by building a model of the normal (legal) operation of the system and treating discrepancies from the model as anomalies. The model of normal behavior can be based on any observable behavior of the system. Audit logs, patterns of network traffic, user commands, and system calls are all common choices. Such an approach can work well if the anomalies in normal operation are well correlated with security violations. The extent to which this is true is a topic of current debate in the intrusion-detection community.

In [1], the negative-selection algorithm was introduced as a method for representing and generating distributed sets of detectors to perform change detection. The negative-selection algorithm was demonstrated and analyzed in the context of the rcb match rule [38], [39], which is a partial-matching rule between two strings similar to Hamming distance. In that paper, a probabilistic expected-case analysis of the algorithm's performance was given for the case where the protected data and the detectors are assumed to be static, and the protected data are randomly generated. The analysis relied on the assumption that independently generated detectors would have independent detection capabilities. Use of the algorithm was illustrated with the problem of computer virus detection, and it was noted that observed performance on real data sets was often significantly better than that predicted by the expected-case analysis. This improved performance was ascribed to regularities in the protected data (self).

References [3] and [13] reported a linear time algorithm for generating detectors (linear in the number of protected strings) whose space requirement is determined by the threshold value r and the length ℓ of the strings. This new algorithm addressed an important limitation, as the cost of generating detectors using the original algorithm increased exponentially with the size of the protected data. Limits to the precision of coverage that is possible under most plausible matching rules were also identified. These limits arise because partial matching rules obscure the boundary between the self and nonself sets. In [3], the number of such gaps, called holes, for certain partial matching rules was counted, and a lower bound on the size of the detector set needed to achieve a given probability of detecting abnormal strings was estimated. Following up on this work, Wierzchon developed a low space-complexity algorithm for generating an optimal


repertoire of detector strings [17]–[19] and included an analysis of the holes in coverage associated with his algorithm. In this work, it was assumed that both the protected data and the detectors are unchanging through time. Likewise, both projects focused on the problem of generating detectors efficiently, rather than the problem of how many detectors are needed for maximal performance. The latter question is the one addressed in this paper.

Lamont and Harmer applied negative selection to the problem of computer virus detection, focusing on different match rules [21]. They developed a prototype implementation to demonstrate that the approach is both scalable and effective. As part of their work, they evaluated 12 different match rules, including Hamming distance and rcb, concluding that the Rogers and Tanimoto rule (an extension of Hamming distance) was the best fit for their application because it did the best job of balancing the generality and specificity of each individual detector.

Hofmeyr [7], [8], [14] described a network intrusion-detection system that incorporated a modified form of the negative-selection algorithm, together with several other mechanisms, including the use of permutation maps to achieve diversity of representation. His experiments were the first in this line of work to consider dynamically changing data sets (e.g., when the definition of normal behavior is either incomplete or nonstationary) and a distributed environment. The setting was network intrusion detection on a local area network, and the detectors monitored the flow of TCP SYN packets through the network. The original negative-selection algorithm was modified to accommodate dynamic data sets; detectors were generated asynchronously and allowed to undergo negative selection in the live environment of the network. Thus, the intrusion-detection system featured rolling sets of detectors in the sense that detectors generated at different times were potentially exposed to slightly differing patterns of self.

Kim and Bentley also implemented a network intrusion-detection system which incorporates negative selection [16]. Their implementation used a significantly more complicated representation than [14], with a much larger alphabet, shorter strings, and more general matching thresholds (a lower r in r-contiguous bits matching). Like Hofmeyr, they used the original random-generation method of producing detectors. They reported difficulties generating valid detectors and concluded that negative selection is infeasible on its own for realistic network intrusion detection. Balthrop et al. [10] propose several explanations for their results, based largely on an analysis of the parameter settings that were selected. Williams et al. [15] report positive results when using negative selection on the network intrusion detection problem. Their results and approach more closely parallel that taken in Hofmeyr's work. Their implementation goes beyond [14] and [16], however, in that it monitors UDP and ICMP packets as well as TCP packets.

Dasgupta and Gonzalez used negative detection on the network intrusion-detection problem [33]. They worked with the Lincoln Labs data set [29], used a real-valued representation, and divided the Lincoln Labs data into overlapping intervals. The paper is interesting because it compares the performance of positive and negative characterization of self, reporting that the positive approach is more accurate but also computationally more expensive. It is not obvious how meaningful the comparison is, however, because the positive and negative methods are quite different. The positive-detection method simply memorizes a set of training (attack-free) points and classifies any test point as "normal" if it is within a given distance of any memorized self-point. The negative-detection method uses a genetic algorithm to evolve complex rules for detecting nonself. Thus, the study could more properly be viewed as comparing evolved rules with a simple memory-based approach.

In addition to intrusion detection, several authors have been interested in using negative detection for other applications. Sathyanath and Sahin used negative detection for color image classification of finished wooden components: specifically, kitchen cabinets [34]. Bradley and Tyrrell describe a hardware implementation using field programmable gate arrays (FPGAs) [30]–[32]. Their detectors monitor transitions, together with some input/output information, in finite-state machines to detect faults. The implementation uses negative detection and rcb match rules. Detectors are generated offline, and the entire self set is known at the time detectors are being generated. Chao and Forrest [12] describe a negative-detection approach to interactive search algorithms, in which subjective evaluations drive the exploration of large parameter spaces. Their algorithm learns what parts of the search space are not useful, based on the negative-detection strategy of the natural immune system. The algorithm is capable of finding consensus solutions for parties with different selection criteria.

Sometimes, the set of anomalous patterns is known, or thought to be known, exactly, and a set of signatures is used to cover the dangerous parts of nonself precisely. Examples of this approach are signature-based scanners for intrusion detection [46] and commercial virus scanners.

III. STATISTICAL MODELING FRAMEWORK

In this section, we cast the anomaly intrusion-detection problem in a statistical framework. Although the unrestricted version of the framework is quite complex, it is nevertheless useful to begin with the overall view. This helps us make rigorous certain notions which are required later in the paper, and it allows future work to evolve toward meeting broader objectives.

Our basic unit of analysis is a length-ℓ binary string, called a packet. At each time step t, a packet is generated. We assume that the packet is generated either by the legitimate process or by the malicious process. In reality, there could be any number of subprocesses contributing to each of these processes (e.g., many legitimate users with different behaviors), but we proceed as if there were a single legitimate process and a single malicious process.

More precisely, a sequence of length-ℓ binary strings (packets) is generated. A hidden process governs the generation of the packet at time t; a random variable indicates whether the legitimate or the malicious process generated that packet. In general, it may be the case that the distribution of this random variable depends on the sequences of previously generated packets and of previous values of the indicator variable, as well as on the identity of the particular time step t. In the remainder of this paper, however, we assume that the


distribution of the indicator variable is independent of the earlier packets, of the earlier indicator values, and of the time step itself. Thus, the probabilities that a packet is generated by the legitimate process and by the malicious process, which are also referred to as the prior probabilities of the two processes, are unchanging over time.

There are several (typically unknown) next packet distributions of potential interest; estimating these distributions, under various assumptions to be introduced below, is a major focus of this paper. The pair of distributions of the next packet under the legitimate and the malicious processes, respectively, is conditioned on knowledge of the first t − 1 packets generated, plus a specification (the values of the indicator variables) of which process generated each of these packets. Estimating these distributions from the data amounts to a supervised learning problem, since the indicator values label the training data, indicating which process generated each packet. Note that these probabilities are related via Bayes theorem to the posterior distribution on which process generated the current packet. See Section III-B for details. Note also that the mixture distribution for the next packet generated is the prior-weighted combination of the two process-conditional distributions.

In the applications studied in this paper, we have no direct knowledge of the values of the indicator variables; that is, we do not know with certainty which of the processes generated the previous packets in the sequence, leaving us with an unsupervised learning problem. In particular, we will be concerned with estimating the pair of next packet distributions either when we have knowledge of the process priors (that is, knowledge of the distribution of the hidden process) or when we do not have knowledge of the process priors.

For some problems, the ultimate objective is to identify short temporal groupings of anomalous packets or the time intervals most likely to contain packets generated by the malicious process. Many intrusion-detection problems have this quality, and the above formulation supports such an analysis. However, there are important special cases in which the analysis shifts from sequences of packets to single packets, and these are the focus here.

The next packet distribution of each process may depend on the current time step t, as well as on the pattern of packets generated at time steps prior to t. This allows the possibility that each of the two processes is nonstationary, where its distribution depends both on the actual time step and the identity of packets previously generated by both processes. This nonstationarity might reflect, for example, that we have the following.

• Composite user behavior changes over time (different populations of users).
• A user's behavior at time t depends on behaviors at earlier times (a single user's action depends on his or her prior actions).
• The nature of attacks changes over time.
• There are changes in the environment (e.g., a new computer added to a network).

We abstract away the question of whether and how the true distributions of the two processes change over time by specifying our posterior of the next packet generated at any instant t in time, given some dynamic sample of S at instant t. That is, in this paper, we focus on situations in which it suffices to replace conditioning on the full history of packets and indicator values by conditioning on sample_t(S), where sample_t(S) is an (unordered) set of packets. sample_t(S) may include a subset of the packets generated up to time t, restricted to packets deemed to have been generated by the legitimate process (i.e., packets for which it is deemed likely that the indicator points to the legitimate process) and perhaps further restricted by other criteria (e.g., recency, random sampling). sample_t(S) may also be seeded with an initial population of packets.

sample_t(S) reflects our current state of knowledge of the legitimate process at time t, and the next packet generated by that process is dependent only on this knowledge: given sample_t(S), the legitimate process's distribution is conditionally independent of all other information (e.g., of other previously generated packets). As sample_t(S) changes with time (see below), the next packet distribution changes. Whether or not the process actually is changing over time (or whether, for example, our state of knowledge simply is changing) is immaterial to the model. Further, because sample_t(S) is unordered and assumed to be generated by, or otherwise representative of, the legitimate process, the detection question of interest is to decide whether the next packet is generated by the legitimate or the malicious process; our model does not address the problem of deciding whether a subsequence embedded in the packet stream is generated by one process or the other.

In the remainder of this paper, we consider a simple family of distributions, conditioned on sample_t(S), as our modeling of how the legitimate process generates its next packet. Intuitively, we assume a distance measure between an arbitrary packet p and the current sample_t(S). The closer a packet p is to sample_t(S), the greater the probability that p is generated next. Many notions of distance are possible. For example, the distance of a packet p from sample_t(S) could be p's minimum Hamming distance to a member of sample_t(S). Note that this measure does not distinguish between members of sample_t(S) (assigning each such string a distance of 0) but does differentiate between packets not in sample_t(S). Alternatively, distance could be based on frequency counts; the "distance" of p from sample_t(S) could be the reciprocal of the number of instances of p in sample_t(S), with 1/0 defined to be some large constant. This distance measure fails to distinguish between packets not in sample_t(S) (assigning


each such string a distance equal to that large constant) but does differentiate among packets in sample_t(S). Of course, one could develop a hybrid of these two measures in which a distance based on Hamming distance is used for strings outside of sample_t(S), and a distance based on frequency counts is used for strings in sample_t(S).

We adopt a simple categorical division into "similar to sample_t(S)" versus "dissimilar from sample_t(S)" and distinguish these distance categories by means of a generation rule that attempts to characterize the underlying set from which sample_t(S) is likely drawn. Unlike the two distance measures mentioned earlier, a generation rule allows only the distinction between likely and unlikely strings under the legitimate process, which simplifies the detection task considerably. Experimental studies have shown that these simplifications often capture sufficient detail of process behavior to provide effective detection [5], [8].

Definition: A generation rule is a mapping from a set S of length-ℓ strings to a set of length-ℓ strings containing S. This is the generalization discussed in Section I.

Given a generation rule, we wish to specify a distribution to reflect that each member of the set generated from sample_t(S) is more likely to be generated at time t by the legitimate process than is each nonmember. Several simple alternatives exist, though care must be taken to ensure that a proper probability distribution results. For example, for some applications, it may be appropriate to have the probability of the set generated from sample_t(S) be some large multiplicative factor of the probability of the complement of this set. We can construct a probability distribution capturing this relative frequency by first assigning the two sets total probabilities in the desired ratio and then normalizing to obtain proper point probabilities for individual strings.

Assuming there are more strings in the complement of the generated set than in the set itself, the point probability of each member string would be greater than that of each nonmember string. Alternatively, in other contexts (e.g., when the generated set is larger than its complement), it would be necessary to specify directly that the common point probability of each member of the generated set is some large multiple of the common point probability of each nonmember, and then again normalize to proper probabilities by taking into account the number of members and nonmembers.

Whatever method is used to set the distribution, we assume that the probability of a packet p being generated next by the legitimate process, given sample_t(S), takes one constant value if p is in the set generated from sample_t(S) and a second, smaller constant value if it is not, where the two constants are chosen so that the point probabilities sum to one.

As a final comment, we note that under this modeling, although the generation rule and measure of distance (e.g., member or not of the generated set) are assumed fixed and known exactly (as opposed to being dependent on the data), the probabilities assigned under the legitimate process do change as the sample changes with time. In particular, since sample_t(S) typically is dependent on previously observed packets, evolving with our knowledge of the true image of the legitimate process, our distribution of the next packet generated changes as the data changes.
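As a minimal numerical sketch of the two-level distribution just described (the constant names p_in, p_out, and ratio below are ours, not the paper's), the two point probabilities can be obtained by fixing their ratio and normalizing over the numbers of member and nonmember strings:

```python
def two_level_distribution(n_members: int, universe_size: int, ratio: float = 100.0):
    """Return (p_in, p_out): point probabilities for strings inside and outside
    the generated set, with p_in = ratio * p_out and all point probabilities
    summing to one."""
    n_nonmembers = universe_size - n_members
    p_out = 1.0 / (ratio * n_members + n_nonmembers)
    p_in = ratio * p_out
    return p_in, p_out

# Example: 2^10 member strings out of a universe of 2^16 strings.
p_in, p_out = two_level_distribution(n_members=2**10, universe_size=2**16, ratio=100.0)
```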

A. Example Generating Rules

We now present three sample generation rules that can be used to characterize the underlying set from which sample_t(S) is drawn. The first is Hamming radius, which generalizes sample_t(S) to strings within a given Hamming distance of strings in the sample. Next, we consider crossover closure, which generalizes sample_t(S) to strings that can be constructed by pasting together the windows of strings appearing in sample_t(S). Finally, we examine n-gram matching, which generalizes sample_t(S) to strings whose n-grams have been observed in the sample.

1) Example: Hamming Radius: The Hamming distance measure discussed above can be simplified to a categorical measure based on a threshold. In particular, the generation rule is defined to be the set of strings within a Hamming distance of the threshold or less to some member of S. The use of this generation rule in detection is considered in Section VI. Hamming distance match rules have been used with negative detection in the context of immunological modeling, e.g., [6].
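A minimal sketch of this rule (the threshold name k is ours): membership in the generated set is a nearest-neighbor test under Hamming distance.

```python
def hamming(a: str, b: str) -> int:
    # Number of positions at which two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))

def in_hamming_radius(p: str, sample: set[str], k: int) -> bool:
    """True if packet p lies within Hamming distance k of some sample string,
    i.e., p belongs to the Hamming-radius generalization of the sample."""
    return any(hamming(p, s) <= k for s in sample)
```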

2) Example: Crossover Closure: The generation rule we consider in the most detail is called "crossover closure." As we will show in subsequent sections, the crossover closure is closely related to the two match rules most commonly used in conjunction with negative detection.

Given a set of strings S ⊆ U and a fixed window size r, the crossover closure of S is defined in terms of its length-r windows (i.e., r consecutive string positions) as

CC(S) = {s ∈ U : for all windows ω, s|ω ∈ S|ω}

where S|ω denotes the set of projections of the members of S onto window ω. In words, a string s is in the crossover closure of S if and only if each of s's windows exactly matches the corresponding window of some member of S. When S is such that CC(S) = S, we say that S is closed under crossover closure. As we will see, the class of sets closed under crossover closure has a central role in our methods.
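A direct transcription of this definition, assuming fixed-length binary strings (the code and the small sample below are ours): a string is in the crossover closure if each of its length-r windows occurs at the same position in some sample string.

```python
def windows(s: str, r: int):
    """All (start position, length-r substring) pairs of s."""
    return [(i, s[i:i + r]) for i in range(len(s) - r + 1)]

def in_crossover_closure(s: str, sample: set[str], r: int) -> bool:
    # Chunks observed in the sample, indexed by window position.
    observed = {(i, chunk) for t in sample for i, chunk in windows(t, r)}
    return all((i, chunk) in observed for i, chunk in windows(s, r))

# Example: 111 is in the crossover closure of {110, 011} for r = 2 because its
# windows 11 (position 0) and 11 (position 1) appear in 110 and 011, respectively.
assert in_crossover_closure("111", {"110", "011"}, r=2)
```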

One interesting motivation for crossover closure comes from relational database theory [53], [63]. The strings of a sample sample_t(S) can be viewed as the tuples of the current instance of a relation scheme whose attributes correspond to the ℓ packet positions, each with domain {0,1}. For example, a small sample can be represented as a relation scheme whose current instance contains one tuple per sample string.

For a variety of reasons (e.g., to reduce redundancy and enhance data integrity), it is often advantageous to represent a relation scheme as a decomposition into a collection of smaller relation schemes. Consider representing the scheme as a decomposition into one smaller scheme per window, each containing the r attributes of that window. The instance


of each smaller scheme is the projection of the original instance onto its attributes or, equivalently, is exactly the set of strings comprising the corresponding length-r window of sample_t(S).

In order to reconstruct the original instance from these projections, one computes their natural join. However, it is not always the case that the join of the projections recovers the original instance; in fact, the join of the projected instances is precisely the crossover closure of the set of tuples (strings) in the original instance. Relational database theory has exactly characterized when the projections join to recover the original instance (see Section VII-B and [53] and [63]). This is known as the lossless join condition. The lossless join condition thus also characterizes exactly when a set of packets is equal to its crossover closure. One interpretation of this correspondence is that since many naturally occurring collections of information do in fact satisfy the lossless join condition, it is plausible that in many contexts, the most likely set of strings to be generated by the legitimate process is closed under crossover closure. When this is the case, sample_t(S), if it is a representative sample of strings generated by that process, can be interpreted as being a sample drawn from CC(sample_t(S)) or from a larger set containing sample_t(S) and itself closed under crossover. In such situations, it is appropriate to deem packets that are members of CC(sample_t(S)) as relatively likely under the legitimate process.
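The same construction, phrased as the decomposition just described, can be sketched as follows (our code and our two-string sample): project the sample onto each window and join the projections back together; the join enumerates exactly the crossover closure.

```python
from itertools import product

def crossover_closure(sample: set[str], r: int) -> set[str]:
    """Compute the crossover closure as the natural join of the window
    projections: every way of choosing one observed chunk per window such that
    adjacent chunks agree on their overlapping r - 1 positions."""
    ell = len(next(iter(sample)))
    n_windows = ell - r + 1
    projections = [{s[i:i + r] for s in sample} for i in range(n_windows)]

    closure = set()
    for choice in product(*projections):
        if all(choice[i][1:] == choice[i + 1][:-1] for i in range(n_windows - 1)):
            # Rebuild the full string from the chain of overlapping chunks.
            closure.add(choice[0] + "".join(c[-1] for c in choice[1:]))
    return closure

# Example: the closure contains strings never seen in the sample.
print(sorted(crossover_closure({"110", "011"}, r=2)))  # ['010', '011', '110', '111']
```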

3) Example: n-Grams: The n-gram matching rule has been used in a wide variety of settings, including natural language processing [40], document classification [47], and program monitoring [5], [20], [24], [36] for intrusion detection. In the latter application, n-grams can be applied when the behavior of the protected process (for example, an executing computer program) can be represented as a sequence of letters taken from some alphabet. A sample of normal behavior is a collection of such sequences ("traces") collected under normal operating conditions. Different traces can be of different lengths and are parsed using a sliding window of size n. Each substring of length n (the n-gram) is stored in the program's profile.2 Once the profile is complete, a new execution is analyzed in terms of the n-grams composing its trace. Any n-gram that occurs during execution, which does not appear in the profile, constitutes an anomaly.

2The n-gram method described here is a variant of the lookahead pairs method described in [2] and [9].
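A minimal sketch of the profiling step described above (function and variable names are ours): build the set of n-grams occurring in the normal traces, then flag any n-gram in a new trace that is missing from the profile.

```python
def build_profile(traces: list[str], n: int) -> set[str]:
    """Collect every length-n substring (n-gram) observed in the normal traces."""
    profile: set[str] = set()
    for trace in traces:
        for i in range(len(trace) - n + 1):
            profile.add(trace[i:i + n])
    return profile

def anomalous_ngrams(trace: str, profile: set[str], n: int) -> list[str]:
    """Return the n-grams of a new trace that never occurred during normal operation."""
    return [trace[i:i + n] for i in range(len(trace) - n + 1)
            if trace[i:i + n] not in profile]

# Example with short alphabetic traces (system-call names abbreviated to letters).
profile = build_profile(["abcabc", "abcd"], n=3)
print(anomalous_ngrams("abdc", profile, n=3))  # ['abd', 'bdc']
```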

In order to characterize the generation rule for n-grams, we first restrict our attention to the generalization induced over traces of a single length and create a directed acyclic graph (DAG) whose levels each have a node for each distinct observed n-gram. A node with label x in one level is connected to a node with label y in the next level if the last n − 1 symbols of x match the first n − 1 symbols of y (akin to the previous example). A similar graphical construction was used in [69] for determining multiple-length n-grams. The set of protected traces of that length can be retrieved by traversing all possible paths, from the first to the last level, in this graph. We write the generalization induced for traces of a given size under n-grams as the crossover closure at that length (see Example 2). Finally, restricting traces to specific lengths, the generalization achieved by n-gram matching can be expressed as the union, over the set of acceptable lengths, of the corresponding crossover closures.

There are two differences between this generation rule and that presented in Section III-A2. In the latter, we considered all strings to be of the same specific length ℓ; this is a small assumption that could easily be relaxed. More importantly, in Section III-A2, we accept only those patterns that have been previously observed at a particular window position, whereas with n-grams, we disregard the information about window location.

B. Computing Posterior Probabilities on Intrusion Events

In general, deciding whether a packet is generated by the legitimate or the malicious process requires 1) distributions for both processes and 2) the process priors. In previous work [55]–[57], we considered models in which the malicious process's next packet distribution is uniform over all strings in U versus models for specific types of malicious behavior. Modeling the malicious distribution as uniform over all of U is a conservative approach, and it is consistent with the view that we have little prior information about likely forms of attack. Modeling it as uniform places a constant in the numerator of the likelihood ratio (the probability of the packet under the malicious process divided by its probability under the legitimate process) and, hence, equates the posterior probability that the malicious process generated the packet with the degree of the packet's rareness under the legitimate process. (With the numerator of the expression for the posterior fixed at a constant, only the packet's probability under the legitimate process varies as the packet varies.)

We identified this approach with what the literature commonly refers to as anomaly detection and reported experimental results with examples both when anomaly detection is sufficient and when it is not (and specific models of malicious behavior are necessary for satisfactory detector performance).


Here, we assume that the malicious process's distribution is uniform. With this assumption, we can either specify process priors and a detection threshold on the posterior probability that a packet was generated by the malicious process, or, as in [56] and [57], we can simply prioritize the packets by the likelihood they were generated by the malicious process, that is, rank the packets by the values of their likelihood ratios; this gives an ordering equivalent to ranking by the posterior, regardless of the values of the priors. With a constant in the numerator of the likelihood ratio, and the legitimate process modeled as described in the previous section, ranking packets by their likelihood ratios reduces to partitioning packets into two classes: those that are most likely generated by the legitimate process and those most likely not generated by it. We are interested in compact schemes that allow us to distinguish quickly between low- and high-probability anomalous packets. Under the family of distributions proposed above, this task is equivalent to distinguishing, for a given generation rule and current sample sample_t(S), between packets contained in the set generated from sample_t(S) and those outside this set. In the following sections, we present positive and negative detection schemes for addressing this problem. We also compare the time and space efficiencies of several variants of these schemes. All of these schemes use simple and efficient string matching rules for deciding whether or not a packet is a member of the set generated from sample_t(S).

IV. TAXONOMY OF DETECTION SCHEMES

This section explores several detection schemes based on the framework outlined above. Given a generation rule and a current sample sample_t(S), an instance of a detection scheme (known simply as a detection algorithm) must be able to decide whether or not a packet is a member of the set generated from sample_t(S). That is, a detection algorithm can be viewed as protecting this subset of U. Strings within the generated set are regarded as likely to be part of self, whereas those outside it are most likely to be anomalous. From the perspective of formal languages, a detection scheme is a language recognition model, and, analogous to the issues arising in the study of formal languages, we are interested in how to construct a detection scheme that can protect the class of languages implied by a particular generation rule of interest, as well as in determining the class of languages that various detection schemes can protect.

Many detection schemes can be constructed from the simple building blocks outlined so far: the simple string matching rules and the alternative interpretations associated with a match. Before proceeding to analyze specific cases of interest, e.g., [8], we first outline a systematic taxonomy for all the schemes. This taxonomy consists of three dimensions:

• the form of a single detector together with its match rule, a binary relation on detectors and packets specifying when a single detector and a single packet match;
• whether positive or negative detection is employed;
• disjunctive versus conjunctive matching: whether a single match is sufficient to trigger a decision (positive or negative) or whether multiple matches are required.

In the following sections, we study the relative power of the detection schemes that can be constructed from different choices along the above dimensions. We first compare the classes of languages that can be protected by the schemes and then consider the resources (time and space) required by the schemes to protect their languages.

A. Form of Detector and Matching Rule

A single detector, as considered in this paper, is either

1) an element of U;
2) an r-chunk, which is a length-r binary string along with a specified window position.

When a detector is an element of U, we are interested primarily in the rcb match rule; that is, for d, s ∈ U and a window ω, d matches s at ω if d and s are identical at each of the r positions specified by ω. We refer to such detectors under the rcb match rule as rcb detectors. (In Section VI, we contrast the rcb match rule with Hamming distance, which is an alternative match rule for detectors in U.) If d is an r-chunk on window ω, the matching rule we consider is, for s ∈ U: d matches s at ω if d equals the projection of s onto ω.
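The two match predicates, written out under the window convention used in this paper (a window is identified by its start position); this is our own sketch of the rules, not code from the paper.

```python
def rcb_match(d: str, s: str, w: int, r: int) -> bool:
    """rcb detector d (a full-length string) matches s at window w if the two
    strings are identical on the r contiguous positions starting at w."""
    return d[w:w + r] == s[w:w + r]

def chunk_match(chunk: str, w: int, s: str) -> bool:
    """An r-chunk detector (chunk, w) matches s at window w if the chunk equals
    the projection of s onto that window."""
    return s[w:w + len(chunk)] == chunk
```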

B. Detector Operation

Let D be a set of rcb or r-chunk detectors, and assume the corresponding matching rule from the previous subsection. We now consider the interpretation given to a match between a packet and a detector. In particular, does a match mean the packet is "in the language" (positive detection, denoted by +) or "outside the language" (negative detection, denoted by −)? Further, do we require a single match anywhere in the packet (disjunctive matching, denoted by ∨), or a match in all packet window positions (conjunctive matching, denoted by ∧)? The four combinations of possible answers specify four corresponding detection schemes: Scheme_{∨+}, Scheme_{∨−}, Scheme_{∧+}, and Scheme_{∧−}.

Let c denote a choice of ∨ or ∧ and p a choice of + or −. With c and p specified, a collection D of detectors (either rcb or r-chunk) acts as a parameter, instantiating a detection scheme Scheme_{c,p}(D). In particular, given the following interpretations for ∨, ∧, +, and −, Scheme_{c,p}(D) exactly defines a set of allowable strings; Scheme_{c,p}(D) protects, or recognizes, this set, flagging as anomalous strings that are outside Scheme_{c,p}(D).

The language (subset of U) "accepted" by each of the four detection schemes when instantiated with a fixed set D of rcb or r-chunk detectors is defined as follows.

1) Negative disjunctive detection Scheme_{∨−}: Scheme_{∨−}(D) is the set of strings s in U such that, for all d ∈ D and all windows ω, d does not match s at ω.
2) Positive disjunctive detection Scheme_{∨+}: Scheme_{∨+}(D) is the set of strings s in U such that, for some d ∈ D and some window ω, d matches s at ω.
3) Positive conjunctive detection Scheme_{∧+}: Scheme_{∧+}(D) is the set of strings s in U such that, for every window ω, some d ∈ D matches s at ω.
4) Negative conjunctive detection Scheme_{∧−}: Scheme_{∧−}(D) is the set of strings s in U such that, for some window ω, no d ∈ D matches s at ω.
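Under r-chunk detectors, the four languages can be written down directly; the sketch below (our code) tests membership of a string s in each scheme for a detector set D given as (window, chunk) pairs.

```python
def scheme_membership(s: str, D: set[tuple[int, str]], r: int) -> dict[str, bool]:
    """Membership of s in each of the four detection schemes, for r-chunk
    detectors D given as (window start, chunk) pairs."""
    n_windows = len(s) - r + 1
    # matched[w] is True if some detector matches s at window w.
    matched = [any(w == dw and s[w:w + r] == chunk for dw, chunk in D)
               for w in range(n_windows)]
    return {
        "negative_disjunctive": not any(matched),  # no match at any window
        "positive_disjunctive": any(matched),      # a match at some window
        "positive_conjunctive": all(matched),      # a match at every window
        "negative_conjunctive": not all(matched),  # some window unmatched
    }

# Example: the single chunk 11 at window 0 excludes 110 but not 011 or 010
# under negative disjunctive detection (cf. the example in Section IV-C).
print(scheme_membership("110", {(0, "11")}, r=2)["negative_disjunctive"])  # False
print(scheme_membership("011", {(0, "11")}, r=2)["negative_disjunctive"])  # True
```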


C. r-Chunks Matching Subsumes rcb Matching

We say that L is a language recognized by Scheme_{c,p} using r-chunk [rcb] detectors if there exists a set D of r-chunk [rcb] detectors such that Scheme_{c,p}(D) = L. From the perspective of language recognition, but not necessarily efficiency, r-chunk detectors subsume rcb detectors, as we now demonstrate.

Lemma 3.1: If d is an rcb detector, there exists a collection C_d of r-chunk detectors such that, for every string s ∈ U and every window ω, d matches s at ω if and only if some member of C_d matches s at ω.

Proof: Take C_d to be the set of projections of d onto each of its windows, and the result follows immediately.

To illustrate the construction of C_d used in the proof of the previous lemma, an rcb detector d is replaced by the equivalent set of r-chunk detectors consisting of d's projection onto each of its ℓ − r + 1 windows, one r-chunk per window position.

It follows from the lemma that for each of the four detection schemes defined above, any set of strings that can be recognized with a set of rcb detectors can be recognized with some set of r-chunk detectors. In particular, we have the following theorem.

Theorem 3.1: If D is a set of rcb detectors, and c and p are any choices for the detector dimensions as above, then Scheme_{c,p}(D) = Scheme_{c,p}(C_D), where C_D is the union of the collections C_d over all d ∈ D.

Proof: It follows from Lemma 3.1 that for any of the membership predicates associated with any of the four detection schemes Scheme_{c,p} and for every string s ∈ U, s satisfies the predicate with respect to D if and only if s satisfies it with respect to C_D.

The converse of this theorem is not true in general, since r-chunk detectors have a finer granularity than do rcb detectors; a proper subset of C_d may match fewer strings in U than does the rcb detector d.

Example: Consider the detection scheme Scheme_{∨−}, and let ℓ = 3 and r = 2. Consider the pair of strings 011 and 010. We claim that whenever both strings 011 and 010 are in the language of Scheme_{∨−}(D) for a set D of rcb detectors, then so too must be the string 110. To see this, note that an rcb detector matching 110 must either end in the pattern 10 or begin with the pattern 11; but then, any way either of these patterns is completed (i.e., by specifying a bit for the remaining position) results in an rcb detector matching one of 011 or 010. Consequently, if Scheme_{∨−}(D) excludes 110 from the language it recognizes, it must also exclude at least one of 011 or 010. In contrast, if the r-chunk detector set D consists of the single detector 11 at the first window, Scheme_{∨−}(D) includes both 011 and 010 while excluding 110.

The results of the next section imply that the same result holds for Scheme_{∧+}, that is, that the class of languages recognized by Scheme_{∧+} using r-chunk detectors properly contains the class recognized by Scheme_{∧+} using rcb detectors.

D. Detection Schemes and the Crossover Closure

The following theorem, which was established by simple set-theoretic arguments, helps clarify the relationships between

the classes of languages recognized by the various detection schemes.

Theorem 4.1: Let D be any set of rcb or r-chunk detectors, and let D̄ denote the complement of D relative to the universe of all rcb or r-chunk detectors over the same alphabet and of the same length as the detectors in D. Similarly, the complement of the set Scheme_{c,p}(D) (which is a subset of U) is taken relative to U.

Then

Scheme_{∨−}(D) = U − Scheme_{∨+}(D)

Scheme_{∧−}(D) = U − Scheme_{∧+}(D)

Scheme_{∨−}(D) ⊆ Scheme_{∧+}(D̄)

Scheme_{∧−}(D) ⊆ Scheme_{∨+}(D̄)

Further, the subset containment is proper for some D.

It follows from these relations that the class of languages recognized by Scheme_{∨−} is identical to the class recognized by Scheme_{∧+}, when either rcb or r-chunk detectors are considered. Similarly, the class of languages recognized by Scheme_{∨+} is identical to the class recognized by Scheme_{∧−}, when either rcb or r-chunk detectors are considered.

As discussed in Section III, one of our most important detection applications is to protect the crossover closure of a sample sample_t(S). There is a strong relationship between crossover closure and the schemes Scheme_{∨−} and Scheme_{∧+} when r-chunk detectors are used.

Let sample_t(S) be any subset of U. Let W denote the set of windows present in sample_t(S), that is, the union of the projections of sample_t(S) onto each window (the window size is fixed and understood). Let W̄ denote the set of windows not present in sample_t(S), that is, the complement of W. We then have the following theorem.

Theorem 4.2: Scheme_{∨−}(W̄) = Scheme_{∧+}(W) = CC(sample_t(S)).

Proof: Scheme_{∧+}(W) is the set of strings each of whose window projections is present in sample_t(S), which is CC(sample_t(S)) by definition; Scheme_{∨−}(W̄) is the set of strings none of whose window projections is absent from sample_t(S), which is the same set.

Further, crossover closure exactly characterizes the class of languages recognized by Scheme_{∨−} and Scheme_{∧+} when r-chunk detectors are used. That is, we have the following theorem.

Theorem 4.3: The class of languages recognized by Scheme_{∨−} and Scheme_{∧+} when r-chunk detectors are used is exactly the class of sets closed under crossover closure.

Proof: If a set L is closed under crossover closure, i.e., CC(L) = L, then we can construct a set of detectors W from the windows present in L. It follows from Theorem 4.2 that Scheme_{∧+}(W) = Scheme_{∨−}(W̄) = L.

Conversely, by the definition of Scheme_{∧+}, s ∈ Scheme_{∧+}(D) if, for all windows ω, some detector in D matches s at ω, where D is an arbitrary set of r-chunk detectors. Collecting, for every window ω and every s ∈ Scheme_{∧+}(D), the projections s|ω gives exactly the windows present in Scheme_{∧+}(D); a string belongs to Scheme_{∧+}(D) exactly when each of its windows appears in this collection, so Scheme_{∧+}(D) equals its own crossover closure. Finally, using Theorem 4.2, the same holds for Scheme_{∨−}.


Results of the previous section imply that when rcb detectors are used, Scheme_{∨−} and Scheme_{∧+} can recognize only sets closed under crossover closure, but not all sets closed under crossover closure, since the class of languages recognized by these schemes when rcb detectors are used is properly contained in the class recognized when r-chunks are used.

Finally, consider the schemes Scheme  and Scheme  using r-chunks. These schemes do not recognize all sets closed under crossover closure but do recognize some sets not closed under crossover closure. To see this, we first observe that there exist sets S such that S is closed under crossover but its complement is not. For example, let l = 3, r = 2, and S = {000}. CC(S) = S, and hence, S is closed under crossover. The complement of S contains the other seven length-3 strings, and its crossover closure is the entire universe, so it is not closed under crossover.

Therefore, we have the following.

1) Scheme  and Scheme  cannot recognize all sets closed under crossover because if Scheme  recognized a set S closed under crossover such that the complement of S is not closed under crossover, then Scheme  Scheme  would imply that Scheme  recognizes the complement of S, which contradicts Theorem 4.3.

2) Scheme  and Scheme  can recognize some sets not closed under crossover because Scheme  can recognize a set S closed under crossover such that the complement of S is not closed under crossover, and Scheme  Scheme  then implies that the complement of S, a set not closed under crossover, is recognized.

V. PARTIAL MATCHING AND GENERALIZATION

This section is devoted to Scheme  and Scheme  under the r-chunks matching rule. These schemes both recognize the sets closed under crossover, but they have different properties in terms of implementation requirements. These differences include distributivity, scalability, and algorithm efficiencies. This section explores the size of the detector set and determines its expected size as a function of the number of strings in a randomly generated sample.

A. Expected Number of Unique Detectors Under r-Chunks

Given a set S, it is straightforward to compute exactly how many detectors will be generated for maximal protection, both for the positive and negative detection schemes (Scheme  and Scheme ) using the obvious generation method. For Scheme , it requires counting the number of distinct patterns for each of the l - r + 1 windows that comprise the strings in S, whereas for Scheme , enumerating the distinct patterns that are not present in each window gives the number of detectors. To provide an estimate of the average number of detectors, for both cases, as a function of the size of S, we note the following.

• The number of strings in U with a specific pattern in any given window of size r is $2^{l-r}$.

• The probability of selecting at random one such string is $2^{l-r}/2^{l} = 2^{-r}$.

• Assuming $N_S$ is small compared with $2^{l-r}$, we approximate the probability that a randomly generated self set of $N_S$ unique strings will contain a specific pattern by $1-(1-2^{-r})^{N_S}$. The equation is approximate because it considers trials to be independent and sampling to take place with replacement.

• Following the previous item, the expected number of distinct patterns, for a given window of size r, is approximated by $2^{r}\left[1-(1-2^{-r})^{N_S}\right]$.

Fig. 2. Number of positive and negative detectors as a function of the size of the self set, assuming the self set was generated randomly. The plot shows both complete and reduced detector sets.

The following equation denotes the expected size of a set of detectors having the property that each member matches one window and that all windows are matched:

$(l-r+1)\,2^{r}\left[1-(1-2^{-r})^{N_S}\right]$   (1)

Conversely, the expected number of detectors possible for the negative detection scheme (i.e., detectors that do not match any window pattern in S) is given by

$(l-r+1)\,2^{r}(1-2^{-r})^{N_S}.$   (2)

To establish when one scheme requires a smaller set of detectors than the other for maximal coverage, we determine the number of strings in S for which both schemes yield the same number of detectors, i.e., when (1) equals (2) (see Fig. 2):

$1-(1-2^{-r})^{N_S} = (1-2^{-r})^{N_S}.$

Solving for $N_S$

$N_S = \frac{\ln(1/2)}{\ln(1-2^{-r})}.$   (3)

Although we are primarily concerned with binary strings, it is worthwhile to note that for an arbitrary alphabet of size m, the above equation generalizes to $N_S = \ln(1/2)/\ln(1-m^{-r})$, which affects the point at which one scheme is preferable to the other, in terms of the size of the detector set.
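The estimates above can be tabulated directly. The sketch below is our own illustration, based on the approximations in (1)-(3) as reconstructed here; the parameter values are arbitrary:

```python
import math

def expected_detectors(n_s, l, r):
    """Expected full positive/negative detector-set sizes for a random self set of
    n_s unique strings, following the approximations of (1) and (2)."""
    windows = l - r + 1
    p_present = 1.0 - (1.0 - 2.0 ** (-r)) ** n_s   # a given r-pattern appears in a window
    positive = windows * 2 ** r * p_present
    negative = windows * 2 ** r * (1.0 - p_present)
    return positive, negative

def crossover_sample_size(r, alphabet=2):
    """Sample size at which both schemes need the same number of detectors, per (3)."""
    return math.log(0.5) / math.log(1.0 - alphabet ** (-r))

l, r = 32, 8                      # hypothetical parameter choice
for n_s in (50, 200, 1000):
    pos, neg = expected_detectors(n_s, l, r)
    print(f"N_S={n_s:5d}  positive~{pos:10.1f}  negative~{neg:10.1f}")
print("break-even sample size:", crossover_sample_size(r))  # roughly 0.69 * 2**r
```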

In a worst-case scenario, Scheme  may require $(l-r+1)\,2^{r}$ detectors, when every pattern appears in every window of S. Similarly, Scheme  can also yield up to $(l-r+1)\,2^{r}$ detectors, but only when there are no self strings whatsoever.

1) Reduced Detector Set for Negative Detection: The number of detectors needed in Scheme  to protect S can actually be smaller than the full set described above, since significant redundancy amongst detectors may be present. We examine this property in detail for the case of a binary alphabet. Number the windows from left to right.

• Regardless of the composition of S, a detector must be generated for each pattern in the first window that is not present in S.

• For every window of size r, starting from the second window, and for every pair of patterns $b_1 \cdots b_{r-1}0$, $b_1 \cdots b_{r-1}1$ in such a window, we have the following.

—If both patterns are present in self, then we cannot generate any detector for them.

—If neither is present, then we need not generate detectors for them, since the preceding window will be missing its prefix, i.e., $0b_1 \cdots b_{r-1}$ or $1b_1 \cdots b_{r-1}$, and a detector in the previous window will also match strings with such a pattern.

—If only one is present, say $b_1 \cdots b_{r-1}0$, then we must generate the detector $b_1 \cdots b_{r-1}1$ since no string in S contains it.

With this in mind, the average number of detectors needed in the minimal set is given by

$2^{r}(1-2^{-r})^{N_S} + (l-r)\,2^{r}\left[1-(1-2^{-r})^{N_S}\right](1-2^{-r})^{N_S}.$   (4)

Following the previous rationale, we can also set an upper bound on the number of detectors required in the minimal set. The maximum number of self strings we can have without creating crossovers, thereby reducing the number of required detectors, will exhibit only one of every pair of patterns $b_1 \cdots b_{r-1}0$, $b_1 \cdots b_{r-1}1$ for each window. This results from a maximum of $2^{r-1}$ distinct self strings for a detector set size of $(l-r+1)\,2^{r-1}$. Fig. 2 shows the plots of the expected number of detectors for both the full and reduced detector sets.

2) Reduced Detector Set Size for Positive Detection: Using similar reasoning to that of Section V-A1, we can find a significant amount of redundancy amongst detectors in the form of implicit matches. Consider the case where a window contains both patterns $b_1 \cdots b_{r-1}0$ and $b_1 \cdots b_{r-1}1$; then, it must be that the preceding window has either or both of the patterns ending with bits $b_1 \cdots b_{r-1}$, and therefore, a string matched in the preceding window by one of these patterns will also be matched in the current window by either $b_1 \cdots b_{r-1}0$ or $b_1 \cdots b_{r-1}1$. We call such a match an implicit match and eliminate $b_1 \cdots b_{r-1}0$ and $b_1 \cdots b_{r-1}1$ from the detector repertoire. The number of detectors in the resulting set can be expressed as

$2^{r}\left[1-(1-2^{-r})^{N_S}\right] + (l-r)\,2^{r}\left[1-(1-2^{-r})^{N_S}\right](1-2^{-r})^{N_S}.$   (5)

The size of the sample for which (4) and (5) yield the same number of detectors is the same as with the full repertoire, i.e., $(1-2^{-r})^{N_S} = 1/2$ and (3).

One subtlety about this analysis is that if not all positive detectors are represented explicitly, then some additional information is required to identify implicit matches. This could be stored explicitly, requiring at the very least one bit per implicit match, or it could be derived at runtime by determining, once a match (or implicit match) at a window has been established, whether both completions of the overlapping r - 1 bits are absent from the explicit detectors of the following window.3 A plot of (5) for different sample sizes is presented in Fig. 2 without regard to the extra information required to determine implicit matches.

3If the information is stored explicitly, the extra bits can be "recovered" as a decrease in execution time.

Fig. 3. Crossover closure given a sample S of real self data RS.

B. Crossover Closure and Its Expected Size

One useful way to visualize the generalization induced by the r-chunks detectors is to construct a DAG in which the number of vertices is the number of distinct bit patterns of length r over all windows of strings in S. Vertices are labeled by a pair (i, p), where i represents the window number and p stands for the bit pattern. An edge joins (i, p) and (i + 1, q) exactly when the last r - 1 bits of pattern p match the first r - 1 bits of pattern q, where $1 \le i \le l-r$. Under this representation, the crossover closure is exactly the set of strings formed by all traversals of the graph from levels 1 to l - r + 1.
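A brute-force rendering of this construction is sketched below (our own illustration; the dictionary-based representation of levels and edges is an assumption). Enumerating all traversals from the first level to the last reproduces the window-by-window crossover closure:

```python
def build_dag(sample, r):
    """Level i holds the distinct window-i patterns; an edge joins patterns of
    adjacent windows whose r-1 overlapping bits agree."""
    l = len(next(iter(sample)))
    levels = [sorted({s[i:i + r] for s in sample}) for i in range(l - r + 1)]
    edges = {(i, p): [q for q in levels[i + 1] if p[1:] == q[:-1]]
             for i in range(len(levels) - 1) for p in levels[i]}
    return levels, edges

def traversals(levels, edges):
    """All strings spelled out by paths from the first level to the last."""
    r = len(levels[0][0])
    paths = list(levels[0])
    for i in range(len(levels) - 1):
        paths = [s + q[-1] for s in paths for q in edges[(i, s[-r:])]]
    return set(paths)

sample = {"0110", "1100"}             # hypothetical self set
levels, edges = build_dag(sample, r=2)
cc = traversals(levels, edges)
print(sorted(cc))                     # -> ['0100', '0110', '1100', '1110']
print("holes:", sorted(cc - sample))  # -> ['0100', '1110']
```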

Take, for instance, a self set comprised of two strings (see Fig. 3) and construct the corresponding graph, with one vertex for each distinct window pattern of the two strings and an edge between patterns of adjacent windows that agree on their overlapping bits; in the visual depiction, the window number is omitted from the vertex labels.4

4The labels for window numbers could be recovered from the level by using a topological sort.

Recovering the strings by traversing the preceding graph, we find that the traversals yield exactly the crossover closure of the self set. In order to determine the size of the crossover closure for a randomly generated sample, we note that the number of substrings spanning windows 1 through i + 1 that contain pattern p in window i is double the number of such substrings spanning windows 1 through i if there are strings in S that exhibit both one-bit extensions of p in window i + 1, and it stays the same if exactly one of these is present. In terms of our DAG representation, the number of paths that include a node with a given label doubles when such a node has two outgoing edges. We can write this as a recurrence on the number of windows w

$T(w)=\begin{cases}2^{r}\left[1-(1-2^{-r})^{N_S}\right], & w=1\\ (1+p_2)\,T(w-1), & \text{otherwise.}\end{cases}$

Solving the recurrence yields

$T(l-r+1)=2^{r}\left[1-(1-2^{-r})^{N_S}\right](1+p_2)^{l-r}$   (6)

where $p_2$ is the probability of a node having two outgoing edges. In order to establish the value of $p_2$, consider the gadget formed by a node, the node at the same level that differs from it only in the first bit position, and their two potential successors in the next level.

The probability for a given edge to be present is approximated by  and its absence by .

Given that the likelihood of a node having only one outgoing edge is not independent of the probabilities related to the second node in the graph (a node at the same level differing only in the first bit position), we consider the probability of either node having one outgoing edge:

Similarly, for a node in the gadget to have no outgoing edges (or for the node to be absent)

Finally, the probability of an individual node having two outgoing edges is given by

The number of holes can be derived by simply subtracting the sample size from the expected size of the crossover closure:

$2^{r}\left[1-(1-2^{-r})^{N_S}\right](1+p_2)^{l-r} - N_S.$   (7)

The size of the generalization can be calculated from the size of S or from the size of the detector set (obtaining an estimate for the per-window pattern count from either (2) or (1) and substituting in (6)). It is important to note that the actual size will depend on the structure of the specific self set. Nevertheless, the analysis provides insight into its behavior (see Fig. 4) and enables us to ascertain the impact of allowing novel strings into the sample. This can be useful for determining, in a dynamic scenario, when (or at what rate) detectors should be added or deleted from the working set.

VI. PERMUTATIONS AND DIVERSITY

In [8] and [14], an additional mechanism was introduced to improve discrimination between self and nonself. This mechanism is loosely modeled after the diversity of major histocompatibility complex (MHC) molecules used in the natural immune system. In the artificial setting, multiple permutations of the l-length strings (both data and detectors) are stored. This, combined with the contiguous-bits match rules, provides the system with diversity of representation and improves the discrimination abilities of the system.

Fig. 4. Number of holes as a function of self-sample size for two parameter settings. The graph on the left illustrates how fast the generalization increases when r is small compared with the string length. The graph on the right shows how, after the generalization has reached its maximum, the number of holes slowly decreases as more strings are added to the sample S.

A. Permutation Maps for Reducing the Number of Holes

If the set of holes is too large, accurate discrimination between self and nonself will be unachievable. Hofmeyr introduced the concept of permutation masks as a way to control the size of this set. A permutation mask is a reversible mapping that specifies a reordering of bits for all strings in U. In his network intrusion-detection data set, he reported an improvement in detection ability by about a factor of three when permutation masks were used.5 By controlling the number of permutations used, he was able to achieve greater control over the r-contiguous bits generalization. One interesting property of this transformation is that if we consider all nonredundant permutation masks, the r-contiguous bits matching rule is able to protect strictly more sets than Hamming distance.

Theorem 1.1: The class of languages recognized by r-contiguous bits, with permutation masks, properly contains that recognized by Hamming distance.

Proof:

• Let x, d be strings of length l, where d is a detector.

• Let Rcb(x, d) be a function that returns the length of the longest run of contiguous bits that match between strings x and d. String x is said to match string d if Rcb(x, d) is at least r.

• Let H(x, d) be a function that returns the Hamming distance between strings x and d. String x is matched by a string d using the Hamming distance rule if x and d differ in, at most, m bit positions, i.e., H(x, d) is at most m.

5Hofmeyr's dataset was collected using a variant of pure permutations that was computationally more efficient.


• Let π(x) denote some permutation of string x.

• Let L(Rcb) denote the class of languages recognized by Rcb with permutation masks for a given r and a detector set D.

• Let L(H) denote the class of languages recognized by H given m and a set of detectors D.

The class of languages recognized by Rcb, with permutation masks, contains the class of languages recognized by H:

1) Rcb  for all permutations π, and hence, [d is a valid detector under H] implies [d matches no member of the language in L(Rcb) under Rcb and any permutation π].

2) Further, if d matches a string x under H, we show there exists a permutation π such that π(d) matches π(x) under Rcb, that is, x is matched in L(Rcb). Label the bits that differ between d and x as $d_{i_1}, \ldots, d_{i_k}$, and the bits that match as $d_{j_1}, \ldots, d_{j_{l-k}}$. We define π as the permutation that places the matching bit positions first, followed by the differing bit positions. Clearly, π is a valid permutation, and Rcb(π(x), π(d)) is at least l - k. By (1), d is a valid detector for L(Rcb) that matches x. Hence, if L is a language recognized by some set of detectors under H, we have shown how to construct an associated set of permutations such that this same set of detectors recognizes exactly L also under Rcb.

In contrast, H does not recognize all languages recognized by Rcb with permutation masks because a detector under H might match more than it matches under a single permutation and Rcb and thus fail to be a valid detector under H.
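The two match predicates and the permutation constructed in step 2 of the proof can be written out as follows (a sketch with function names of our own choosing; the particular strings and the threshold m are arbitrary illustrations):

```python
def rcb(x: str, d: str) -> int:
    """Length of the longest run of contiguous positions on which x and d agree."""
    best = run = 0
    for a, b in zip(x, d):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best

def hamming(x: str, d: str) -> int:
    return sum(a != b for a, b in zip(x, d))

def permutation_from_match(x: str, d: str):
    """Permutation of step 2: matching bit positions first, differing ones last."""
    match = [i for i, (a, b) in enumerate(zip(x, d)) if a == b]
    differ = [i for i, (a, b) in enumerate(zip(x, d)) if a != b]
    return match + differ

def apply(pi, s: str) -> str:
    return "".join(s[i] for i in pi)

# If d matches x under Hamming distance with threshold m, the constructed
# permutation makes the l - m (or more) matching bits contiguous, so the
# permuted strings rcb-match for any r up to l - m (illustrative check).
x, d, m = "101101", "100100", 2     # hypothetical strings, hamming(x, d) == 2
assert hamming(x, d) <= m
pi = permutation_from_match(x, d)
assert rcb(apply(pi, x), apply(pi, d)) >= len(x) - m
```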

Example: Consider l = 3 and a self set S for which, under Rcb, we can generate a valid detector to match the string 000, whereas under H, the potential detectors for 000 are 000, 001, 010, and 100, all of which have a Hamming distance of less than two from 000 but also from strings in S. Therefore, 000 is undetectable under H but detectable under Rcb with permutation masks; hence, L(H) is properly contained in L(Rcb).

For those sets that Hamming distance can recognize, the Rcb rule may require multiple sets of detectors, each with its own transformation. This mechanism is suitable for a detection scheme that requires precise control over the generalization or that must be distributed.

To illustrate how the permutation masks can in fact reduce the number of holes, consider two self strings and their crossover closure. There are two crossover strings, other than the self strings themselves, for which no detector can be generated since neither string has a distinctive window that separates it from self. If we apply the permutation by which the third bit position is placed next to the first bit, then we can generate two detectors that will be able to match the two crossover strings, respectively.
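The effect of a permutation mask on the number of holes can be checked by brute force for small parameters. The sketch below is a hypothetical illustration (the self strings and the mask are our own choices, not the ones from the example above):

```python
from itertools import product

def crossover_closure(strings, r):
    l = len(next(iter(strings)))
    present = [{s[i:i + r] for s in strings} for i in range(l - r + 1)]
    universe = ("".join(b) for b in product("01", repeat=l))
    return {x for x in universe
            if all(x[i:i + r] in present[i] for i in range(l - r + 1))}

def permute(s, mask):
    return "".join(s[i] for i in mask)

self_set = {"0011", "0110"}          # hypothetical self strings, l = 4, r = 2
mask = (0, 3, 1, 2)                  # permutation mask: last bit moved next to the first

holes = crossover_closure(self_set, 2) - self_set
permuted = {permute(s, mask) for s in self_set}
holes_masked = crossover_closure(permuted, 2) - permuted
print(len(holes), "holes without the mask,", len(holes_masked), "with it")  # 2 vs 0
```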

TABLE I. UNDETECTABLE STRINGS UNDER EVERY PERMUTATION, FOR A GIVEN SET S AND THE rcb MATCHING RULE

TABLE II. UNDETECTABLE STRINGS UNDER EVERY PERMUTATION, FOR A GIVEN SET S AND THE r-CHUNKS MATCHING RULE

B. Completeness

We can now ask whether there are languages that cannot be recognized by rcb, even if permutation masks are used. That is, are there still strings in nonself that go undetected even if every permutation mask is employed? The answer to this question is yes; as pointed out in [3], all practical match rules with a constant matching probability will exhibit holes.

We show, by construction, that there are languages that cannot be recognized under rcb matching with permutations. Let S and r be as given in Table I, and consider the hole listed there. Table I lists all the possible permutations of the strings involved and the substrings out of which a detector could be constructed for the hole under each permutation. As can be seen from Table I, under each permutation, there are no available templates (that is, templates that do not match some string in S), and thus, neither rcb Scheme  nor rcb Scheme  can accept the strings in S while excluding the hole. Although we can construct such examples, it is difficult to characterize the set of all such holes under permutations.

Lifting the restriction that detectors be of length l enables an r-chunk detector to match the string from the previous example. Nevertheless, there are languages that cannot be recognized under the r-chunks matching rule (Scheme  or Scheme ) augmented with permutation masks. Consider a scenario with l = 3: Table II lists all six permutations of a self set S and of a string to be detected, together with the possible r-chunks detectors needed to match that string.

In general, a string cannot be detected by Scheme  or Scheme  under any permutation if all of its template combinations are in self. The template combinations of a string are the templates (strings with unspecified bits) that result from all the possible ways of specifying only r bits. For instance, with l = 3 and r = 2, the template combinations of a string $b_1 b_2 b_3$ are $b_1 b_2 *$, $b_1 * b_3$, and $* b_2 b_3$, where * denotes "don't care." It is easy to see that only if one such template does not exist in S can there exist a permutation that places these bit positions contiguously without creating the same contiguous bit pattern in S.
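The template test can be made concrete as follows (a sketch; representing templates as strings with '*' for unspecified bits is our own convention, and the example self set is hypothetical):

```python
from itertools import combinations

def template_combinations(x: str, r: int):
    """All ways of keeping r specified bits of x and wildcarding the rest."""
    l = len(x)
    for positions in combinations(range(l), r):
        yield "".join(x[i] if i in positions else "*" for i in range(l))

def template_in_self(template: str, self_set) -> bool:
    return any(all(t == "*" or t == c for t, c in zip(template, s)) for s in self_set)

def has_unmatched_template(x: str, self_set, r: int) -> bool:
    """Necessary condition for x to be detectable under some permutation:
    at least one of its r-bit templates must be absent from self."""
    return any(not template_in_self(t, self_set)
               for t in template_combinations(x, r))

self_set = {"110", "011", "101"}                       # hypothetical self set, l = 3
print(has_unmatched_template("111", self_set, r=2))    # -> False: undetectable
```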


VII. DISCUSSION

The preceding sections developed a formal framework for studying tradeoffs between negative and positive detection and presented several theoretical results that we believe are important to a wide variety of practical problems. In this section, we explore some of the implications and potential extensions of our theoretical results.

A. Computer Security Applications

Over the past several years, a number of anomaly intrusion-detection systems have been developed for computer security, which explore different instances of the detection schemes described in this paper. These systems were reviewed in Section II; some use positive detection [2], [5], [9], and some use negative detection [1], [8]. Likewise, some of these systems use r-contiguous bits matching [1], [8], some use r-chunks [11], some use n-gram matching [5], and some use a variant of n-gram matching known as "lookahead pairs" [2], [9]. Although the negative-selection strategy of the immune system has received a great deal of attention, it should be noted that the immune system also uses positive selection, combining it advantageously with negative selection.

In some cases, we have good experimental evidence for preferring one scheme over another. For example, n-grams outperformed lookahead pairs on our early data sets in terms of absolute discrimination ability (unpublished). However, when efficiency matters, as in the case of online detection, then the lookahead-pairs method is a clear winner, paying a small penalty in terms of discrimination ability. A second example occurred when we compared rcb detection to r-chunks on a network intrusion-detection task [10], [11]. Such results are anecdotal in the sense that they were obtained by experimentally testing one or two methods on a limited number of data sets. Without some theory to guide us, it is difficult to determine how much a given result depends on the particular data set used and how much it depends on the choice of method. Likewise, it has been difficult to determine which aspects of a given detection scheme are most responsible for its success (or failure), e.g., negative detection, match rule, or parameter settings.

The results presented in this paper allow us to begin approaching such questions from a theoretical perspective. In particular, the notion of a crossover closure, and its relation to rcb and r-chunks matching, will allow us to understand more deeply when and why these matching methods are preferable to more familiar methods such as Hamming distance. Likewise, the closed-form expressions for detector set sizes, both for positive and negative detection, allow us for the first time to predict how large a problem must be before it pays to use a negative-detection scheme. Until now, the negative-detection approach has been somewhat of a curiosity, notorious as much for its immunological metaphor as for its demonstrated advantages over other methods. Results such as those presented here will form the basis of a more objective evaluation.

B. Relational Databases

Section III introduced the relationship between crossover closure and relational database theory. Specifically, we showed that the crossover closure of a set of strings is equivalent to the natural join of those strings (interpreted as tuples) projected onto a relational decomposition scheme. There are several aspects of this equivalence that we find interesting.

First, we can characterize when a set of strings is closed under crossover closure in terms of well-studied structural properties known as database dependencies [53], [63]. A set of strings is closed under crossover closure exactly when it decomposes losslessly into the relation schemes $(A_i, \ldots, A_{i+r-1})$, $1 \le i \le l-r+1$, induced by its r-length windows, as we showed in Section III-A. Although the structural characterization of a lossless join decomposition is complex in the general case, the specialized form of the above decomposition simplifies the condition considerably; the characterization reduces to the adherence to a collection of simple multivalued dependencies on the original relation scheme $R = (A_1, \ldots, A_l)$. In order for the decomposition to be lossless, one of the following multivalued dependencies on $R$ must hold for each i, $2 \le i \le l-r+1$:

$A_i \cdots A_{i+r-2} \twoheadrightarrow A_1 \cdots A_{i-1}$   or   $A_i \cdots A_{i+r-2} \twoheadrightarrow A_{i+r-1} \cdots A_l.$

Note that, for example, the multivalued dependency $A_i \cdots A_{i+r-2} \twoheadrightarrow A_1 \cdots A_{i-1}$ asserts that if $R$ contains the tuples

$(a_1, \ldots, a_{i-1},\ x_i, \ldots, x_{i+r-2},\ b_{i+r-1}, \ldots, b_l)$ and $(a'_1, \ldots, a'_{i-1},\ x_i, \ldots, x_{i+r-2},\ b'_{i+r-1}, \ldots, b'_l)$

then it must also contain the tuples

$(a_1, \ldots, a_{i-1},\ x_i, \ldots, x_{i+r-2},\ b'_{i+r-1}, \ldots, b'_l)$ and $(a'_1, \ldots, a'_{i-1},\ x_i, \ldots, x_{i+r-2},\ b_{i+r-1}, \ldots, b_l).$

This condition describes the semantics of collections of information that occur naturally within a vast number of diverse application domains. Whether such dependencies arise naturally among sets of strings we wish to protect remains to be investigated.

Taking the connection one step further, we can investigate the correspondence between the schemes Scheme  and Scheme , which recognize the class of sets closed under crossover closure, and query systems for relational databases. In particular, if one is able to recognize with a set of detectors (possibly significantly pruned to remove redundancy as described in Section V-A2) a given set closed under crossover closure, one can also represent by the same means any instance of a relation scheme obeying the dependencies specified above. The negative detection scheme Scheme  would be most appropriate for membership queries, whereas Scheme  would be appropriate for queries requiring enumeration of the member tuples. Conversely, known results from database theory regarding minimal representations translate to lower bounds on time and space requirements for these detection schemes.

C. Dynamic Samples

As mentioned in Section III, we expect our samples, and hence our distribution of the next packet generated by the legitimate process, to change with time. Some factors contributing to this include the following.

1) If the string generated at time t is not in the sample at time t, it may become part of the sample at time t + 1, which then properly contains the sample at time t; consequently, the distribution of the string generated at time t + 1 differs from that at time t. This assumes we have a mechanism for deciding that the string is generated by the legitimate process, despite the fact that we flag it as a low-probability string (e.g., it passes some investigative step).

2) We might want to delete packets from the sample that have not occurred recently. This might or might not alter the crossover closure of the sample and, hence, the distribution of the string generated at the next instant.

Extending our analytical treatment to include dynamically changing samples raises a number of questions. In principle, when a new sample is created at time t, a new set of detectors could be generated de novo. However, this is likely to be extremely inefficient. An area of future investigation is to analyze the techniques used to update the detector set in response to the specific types of perturbations in the sample.

D. Distance Measures

If we can define a metric over a match rule, then our detection schemes can be rewritten in terms of a distance measure. For instance, using r-chunks

• Scheme ;
• Scheme ;
• Scheme ;
• Scheme ;

where d(x, S) is the number of windows that are present in string x but not in any string in S. We are interested in exploring schemes that are intermediate between conjunctive and disjunctive detection (both positive and negative). In these intermediate schemes, a string may be treated as self even if not all its window patterns have been observed before, i.e., x is treated as self if d(x, S) is at most some threshold. Intuitively, strings that differ a small amount from the sample are more likely to be a part of self than those that differ a lot. Extending this idea one step further, one can imagine distance measures that do not assign a uniform value to every match but instead take into account structural and statistical properties of the sample in order to weigh the relative merits of distinct matches.
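A minimal sketch of such a thresholded scheme, assuming the distance simply counts novel windows uniformly (the names window_distance and threshold, and the example sample, are our own):

```python
def window_distance(x: str, sample, r: int) -> int:
    """Number of windows of x whose pattern appears in no string of the sample."""
    l = len(x)
    present = [{s[i:i + r] for s in sample} for i in range(l - r + 1)]
    return sum(x[i:i + r] not in present[i] for i in range(l - r + 1))

def is_self(x: str, sample, r: int, threshold: int = 0) -> bool:
    """Intermediate scheme: treat x as self if at most `threshold` windows are novel.
    threshold = 0 recovers conjunctive positive detection."""
    return window_distance(x, sample, r) <= threshold

sample = {"00110", "11001"}                      # hypothetical sample, l = 5
print(window_distance("00101", sample, r=3))     # -> 2
print(is_self("00101", sample, r=3, threshold=1))  # -> False
```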

E. Implementation Issues

Although we have emphasized the representational power of different detection schemes in this paper, there are important implementation considerations that affect the choice of a match rule and detection scheme. Here, we briefly discuss the r-chunks match rule and contrast it with rcb, although, as shown in Section IV-C, the classes of languages recognized by them are distinct. Nevertheless, the comparison highlights some important properties of r-chunks. We have thus far described our detection schemes as a set of detectors, first because we find it useful for visualizing and analyzing their properties and, second, because we consider that some applications might benefit from distributing them throughout several nodes; thus, we make no assumptions as to their location. Nevertheless, it is important to note that there is a large body of work [67] regarding string matching that suggests a more efficient representation of detectors along with the algorithms that operate on them but imposes additional restrictions on the ease of distributing them. With this in mind, we consider three implementation issues: the cost of generating detectors, the cost of storing detectors, and the cost of locating detectors.

1) Detector Generation: Given a self set S, a straightforward method for generating the appropriate r-chunks detectors is to search the entire sample to determine which patterns are present in each window, requiring $O(N_S(l-r+1))$ time and $O((l-r+1)\,2^{r})$ space. Note that the detector set can be sorted during its generation at no extra cost (using bin sort and assuming ). There have been several algorithms proposed for generating detectors for the rcb match rule [3], [13], [18], [19] that report linear generation time and extra space.

2) Detector Storage: Intuitively, r-chunks requires more space than rcb since every rcb detector (l bits each) can be decomposed into l - r + 1 r-chunks detectors (r bits each). Nevertheless, depending on the particular structure of S, there is likely to be a significant number of repeated patterns in each window amongst the set of rcb detectors. Because r-chunks represents each distinct pattern only once, this could lead to smaller space requirements in some cases. In general, the space required for r-chunks is at most $(l-r+1)\,2^{r}$ patterns of r bits each.

3) Detector Location: Having detectors specific for each window (as in r-chunks) allows them to be stored in sorted order for each window. Thus, checking to see if a string belongs to self or not requires a binary search in each of the l - r + 1 windows, whereas inserting or deleting a detector requires just a single binary search. There is also a potential that hashing schemes could reduce the search time.
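For illustration, the following sketch keeps one sorted pattern list per window and answers membership queries with a binary search per window (our own code; the use of Python's bisect module is an implementation choice, not the paper's):

```python
from bisect import bisect_left, insort

class ChunkDetectorStore:
    """r-chunks detectors kept as one sorted list of patterns per window."""

    def __init__(self, l: int, r: int):
        self.r = r
        self.windows = [[] for _ in range(l - r + 1)]

    def add(self, window: int, pattern: str) -> None:
        insort(self.windows[window], pattern)   # binary search plus insert shift

    def contains(self, window: int, pattern: str) -> bool:
        w = self.windows[window]
        i = bisect_left(w, pattern)
        return i < len(w) and w[i] == pattern

    def matches(self, s: str) -> bool:
        """True if any window of s is matched by a detector (negative detection)."""
        return any(self.contains(i, s[i:i + self.r]) for i in range(len(self.windows)))

store = ChunkDetectorStore(l=4, r=2)
store.add(0, "11")                  # hypothetical detector: pattern 11 at the first window
print(store.matches("1100"))        # -> True
print(store.matches("0011"))        # -> False
```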

F. Crossover Closure and Genetic Algorithms

There is a tantalizing connection between the crossover closure described in this paper and the crossover operator often used in genetic algorithms. The crossover closure contains only a subset of all possible one-point crossovers in the self set, and therefore, the generalization achieved by r-contiguous bits is not identical with the space of coordinate hyperplanes defined by the crossover operator. Likewise, the r-chunks matching rule is nothing more than a restricted form of the schema notation often used to explain genetic algorithm behavior. An additional area of future investigation is to make these connections more precise and to determine if the Schema theorem from genetic algorithms bears any relation to the processing performed by either of the contiguous bits match rules.


VIII. CONCLUSION

In this paper, we presented a formal framework for analyzing different positive and negative detection schemes in the context of approximate matching. Although the framework we presented is quite general, the primary application we have in mind is anomaly intrusion detection, for example, detecting anomalous TCP connections in a local area network or detecting anomalous patterns of system calls in executing processes. We gave examples of how different partial matching rules fit into our scheme, including Hamming distance, r-contiguous bits, and n-grams. We characterized the generalization induced by r-contiguous bits (and its relative r-chunks) by defining the crossover closure of a sample of l-length strings. Next, we showed that the crossover closure is related to the concept of a lossless join in relational database theory.

With this formal apparatus in place, we were then able to give some theoretical results on the relative power of the different detection schemes and matching rules. In particular, we showed that the r-chunks match rule subsumes r-contiguous bits matching, that (under rcb or r-chunks) the class of languages recognized by negative disjunctive detection is the same as that of positive conjunctive detection, and that the crossover closure of a sample exactly characterizes the class of languages recognized by negative disjunctive detection and positive conjunctive detection (under r-chunks matching).

Next, we considered the number of detectors that are required to provide maximal discrimination for fixed S under Scheme  and Scheme . We gave closed-form expressions for both schemes, allowing us for the first time to estimate how large a self set must be before negative detection is a computationally advantageous strategy. We then discussed the expected size of the crossover closure and explored how the use of permutations of the representation can reduce the size of the crossover closure. Finally, we showed that the class of languages recognized by r-contiguous bits, augmented with permutation masks, properly contains that recognized by Hamming distance, and we showed that even with the use of permutations, there are some languages that cannot be recognized using r-contiguous bits matching.

The theoretical results presented here are significant because they address a large and quickly expanding body of experimental work in intrusion detection that uses one of the detection schemes. It is our hope that these theoretical results will make it easier to understand the experimental results and to predict which approaches are most promising for which problems.

ACKNOWLEDGMENT

The authors would like to thank J. Holland, A. Tyrrell, W. Langdon, and the anonymous reviewers, whose comments greatly improved the presentation of this work.

REFERENCES

[1] S. Forrest, A. S. Perelson, L. Allen, and R. Cherukuri, "Self-nonself discrimination in a computer," in Proc. IEEE Symp. Research in Security and Privacy, Los Alamitos, CA, 1994.

[2] S. Forrest, S. Hofmeyr, A. Somayaji, and T. Longstaff, "A sense of self for Unix processes," in Proc. IEEE Symp. Computer Security Privacy, 1996.

[3] P. D'haeseleer, S. Forrest, and P. Helman, "An immunological approach to change detection: Algorithms, analysis and implications," in Proc. IEEE Symp. Computer Security Privacy, 1996.

[4] D. Dasgupta and S. Forrest, "Novelty detection in time series data using ideas from immunology," in Proc. Int. Conf. Intelligent Systems, 1996.

[5] S. Hofmeyr, A. Somayaji, and S. Forrest, "Intrusion detection using sequences of system calls," J. Comput. Security, vol. 6, pp. 151–180, 1998.

[6] D. J. Smith, S. Forrest, D. H. Ackley, and A. S. Perelson, "Variable efficacy of repeated annual influenza vaccination," Proc. Nat. Acad. Sci. (PNAS), vol. 96, pp. 14001–14006, 1999.

[7] S. Hofmeyr and S. Forrest, "Immunity by design: An artificial immune system," in Proc. Genetic Evolutionary Computation Conf., San Francisco, CA, 1999, pp. 1289–1296.

[8] ——, "Architecture for an artificial immune system," Evol. Comput. J., vol. 8, no. 4, pp. 443–473, 2000.

[9] A. Somayaji and S. Forrest, "Automated response using system-call delays," in Proc. USENIX Security Symp., 2000.

[10] J. Balthrop, S. Forrest, and M. Glickman, "Revisiting LISYS: Parameters and normal behavior," in Proc. Congr. Evolutionary Computation, 2002.

[11] J. Balthrop, F. Esponda, S. Forrest, and M. Glickman, "Coverage and generalization in an artificial immune system," in Proc. Genetic Evolutionary Computation Conf., 2002.

[12] D. L. Chao and S. Forrest, "An aesthetic immune system (tentative title)," in Proc. Artificial Life VIII: The 8th Int. Conf. Simulation Synthesis Living Systems, May 2002.

[13] P. D'haeseleer, "An immunological approach to change detection: Theoretical results," in Proc. 9th IEEE Computer Security Foundations Workshop, 1996.

[14] S. Hofmeyr, An Immunological Model of Distributed Detection and its Application to Computer Security. Albuquerque, NM: Univ. New Mexico Press, 1999.

[15] P. D. Williams, K. P. Anchor, J. L. Bebo, G. H. Gunsch, and G. D. Lamont, "CDIS: Toward a computer immune system for detecting network intrusions," in Proc. Fourth Int. Symp. Recent Adv. Intrusion Detection, W. Lee, L. Me, and A. Wespi, Eds., Berlin, Germany, 2001, pp. 117–133.

[16] J. Kim and P. J. Bentley, "An evaluation of negative selection in an artificial immune system for network intrusion detection," in Proc. Genetic Evolutionary Computation Conf., San Francisco, CA, 2001, pp. 1330–1337.

[17] S. T. Wierzchon, "Generating optimal repertoire of antibody strings in an artificial immune system," in Intelligent Information Systems, M. A. Klopotek and M. Michalewicz, Eds. Heidelberg, Germany; New York: Physica-Verlag, 2000, pp. 119–133.

[18] ——, "Discriminative power of the receptors activated by k-contiguous bits rule," J. Comput. Sci. Technol., vol. 1, no. 3, pp. 1–13, 2000.

[19] ——, "Deriving concise description of nonself patterns in an artificial immune system," in New Learning Paradigm in Soft Computing, S. T. Wierzchon, L. C. Jain, and J. Kacprzyk, Eds. Heidelberg, Germany; New York: Physica-Verlag, 2001, pp. 438–458.

[20] W. Lee and S. Stolfo, "Data mining approaches for intrusion detection," in Proc. 7th USENIX Security Symp., 1998.

[21] "An agent based architecture for a computer virus immune system," in Proc. GECCO Workshop Artificial Immune Syst., D. Dasgupta, Ed., 2000.

[22] G. B. Lamont, R. E. Marmelstein, and D. A. Van Veldhuizen, "A distributed architecture for a self-adaptive computer virus immune system," in New Ideas in Optimization. London, U.K.: McGraw-Hill, 1999, Advanced Topics in Computer Science Series, pp. 167–183.

[23] M. Burgess, H. Haugerud, and S. Straumsnes, Measuring System Normality I: Scales and Characteristics, 2000.

[24] C. Marceau, "Characterizing the behavior of a program using multiple-length N-grams," in Proc. New Security Paradigms Workshop, Cork, Ireland, 2000.

[25] M. J. Kearns and U. V. Vazirani, An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press, 1994.

[26] J. C. Schlimmer, Concept Acquisition Through Representational Adjustment. Irvine, CA: Univ. California Press, 1987.

[27] D. Angluin, "Learning regular sets from queries and counterexamples," Inform. Comput., vol. 75, pp. 87–106, 1987.

[28] T. Lane, Machine Learning Techniques for the Computer Security Domain of Anomaly Detection. West Lafayette, IN: Purdue Univ. Press, Aug. 2000.


[29] Lincoln Laboratories. (1999) DARPA Intrusion Detection Evaluation. [Online]. Available: http://www.ll.mit.edu/IST/ideval/index.html

[30] D. W. Bradley and A. M. Tyrrell, "The architecture for a hardware immune system," in The Third NASA/DoD Workshop on Evolvable Hardware, D. Keymeulen, A. Stoica, J. Lohn, and R. S. Zebulum, Eds. Long Beach, CA: IEEE Computer Society Press, July 12–14, 2001, pp. 193–200.

[31] ——, "A hardware immune system for benchmark state machine error detection," in Proc. Congr. Evolutionary Computation, Honolulu, HI, May 2002.

[32] ——, "Immunotronics: Novel finite state machine architectures with built-in self-test using self-nonself differentiation," IEEE Trans. Evol. Comput., vol. 6, pp. 227–238, June 2002.

[33] D. Dasgupta and F. Gonzalez, "An immunity-based technique to characterize intrusions in computer networks," IEEE Trans. Evol. Comput., vol. 6, June 2002.

[34] S. Sathyanath and F. Sahin, "Artificial immune systems approach to a real time color image classification problem," in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2001.

[35] D. Q. Naiman, "Statistical anomaly detection via HTTPD data analysis," Comput. Statist. Data Anal., to be published.

[36] K. Tan and R. Maxion, ""Why 6?" Defining the operational limits of stide, an anomaly-based intrusion detector," in Proc. IEEE Symp. Security Privacy. Oakland, CA: IEEE Press, 2002.

[37] J. W. Kappler, N. Roehm, and P. Marrack, "T-cell tolerance by clonal elimination in the thymus," Cell, vol. 49, pp. 273–280, Apr. 24, 1987.

[38] J. K. Percus, O. Percus, and A. S. Perelson, "Predicting the size of the antibody combining region from consideration of efficient self/nonself discrimination," in Proc. Nat. Acad. Sci., vol. 90, 1993, pp. 1691–1695.

[39] ——, "Probability of self-nonself discrimination," in Theoretical and Experimental Insights into Immunology, A. S. Perelson and G. Weisbuch, Eds. New York: Springer-Verlag, 1992.

[40] P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer, "Class-based n-gram models of natural language," in Proc. IBM Natural Language ITL, Paris, France, 1990.

[41] D. E. Denning, "An intrusion-detection model," IEEE Trans. Softw. Eng., vol. SE-13, no. 2, pp. 222–232, Feb. 1987.

[42] H. Teng, K. Chen, and S. Lu, "Adaptive real-time anomaly detection using inductively generated sequential patterns," in Proc. IEEE Symp. Research Computer Security Privacy, Los Alamitos, CA, 1990.

[43] H. S. Javitz and A. Valdes, "The SRI IDES statistical anomaly detector," in Proc. IEEE Symp. Research in Security and Privacy, 1991, pp. 316–326.

[44] H. S. Javitz, A. Valdes, T. F. Lunt, A. Tamaru, M. Tyson, and J. Lowrance, Next Generation Intrusion Detection Expert System (NIDES). Menlo Park, CA: SRI Int., 1993.

[45] P. Helman, G. Liepins, and W. Richards, "Foundations of intrusion detection," in Proc. Fifth Computer Security Foundations Workshop, Franconia, NH, 1992, pp. 114–120.

[46] M. Roesch. SNORT. [Online]. Available: http://www.snort.org/

[47] M. Damashek, "Gauging similarity with n-grams: Language-independent categorization of text," Science, vol. 267, pp. 843–848, 1995.

[48] A. Arning, R. Agrawal, and P. Raghavan, "A linear method for deviation detection in large databases," in Proc. 2nd Knowledge Discovery Data Mining, 1996, pp. 164–169.

[49] V. Barnett and T. Lewis, Outliers in Statistical Data. New York: Wiley, 1994.

[50] E. Eskin, "Anomaly detection over noisy data using learned probability distributions," in Proc. 17th Int. Conf. Machine Learning, San Francisco, CA, 2000, pp. 255–262.

[51] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning, vol. 2, no. 2, pp. 139–172, 1987.

[52] D. Freedman, R. Pisani, and R. Purves, Statistics. New York: W. W. Norton, 1978.

[53] H. Garcia-Molina, J. Ullman, and J. Widom, Database Systems: The Complete Book. Englewood Cliffs, NJ: Prentice-Hall, 2001.

[54] S. J. Hanson and M. Bauer, "Conceptual clustering, categorization, and polymorphy," Machine Learning, vol. 3, no. 4, pp. 343–372, 1989.

[55] P. Helman and G. Liepins, "Statistical foundations of audit trail analysis for the detection of computer misuse," IEEE Trans. Softw. Eng., vol. 19, pp. 886–901, Sept. 1993.

[56] P. Helman and J. Bhangoo, "A statistically based system for prioritizing information exploration under uncertainty," IEEE Trans. Syst., Man, Cybern., vol. 27, pp. 449–466, July 1997.

[57] P. Helman and R. Gore, "Prioritizing information for the discovery of phenomena," J. Intell. Inform. Syst., vol. 11, no. 2, pp. 99–138, Sept./Oct. 1998.

[58] D. Hoaglin, F. Mosteller, and J. Tukey, Understanding Robust and Exploratory Data Analysis. New York: Wiley, 1983.

[59] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.

[60] L. Kaufman and P. Rousseeuw, Finding Groups in Data. New York: Wiley, 1990.

[61] E. Knorr and R. Ng, "A unified notion of outliers: Properties and computation," in Proc. 3rd Knowledge Discovery Data Mining, 1997, pp. 219–222.

[62] ——, "Algorithms for mining distance-based outliers in large databases," in Proc. 24th VLDB, 1998, pp. 392–403.

[63] D. Maier, The Theory of Relational Databases. New York: Computer Science Press, 1983.

[64] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. 20th VLDB, 1994, pp. 144–155.

[65] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.

[66] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1974.

[67] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, 1997.

[68] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.

[69] C. Marceau. Characterizing the behavior of a program using multiple-length n-grams. Presented at Proc. New Security Paradigms Workshop. [Online]. Available: ftp://ftp.oracorp.com/documents/MultiLengthStrings.pdf

Fernando Esponda received the degree in computer engineering from the Instituto Tecnológico Autónomo de México (ITAM), Mexico City, in 1995 and will soon receive the master's degree from the Instituto de Investigaciones en Matemáticas Aplicadas y Sistemas (IIMAS-UNAM), Mexico City. He is currently pursuing the Ph.D. degree at the University of New Mexico, Albuquerque.

His research interests are in biologically inspired adaptive systems and machine learning.

Stephanie Forrest received the B.A. degree from St. John's College, Santa Fe, NM, and the M.S. and Ph.D. degrees from the University of Michigan, Ann Arbor.

She is currently a Professor of computer science at the University of New Mexico (UNM), Albuquerque, and a member of the residential faculty at the Santa Fe Institute. Before joining UNM, she was with Teknowledge Inc., Palo Alto, CA, and was a Director's Fellow at the Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM. Her research interests are in adaptive systems, including genetic algorithms, computational immunology, biological modeling, and computer security.

Paul Helman was born in Brooklyn, NY, in 1954. He received the B.A. degree from Dickinson College, Carlisle, PA, in 1976, the M.S. degree from Stanford University, Stanford, CA, in 1977, and the Ph.D. degree from the University of Michigan, Ann Arbor, in 1982.

Currently, he is a Professor of computer science at the University of New Mexico, Albuquerque. His research interests include bioinformatics, machine learning, data mining, database theory, and combinatorial optimization. He also has authored two computer science textbooks: (with R. Veroff) Walls and Mirrors: Intermediate Problem Solving and Data Structures (Menlo Park, CA: Benjamin Cummings, 1986) and The Science of Database Management (Burr Ridge, IL: Richard D. Irwin, 1994).