
Robust and Scalable String Pattern Matching for Deep Packet Inspection on Multicore Processors

Yi-Hua E. Yang and Viktor K. Prasanna, Fellow, IEEE

Abstract—Conventionally, dictionary-based string pattern matching (SPM) has been implemented as an Aho-Corasick deterministic finite automaton (AC-DFA). Due to its large memory footprint, a large-dictionary AC-DFA can experience poor cache performance when matching against inputs with a high match ratio on multicore processors. We propose a head-body finite automaton (HBFA), which implements SPM in two parts: a head DFA (H-DFA) and a body NFA (B-NFA). The H-DFA matches the dictionary up to a predefined prefix length in the same way as an AC-DFA, but with a much smaller memory footprint. The B-NFA extends the matching to full dictionary lengths in a compact variable-stride branch data structure, accelerated by single-instruction multiple-data (SIMD) operations. A branch grafting mechanism is proposed to opportunistically advance the state of the H-DFA with the matching progress in the B-NFA. Compared with a fully populated AC-DFA, our HBFA prototype has <1/5 construction time, requires <1/20 runtime memory, and achieves 3x to 8x throughput when matching real-life large dictionaries against inputs with high match ratios. The throughput scales up 27x to over 34 Gbps on a 32-core Intel Manycore Testing Lab machine based on the Intel Xeon X7560 processors.

Index Terms—String matching, DFA, NFA, variable-stride tree, multicore platform, SIMD, deep packet inspection


1 INTRODUCTION

Deep packet inspection (DPI) is a central component of network security systems, where the contents of the network traffic are continuously scanned for malicious attacks. Examples include network intrusion detection [2], virus scanning [1], and content filtering [4]. String pattern matching (SPM) is the most widely used pattern matching mechanism in DPI, matching a dictionary of strings against a stream of characters. Due to the explosive growth of network bandwidth and types of attacks, SPM has become a major performance bottleneck in DPI systems [9].

Architecturally, SPM solutions can be categorized into two main groups: 1) software programs on multicore platforms; and 2) hardware implementations on ASIC or FPGA. While software-based SPM may not be the most cost- or energy-efficient, it has the following critical advantages:

• Modular: Integrates more easily with other functions.

• Extensible: Performance can be improved substantially by upgrading the processor or memory.

• Portable: Can be compiled for various numbers of cores and even different processor families.

Most existing work on SPM focuses on optimizing the Aho-Corasick deterministic finite automaton (AC-DFA) [3], whose performance tends to rely heavily on the slow-improving main memory speed. As a result, an AC-DFA usually underutilizes the computation capability of modern multicore processors and exhibits SPM throughput that is highly dictionary- and input-dependent. In particular, when the dictionary is large and the input stream consists of higher ratios of dictionary substrings (the match ratio), the performance of AC-DFA can degrade significantly due to increased cache misses and memory pressure. Such behaviors expose performance-based denial-of-service (DoS) attack opportunities in applications that utilize the SPM solution [4].

The main contribution of this work is to propose, implement, and evaluate an alternative SPM architecture, the head-body finite automaton (HBFA), whose throughput scales both in the number of concurrent threads and against inputs with high match ratios. In summary, an HBFA is composed of two parts: a “head” DFA (H-DFA) and a “body” NFA (B-NFA). The H-DFA is an AC-DFA constructed from the set of compatible prefixes of the dictionary. The B-NFA stores the rest of the dictionary in an efficient variable-stride branch data structure. Compared with the original AC-DFA, HBFA takes less time to construct, requires a much smaller memory footprint, and offers higher and more attack-resilient throughput when matching large dictionaries. Other contributions include:

• We analyze the fundamental properties of SPM. Our analysis motivates the head-body partitioning.

• We define an algorithm to extract from any string dictionary a parametrized set of compatible prefixes, which allows for efficiently constructing the H-DFA.

• We design an algorithm for constructing a B-NFA in a cache-efficient branch data structure, accelerated with single-instruction multiple-data (SIMD) operations.

• We devise a branch grafting mechanism for efficient cooperation between the H-DFA and the B-NFA.

• The H-DFA, being itself a (smaller) AC-DFA, can be improved by other AC-DFA optimization techniques.

• Y.-H.E. Yang is with Xilinx, Inc., Bldg 4, 2100 Logic Drive, San Jose, CA 95124. E-mail: [email protected].
• V.K. Prasanna is with the University of Southern California, 3740 McClintock Ave, EEB 200C, Los Angeles, CA 90089-2562. E-mail: [email protected].

Manuscript received 20 Sept. 2011; revised 3 July 2012; accepted 6 July 2012; published online 26 July 2012. Recommended for acceptance by C. Pinotti. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-2011-09-0680. Digital Object Identifier no. 10.1109/TPDS.2012.217.

1045-9219/13/$31.00 © 2013 IEEE. Published by the IEEE Computer Society.

Section 2 gives the background of AC-DFA and establishes the theoretical foundation of HBFA. Sections 3 and 4 describe the architecture and runtime algorithm of HBFA. Section 5 evaluates and compares the performance of HBFA. Section 6 concludes the paper.

2 BACKGROUND

2.1 Problems with AC-DFA-Based SPM

Matching high-bandwidth byte streams against a large dictionary can be both compute and memory intensive. In theory, the Aho-Corasick algorithm (AC-alg) [3] can be used to construct a deterministic finite automaton (AC-DFA), which solves SPM with minimum asymptotic space and time complexities [3]. (See Appendix I-A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.217, for a more detailed discussion of AC-DFA.) In practice, the state transition table (STT) of an AC-DFA can become excessively large and result in poor cache efficiency.

Assume an 8-bit alphabet and 32 bits to encode a state number. Each state then has 2^8 = 256 outgoing transitions and occupies 256 × 4 bytes = 1 KB of memory. Large-scale SPM can require an STT with hundreds of thousands of states, spanning hundreds of megabytes of memory. In contrast, state-of-the-art multicore processors usually have ~10 MB of on-chip cache memory. As a result, an input stream consisting of a minor portion of the dictionary can induce severe cache thrashing and dramatically reduce the SPM throughput. We use the match ratio to quantify the amount of dictionary substrings in the input.
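To make the access pattern concrete, the following minimal sketch shows the per-byte STT lookup loop. The type and function names are illustrative, and the one-state table in main merely exercises the loop; a real STT has hundreds of thousands of rows.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Fully populated STT: one 256-entry row of 32-bit next-state numbers per state,
// i.e., 1 KB per state. Illustrative layout, not the paper's implementation.
struct AcDfa {
    std::vector<uint32_t> stt;     // |Q| * 256 entries
    std::vector<uint8_t>  isMatch; // 1 if the state is a match state
};

size_t scan(const AcDfa& dfa, const uint8_t* in, size_t len) {
    uint32_t q = 0;                // initial state
    size_t matches = 0;
    for (size_t i = 0; i < len; ++i) {
        // one dependent table load per input byte; with an STT spanning hundreds
        // of MB, most of these loads miss the on-chip caches at high match ratios
        q = dfa.stt[(size_t)q * 256 + in[i]];
        matches += dfa.isMatch[q];
    }
    return matches;
}

int main() {
    AcDfa dfa{std::vector<uint32_t>(256, 0), std::vector<uint8_t>(1, 0)}; // trivial one-state DFA
    const uint8_t input[] = {'a', 'b', 'c'};
    std::printf("%zu\n", scan(dfa, input, sizeof input));
}

Each iteration issues one dependent load into the STT; once the STT greatly exceeds the on-chip caches, throughput is bounded by memory latency rather than by computation.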

Definition 1. The match ratio of a stream of input characters with respect to a given dictionary is the percentage of the input characters matching various significant prefix substrings of the dictionary. A prefix substring is significant if it covers a significant part (e.g., >80 percent) of the full string. The prefixes are selected randomly to maximize the coverage of the dictionary within a fixed input size.

Note that an input with a high match ratio may not contain any match to the full strings in the dictionary. By embedding the input packets with a large number of random prefix substrings from the dictionary, it is possible to launch a performance-based attack on the intrusion detection system (IDS) without actively triggering its defense mechanism.

Fig. 1 shows the throughput achieved by running a fully populated AC-DFA of the Snort [2] (June 2009) dictionary against inputs with various match ratios. A similar trend of decreasing throughput versus match ratio is observed in all large-scale dictionaries that we tested.

2.2 Tree Structure of SPM

An important feature of SPM is its dictionary (lexical) tree structure. A core metric of a tree is the minimum out-degrees of the tree nodes. A node is said to have a minimum out-degree d if it has at least d children. Therefore, a node with minimum out-degree d1 can require larger storage space and result in higher traversal complexity than a node with minimum out-degree d2 if d1 > d2.

To better understand the tree structure of SPM, we analyzed the minimum out-degrees of several dictionary trees built from real-world DPI dictionaries (see Table 2 in Section 5). We found a “double power law” node distribution in all the dictionary trees that we analyzed. Fig. 2 shows the node distribution using the Snort dictionary as an example. Specifically, the fraction of nodes at a fixed tree level drops roughly cubically with increasing minimum out-degree (Fig. 2a), while the fraction of nodes with a fixed minimum out-degree drops roughly quadratically with increasing tree depth (Fig. 2b).
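The raw data behind such plots can be collected with a straightforward trie walk. The sketch below uses a toy dictionary and illustrative names; the minimum out-degree fractions plotted in Fig. 2 are cumulative sums over this per-level, per-out-degree histogram.

#include <array>
#include <cstddef>
#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Build a dictionary (lexical) tree and count, per tree level, how many nodes have
// each out-degree.
struct TrieNode { std::array<std::unique_ptr<TrieNode>, 256> next{}; };

void countOutDegrees(const TrieNode& n, int level,
                     std::map<std::pair<int, int>, std::size_t>& hist) {
    int deg = 0;
    for (const auto& c : n.next) if (c) ++deg;
    ++hist[std::make_pair(level, deg)];
    for (const auto& c : n.next) if (c) countOutDegrees(*c, level + 1, hist);
}

int main() {
    std::vector<std::string> dict = {"attack", "attach", "virus", "viral"}; // toy dictionary
    TrieNode root;
    for (const std::string& s : dict) {
        TrieNode* n = &root;
        for (unsigned char ch : s) {
            if (!n->next[ch]) n->next[ch] = std::make_unique<TrieNode>();
            n = n->next[ch].get();
        }
    }
    std::map<std::pair<int, int>, std::size_t> hist;
    countOutDegrees(root, 0, hist);
    for (const auto& kv : hist)
        std::printf("level %d, out-degree %d: %zu nodes\n",
                    kv.first.first, kv.first.second, kv.second);
}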

The double power-law distribution suggests that, although most nodes in the dictionary tree above a certain depth (e.g., level = 6) can fit into a fixed node size (e.g., out-degree = 4), the number of nodes that do not fit can still be significant. There is no “clean” threshold (in either node size or tree depth) above which the fraction of exception nodes (which do not fit into the fixed size) can be statistically ignored.

With AC-DFA-based SPM, the dictionary tree structure is often discarded in favor of a general DFA structure. Due to the loss of each state’s depth information, a single (uniform) data structure is often used to store the entire AC-DFA. If a large memory size per state is used, then much memory is wasted on the many small states; if a small memory size per state is used, then much design complexity and computation overhead is needed to handle the large number of large states. Both result in inefficient SPM implementations.

2.3 Toward Efficient Finite Automaton for SPM

Based on the analysis above, we propose an efficient finite automaton for SPM with the following characteristics:

1. The lowest few dictionary tree levels are visited most frequently; they are designed with simple access.

2. A variable data structure, optimized for widely variable state sizes, is used to store the high-level states.

3. The higher tree levels are compactly stored, together with the AC-DFA failure function, in a tree structure.


Fig. 1. Throughput versus input match ratio of AC-DFA for Snort on two 2.6 GHz AMD Opteron processors (“Shanghai”).

Item #1 is based on the fact that the lowest dictionary tree levels consist of states with the highest out-degrees and the largest number of incoming (cross) transitions [7]. Item #2 is a direct consequence of Section 2.2. Item #3 aims to strike a balance between memory and compute efficiency: the dictionary tree constitutes the minimum amount of information required for the SPM, whereas the AC-alg failure function allows efficient matching of the input against the entire dictionary.

The three characteristics above lead us to the head-body partitioning and the design of HBFA for SPM, as detailed in the next section.

3 CONSTRUCTION OF HBFA

3.1 Overall Procedure

The basic idea behind HBFA is to partition the dictionary into two parts: a “head” and a “body.” The “head” part (H-DFA) is built as an AC-DFA from a set of dictionary prefixes. The “body” part (B-NFA) stores the rest of the dictionary in an efficient NFA structure.

The HBFA construction can be described as follows:

1. A set of compatible prefixes of the dictionary is identified and used to build the H-DFA; each such prefix produces a body root in the H-DFA to interface with the B-NFA (Section 3.2).

2. The B-NFA, constructed from the remaining dictionary suffixes, is organized as a forest of variable-stride body trees, each headed by a body root (Section 3.3).

3. The body trees are accessed and traversed in a compact branch data structure (Section 3.4).

4. To reduce the SPM overhead induced by body-to-head backtracking, each branch in the B-NFA is associated with a branch grafting function, which transitions to an H-DFA state with minimum backtracking when forward traversal in the B-NFA fails (Section 3.5).

3.2 Construction of Head DFA

Let S be the dictionary (set of strings) to match. A head DFA (H-DFA) for S is an AC-DFA constructed from a set of compatible prefixes of S. We say P is a set of compatible prefixes of S if 1) every s ∈ S has a prefix in P, and 2) any p1 ∈ P which is a proper infix¹ of some p2 ∈ P must also be a full string in S.

Definition 2. Given dictionary S, the following two properties define a set of compatible prefixes P of S:

1. ∀s ∈ S : ∃p ∈ P, l ≤ length(s) such that p = prefix(s, l);
2. ∀p1, p2 ∈ P : p1 = infix(p2, l1) → p1 ∈ S,

where prefix(t, i) and infix(t, i) denote the length-i prefix and infix of string t, respectively.

Lemma 1. A simple way to obtain a set of compatible prefixes P of dictionary S is as follows. Define some length lH. For every s ∈ S with length(s) > lH, add prefix(s, lH) to P; for every s ∈ S with length(s) ≤ lH, add s itself to P. The resulting P is a set of compatible prefixes of S.

Proof. Every string s ∈ S has either its (proper) prefix p of length lH or s itself added to P, satisfying Definition 2, item #1. Suppose p1, p2 ∈ P and length(p1) < length(p2). Since length(p2) ≤ lH and length(p1) < lH, p1 must be a string in S itself, satisfying Definition 2, item #2. ∎
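In code, the Lemma 1 construction is a one-pass filter over the dictionary. The sketch below is illustrative (the function name and the choice of std::set are ours); lH is the single “prefix length” parameter referred to again in Section 4.1.

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Lemma 1: strings longer than lH contribute their length-lH prefix; shorter strings
// are kept whole. The result is a set of compatible prefixes of the dictionary.
std::set<std::string> compatiblePrefixes(const std::vector<std::string>& dict, std::size_t lH) {
    std::set<std::string> P;
    for (const std::string& s : dict) {
        if (s.size() > lH) P.insert(s.substr(0, lH));
        else               P.insert(s);
    }
    return P;
}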

Given a set of compatible prefixes P of dictionary S, an H-DFA can be constructed by first applying AC-alg to P to build an AC-DFA, and then, for every p ∈ P which is a proper prefix of some s ∈ S, marking the highest-level state constructed from p as a body root. As we shall see in Lemma 2 below, the set of compatible prefixes ensures that, given any input, at most one body root can become “active” in the H-DFA at any input character. This minimizes the overhead in initiating B-NFA instances from the H-DFA; it also allows efficient cooperation between the H-DFA and the B-NFA.

Definition 3. Let S be a dictionary and P a set of compatible prefixes of S. The H-DFA H = (Q_H, q_0, Σ, δ_H, M_H, Q_R) is an AC-DFA constructed by running AC-alg on P:

1. Q_H is the set of AC-DFA states.
2. q_0 ∈ Q_H is the initial state.
3. Σ is the alphabet used to compose S.
4. δ_H : Q_H × Σ → Q_H is the state transition function.


1. We use infix to mean any continuous substring within a longer string.

Fig. 2. “Double power-law” distribution of the Snort dictionary tree. In each plot, the marked series is fit by the power regression shown in the lower left corner.

5. M_H ⊆ Q_H is the set of match states. For every p ∈ P with p ∈ S, there exists m_p ∈ M_H such that the transition to m_p signals a match of p found in the input.

6. Q_R is the set of body roots,

   Q_R ≜ {q ∈ Q_H | g(q, Σ) = ∅ ∧ q ∉ M_H},

where g(q, Σ) is the set of all possible target states of the AC-DFA goto function at q.

The first five items above simply define an AC-DFA constructed from P. Item #6 defines the body roots as the highest-level states constructed from all p ∈ P which are proper prefixes of some s ∈ S.
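As a construction-side sketch (names are illustrative, and the goto/failure machinery of AC-alg is not shown), marking the body roots amounts to flagging the leaves of the P-trie that are not match states:

#include <array>
#include <memory>
#include <set>
#include <string>

// Trie over the compatible prefix set P. A node is a match state if its prefix is a
// full dictionary string; it becomes a body root (Definition 3, item #6) if it is a
// leaf of the P-trie that is not a match state, i.e., g(q, Σ) = ∅ and q ∉ M_H.
struct PNode {
    std::array<std::unique_ptr<PNode>, 256> next{};
    bool isMatch = false;
    bool isBodyRoot = false;
};

void insertPrefix(PNode& root, const std::string& p, const std::set<std::string>& dict) {
    PNode* n = &root;
    for (unsigned char c : p) {
        if (!n->next[c]) n->next[c] = std::make_unique<PNode>();
        n = n->next[c].get();
    }
    if (dict.count(p)) n->isMatch = true;   // p is itself a dictionary string
}

void markBodyRoots(PNode& n) {
    bool hasChild = false;
    for (auto& c : n.next)
        if (c) { hasChild = true; markBodyRoots(*c); }
    if (!hasChild && !n.isMatch) n.isBodyRoot = true;
}

From each body root, the B-NFA construction of Section 3.3 continues with the corresponding dictionary suffixes.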

Lemma 2. The set of body roots Q_R as defined in Definition 3 satisfies the following property:

   ∀q ∈ Q_R, q' ∈ Q_H : f(q') ≠ q,

where f(·) is the AC-DFA failure function.

Proof. As discussed above, Definition 3, item #6 specifies that any body root q is the highest-level state constructed from some p ∈ P which is a proper prefix of some s ∈ S; since q ∉ M_H, p is not itself a string in S. By Definition 2, item #2, such a p cannot then be a proper infix of any other p' ∈ P. This implies that q cannot be the failure state of any q' constructed from p'. ∎

3.3 Construction of Body NFA

Definition 4. Let S be a dictionary; let Q, g, and M be the set of states, the goto function, and the set of match states of the AC-DFA of S; and let H = (Q_H, q_0, Σ, δ_H, M_H, Q_R) be an H-DFA of S. The corresponding B-NFA B = (Q_B, Q_R, Σ, g_B, M_B, γ_B) can be found by extending every body root in Q_R to the higher-level nodes in Q:

1. Q_B = (Q \ Q_H) ∪ Q_R is the set of body states.
2. Q_R is the same set of body roots as in H.
3. Σ is the alphabet used to compose S.
4. g_B : Q_B × Σ → Q_B \ Q_R is the body goto function; ∀q ∈ Q_B, c ∈ Σ : g_B(q, c) = g(q, c).
5. M_B = M \ M_H is the set of body match states.
6. γ_B : Q_B → Q ∪ {∅} is the grafting function.

Topologically, a B-NFA is a forest consisting of multiple body trees, each headed by a body root in Q_R. Every B-NFA state corresponds to a node in one body tree; the state can be reached from the body root through a series of calls to the body goto function g_B.

As its name implies, the B-NFA is nondeterministic. This is a consequence of the grafting function γ_B, which will be fully defined in Section 3.5.

3.4 Mapping B-NFA to Branches

We design the branch data structure to store multiple B-NFA states compactly together in memory. Every B-NFA branch β_σ^(h) is defined by a stemming state σ, a branch height h, and the body goto function g_B.

Definition 5. Let B = (Q_B, Q_R, Σ, g_B, M_B, γ_B) be a B-NFA from Definition 4:

1. A branch β_σ^(h) of B with stemming state σ ∈ Q_B and height h is defined as the set of B-NFA states

   β_σ^(h) ≜ ⋃_{i=1..h} g_B^(i)(σ, Σ),

   where g_B^(1)(q, Σ) = {q' | ∃c ∈ Σ : q' = g_B(q, c)}, and

   g_B^(i+1)(q, Σ) = ⋃_{q' ∈ g_B^(i)(q, Σ)} g_B^(1)(q', Σ).

   We say σ stems β_σ^(h), written σ ⊢ β_σ^(h).

2. The fan-outs φ_σ^(h) of β_σ^(h) are defined as

   φ_σ^(h) ≜ {q ∈ β_σ^(h) | q ∈ M_B ∨ g_B^(1)(q, Σ) ∩ β_σ^(h) = ∅}.

3. The width of β_σ^(h) is defined as |φ_σ^(h)|.

For simpler notation, we sometimes omit the superscript or subscript of β_σ^(h) when the omission causes no ambiguity.

Informally, a B-NFA branch stemming from σ with height h consists of the states on the paths of all possible stride-h goto transitions starting from σ.

Claim 1. The following statements on branches and stemming states are true for any B-NFA:

• The body roots are the lowest-level stemming states.

• The stemming state of any branch β is either a body root, or a fan-out of the parent branch of β.

• A branch fan-out is either a match state in the B-NFA, or the stemming state of a child branch, or both.

• Every branch can be uniquely identified by its stemming state, although technically σ ∉ β_σ.

Due to the underlying body tree structure, the width w of a branch increases monotonically with respect to h; the rate of increase depends on the number of fan-outs of the B-NFA states. Given a fixed storage size k per branch, a body tree can be packed into various branches by finding the maximum h and w from each stemming state such that h × w ≤ k, starting with the body roots as the initial set of stemming states. Such a mapping transforms the body tree from a tree of variable-size nodes into a tree of fixed-size but variable-stride branches. As discussed in Section 2.2, this transformation allows the B-NFA to achieve excellent memory efficiency for a wide range of dictionaries used by DPI.
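The packing rule can be sketched as a greedy search over the branch height. The code below is illustrative only: it works on a toy tree representation, computes the width exactly by walking the candidate branch, and ignores the restriction to the predefined branch types of Table 1.

#include <cstddef>
#include <vector>

// Toy body-tree node: child indices into a node pool, plus a match flag.
struct TNode {
    std::vector<int> child;
    bool isMatch = false;
};

// Width of the branch stemmed by `stem` with height h: the number of fan-out states,
// i.e., branch states that are match states or have no goto target inside the branch.
std::size_t branchWidth(const std::vector<TNode>& tree, int stem, int h) {
    std::size_t width = 0;
    std::vector<int> level{stem};
    for (int d = 1; d <= h && !level.empty(); ++d) {
        std::vector<int> next;
        for (int u : level)
            for (int v : tree[u].child) {
                bool lastInBranch = (d == h) || tree[v].child.empty();
                if (tree[v].isMatch || lastInBranch) ++width;
                next.push_back(v);
            }
        level.swap(next);
    }
    return width;
}

// Greedy packing per Section 3.4: grow the branch height from a stemming state while
// height x width stays within the per-branch label budget k (e.g., 32 label bytes).
int packBranchHeight(const std::vector<TNode>& tree, int stem, std::size_t k) {
    int h = 1;
    while (true) {
        std::size_t w = branchWidth(tree, stem, h + 1);
        if (w == 0 || (std::size_t)(h + 1) * w > k) break;
        ++h;
    }
    return h;
}

A production packer would also cap h at the subtree depth and round h × w up to the nearest supported type (for example, 4x8 or 8x4).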

3.5 B-NFA Grafting Function

We can now formally define the B-NFA grafting function first introduced in Definition 4.

Definition 6. Let B = (Q_B, Q_R, Σ, g_B, M_B, γ_B) be a B-NFA, Q_H be the corresponding set of H-DFA states, Q = Q_H ∪ Q_B, and f be the corresponding AC-DFA failure function. The grafting function γ_B : Q_B → Q ∪ {∅} of B is defined as follows:

1. Map B into branches as described in Definition 5.
2. ∀q ∈ Q_B, with q' = f(q):

   γ_B(q) = q'   if ∃β, β' : q ⊢ β ∧ (q' ∈ Q_H ∨ q' ⊢ β'),
   γ_B(q) = ∅    otherwise.     (1)

We say state q grafts to either state q' or ∅, respectively.


According to (1), the B-NFA grafting function is essentially the AC-DFA failure function, with the exception of mapping some states to ∅ (nothing). Specifically, γ_B(q) = f(q) only if both q and f(q) are stemming states in the B-NFA, or q is a stemming state and f(q) belongs to the H-DFA. A nonstemming state in the B-NFA is never a valid source or target state of the grafting function.

Because each stemming state uniquely identifies a branch in the B-NFA (see Claim 1 in Section 3.4), the function γ_B can be regarded as a branch grafting function. Suppose q' = γ_B(q); if q ⊢ β and q' ⊢ β', then γ_B essentially grafts β to β'; if q ⊢ β but q' ∈ Q_H, then γ_B grafts β back to the H-DFA. In both cases, γ_B describes an ε-transition from every B-NFA branch to a (lower-level) branch or an H-DFA state where an alternative point in the finite automaton (either H-DFA or B-NFA) can be reached by the input.
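At construction time, (1) reduces to a lookup of the failure function guarded by two membership tests. A minimal sketch, with illustrative names and -1 standing for the empty target ∅:

#include <unordered_map>
#include <unordered_set>

// Eq. (1) as a construction step: the grafting target of a B-NFA state q is its AC-DFA
// failure state f(q), but only when q stems a branch and f(q) is either an H-DFA state
// or itself a stemming state; otherwise the target is empty (-1).
std::unordered_map<int, int> buildGraftingFunction(
        const std::unordered_map<int, int>& fail,      // failure function f over the body states
        const std::unordered_set<int>& stemming,       // stemming states of all branches
        const std::unordered_set<int>& headStates) {   // Q_H
    std::unordered_map<int, int> graft;
    for (const auto& entry : fail) {
        int q = entry.first, fq = entry.second;
        bool usable = stemming.count(q) != 0 &&
                      (headStates.count(fq) != 0 || stemming.count(fq) != 0);
        graft[q] = usable ? fq : -1;
    }
    return graft;
}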

4 RUNTIME PROCESSING OF HBFA

In this section, we explain the architecture and runtime processing of the proposed HBFA. An example showing the matching progress of a small HBFA is given in Appendix II, available in the online supplemental material. For convenience of presentation, we restate the following definitions from Section 3:

• Original AC-DFA: D = (Q, q_0, Σ, δ, M, g, f)
• H-DFA: H = (Q_H, q_0, Σ, δ_H, M_H, Q_R)
• B-NFA: B = (Q_B, Q_R, Σ, g_B, M_B, γ_B)

The runtime processing of HBFA can be outlined in the three steps below, each explained in a section:

1. The input characters are matched in the H-DFA until a body root is reached, where a B-NFA instance is initiated to match the following input (Section 4.1).

2. The input characters are compared with the path labels at each activated branch in the B-NFA. If the path comparison leads to a fan-out state, then the child branch stemming from the fan-out state becomes the next active branch (Section 4.2).

3. When no fan-out state in the active branch can be reached, the B-NFA either grafts to another B-NFA branch, or back to an H-DFA state (Section 4.3).

Any match state visited (in either the H-DFA or the B-NFA) during the process above is reported as a match output.

4.1 Architecture of H-DFA

As discussed in Section 3.2, given a dictionary S, an H-DFA is essentially the AC-DFA constructed from a set of compatible prefixes of S. In our implementation, we use the simple method described in Lemma 1 (Section 3.2) to find the set of compatible prefixes, whose size can be configured by a single “prefix length” parameter.

To access the δ_H transitions, we assume a byte-oriented input alphabet, |Σ| = 2^8 = 256, and a fully populated STT. There are |Q_H| rows and |Σ| columns in the STT, one row for each q ∈ Q_H and one column for each c ∈ Σ. Every entry in the STT stores the next state number given the current state number (row index) and the next input character value (column offset). To encode up to 2^16 = 65,536 states in a moderately large H-DFA, two bytes (16 bits) are needed for each entry. The size of the STT is therefore |Q_H| × 512 bytes.

To easily identify the type of the H-DFA states, we organize the state numbers into three ranges:

1. Internal state numbers: 0 to (n_1 - 1)
2. Match state numbers: n_1 to n'_1
3. Body root numbers: n_2 to n'_2

State numbers in range 1 encode those H-DFA states that are neither match states nor body roots (Q_H \ (M_H ∪ Q_R)). State numbers in range 2 and range 3 encode the match states (M_H) and body roots (Q_R), respectively, in the H-DFA. In general, n_1 < n_2 ≤ n'_1 < n'_2; numbers between n_2 and n'_1 represent states that are both match states and body roots. There are no more than n'_2 states in the H-DFA.

The runtime algorithm of the H-DFA is similar to that of an AC-DFA. Specifically, at any time index t, the current state q ∈ Q_H and the next input character c_{t+1} are used to access the STT for the next state q' = δ_H(q, c_{t+1}). If q ∈ M_H is reached at time t, then the state position (q, t) is reported as a match output from the H-DFA. It is important to note that the “time index” t here is defined as the current byte (character) offset in the input: a state position represents a “snapshot” of the H-DFA matching up to the input byte offset. In particular, t does not map to either the wall time or the processor cycle time.

Definition 7. A state position (q, t) of a finite automaton with respect to input stream [c_0, c_1, ...] is reached when the state q of the finite automaton is reached by consuming the input characters [c_0, ..., c_t].

Different from an AC-DFA, if some q ∈ Q_R is reached in the H-DFA at time t, then the state position (q, t) is used to initiate a B-NFA instance to continue the input matching process; meanwhile, the H-DFA makes an ε-transition to (f(q), t) and becomes temporarily “frozen.” At some later time, the B-NFA instance can transfer the string matching process back to the H-DFA, either via the grafting function to some “future” state position (q', t'), t' > t, or simply to (f(q), t) by default. The branch grafting process is explained in Section 4.3.

Due to the additional “body root” state type and the overlap between the match states and body roots, it takes up to three compare instructions to identify a state in the H-DFA, compared with just one compare to find the match status in an ordinary AC-DFA. This overhead makes the H-DFA slightly slower than the best-case AC-DFA, although the few extra cycles can be easily offset by the better cache performance of the H-DFA on a modern processor.
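The inner loop with the three-range state test can be sketched as follows. The names are illustrative; the hand-off to the B-NFA instance of Section 4.2 is reduced to a callback, and the freezing of the H-DFA at (f(q), t) and the grafting-point bookkeeping are omitted.

#include <cstddef>
#include <cstdint>
#include <vector>

// H-DFA with a fully populated 16-bit STT (|Q_H| rows x 256 columns) and state numbers
// laid out in three ranges: [0, n1) internal, [n1, n1p] match, [n2, n2p] body root,
// with n1 < n2 <= n1p < n2p, so states in [n2, n1p] are both match states and body roots.
struct Hdfa {
    std::vector<uint16_t> stt;   // stt[q * 256 + c] = next state
    uint16_t n1, n1p, n2, n2p;
};

inline uint16_t step(const Hdfa& h, uint16_t q, uint8_t c) { return h.stt[(std::size_t)q * 256 + c]; }
inline bool isMatch(const Hdfa& h, uint16_t q)    { return q >= h.n1 && q <= h.n1p; }
inline bool isBodyRoot(const Hdfa& h, uint16_t q) { return q >= h.n2 && q <= h.n2p; }

std::size_t runHdfa(const Hdfa& h, const uint8_t* in, std::size_t len,
                    std::size_t (*startBodyNfa)(uint16_t root, std::size_t t)) {
    std::size_t matches = 0;
    uint16_t q = 0;
    for (std::size_t t = 0; t < len; ++t) {
        q = step(h, q, in[t]);
        if (isMatch(h, q)) ++matches;                    // state position (q, t) is a match
        if (isBodyRoot(h, q) && startBodyNfa != nullptr)
            t = startBodyNfa(q, t);                      // B-NFA consumes input from t + 1
    }
    return matches;
}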

4.2 Architecture of B-NFA

As discussed in Sections 3.3 and 3.4, a B-NFA branch consists of multiple body tree nodes packed in a variable-stride data structure. We design a compact and SIMD-friendly architecture to access every branch in 64 bytes of memory, matching the cache line size of most modern processors.

Table 1 describes the data structure of a B-NFA branch in detail. In general, a branch of size k can have any height h and width w as long as h × w ≤ k (see Section 3.4). This allows the branch to store up to w different goto transition paths from the branch’s stemming state, each path with a stride of up to h characters. For simplicity and efficient processing, we assign to every B-NFA branch a “type” with predefined h × w, as shown in the description below Table 1. For example, a branch of type “4x8” can store up to eight goto transition paths, each with a stride of up to four characters. The branch height is also stored in the height field for easy access. The nfouts field stores the number of full-length paths, which lead to the fan-out states of the branch. All fan-out states of any branch are numbered consecutively, with the first (lowest) fan-out state number stored in the next_state field.

Up to 32 bytes of path labels, derived from the goto transitions between states within the branch, are stored in the two path_labels fields; the exact arrangement of the path labels depends on the branch type. The input_msk is used to mask the bytes read from the input stream when comparing them with a path of stride less than the branch height. For example, a branch of type “4x8” has the two path_labels fields logically partitioned into eight chunks of 4 bytes each, while some 4-byte chunk may actually store a path only 3 (or fewer) bytes long. The mat_msk masks the comparison result for match outputs; a nonzero bit in mat_msk indicates that the corresponding byte in path_labels is on a path that comprises a dictionary match.
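One possible realization of this record as a C++ struct is shown below. The field list follows Table 1, but the individual field widths are our assumption; the paper fixes only the fields and the overall 64-byte, cache-line-sized record.

#include <cstdint>

// Hypothetical 64-byte layout of the branch record in Table 1.
struct BnfaBranch {
    uint8_t  type;            // one of {1x256, 1x32, ..., 32x1}, encoded as an enum
    uint8_t  height;          // maximum path stride h: 1, 2, 4, 8, 16, or 32
    uint8_t  nfouts;          // number of full-length paths / valid fan-out states
    uint8_t  pad;
    uint32_t next_state;      // state number of the first fan-out with a child branch
    int32_t  graft_state;     // grafting target state number, -1 for the empty target
    uint32_t mat_msk;         // 1 bit per path_labels byte marking dictionary matches
    uint8_t  input_msk[16];   // mask applied to the replicated input vector
    uint8_t  path_labels[32]; // up to 32 bytes of goto path labels
};
static_assert(sizeof(BnfaBranch) == 64, "branch record should fill one cache line");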

The input stream can be compared with all paths stored in the path_labels field(s) in parallel using SIMD instructions.² To do so, the B-NFA first reads h input bytes from the input stream and replicates them w times to form an input vector of length h × w (both h and w are known from the branch type). The resulting input vector is either 16 or 32 bytes long, matching the length of one or two path_labels fields, respectively. A number of SIMD instructions are then performed on the input vector and the path_labels field(s) in the following steps:

1. Mask the input vector by input_msk and compare it with the path_labels field(s).

2. Find the fan-out state, if any, that can be reached by matching the input bytes to the path labels.

3. Extract any branch path that comprises a dictionary match and has path labels matching the input bytes.

There can be at most one fan-out state found in step 2 above, which causes the B-NFA to traverse to the branch stemming from that fan-out state. On the other hand, two paths can both have path labels matching the input bytes if one is a prefix of the other. All the branch paths extracted in step 3 above are recorded and reported as match outputs at the branch. Due to the body tree structure, each B-NFA branch can have only one previous (parent) branch. A corollary to this property is that all the fan-out states of any branch can be numbered sequentially, allowing the respective child branches to be placed consecutively in memory. The address of any child branch can be found simply by taking a nonnegative offset from the state number of the branch’s first fan-out state (i.e., the next_state field).
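For one 16-byte half of a hypothetical “4x8” branch, steps 1 and 2 map onto a handful of SSE2 intrinsics. The sketch below assumes that label bytes beyond a short path are stored as zero and masked off in input_msk, so that padded positions always compare equal; the function name and calling convention are illustrative.

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// Compare the next 4 input bytes against 4 stored paths of stride <= 4.
// Returns a bitmask with bit i set if path i fully matches the (masked) input bytes.
unsigned comparePaths4x4(const uint8_t* input,          // at least 4 readable bytes
                         const uint8_t pathLabels[16],
                         const uint8_t inputMsk[16]) {
    uint32_t in4;
    std::memcpy(&in4, input, 4);                         // next 4 input bytes
    __m128i vin  = _mm_set1_epi32((int)in4);             // replicate to 4 chunks (w = 4 here)
    __m128i vlab = _mm_loadu_si128((const __m128i*)pathLabels);
    __m128i vmsk = _mm_loadu_si128((const __m128i*)inputMsk);
    __m128i veq  = _mm_cmpeq_epi8(_mm_and_si128(vin, vmsk), vlab);
    unsigned eq  = (unsigned)_mm_movemask_epi8(veq);     // one bit per byte
    unsigned result = 0;
    for (int i = 0; i < 4; ++i)
        if (((eq >> (4 * i)) & 0xFu) == 0xFu)            // all 4 bytes of chunk i equal
            result |= 1u << i;
    return result;
}

The same pattern extends to the second path_labels field and to the other branch types by changing the replication width and the chunk size; step 3 reuses the byte-compare mask, intersected with mat_msk, to extract the matching paths.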

On average, the processing in each branch consists of 25-35 instructions (SIMD or not), taking 20-30 clock cycles on a modern superscalar processor. While this is not much faster than a hash table lookup, we note that most branches have heights from 4 to 16 bytes, allowing the B-NFA to be traversed in multibyte strides very often and with good memory and cache efficiency.

4.3 Head-Body Cooperation through Branch Grafting

A B-NFA can work either independently or cooperatively with the H-DFA. Suppose a B-NFA instance is initiated at some state position (r, t), r ∈ Q_R. An independent B-NFA starts its matching process at branch β_r (stemming from r) and input byte offset t + 1. A forward branch transition is made when the following input bytes match the path labels of a full-length path in the branch. The B-NFA instance continues to make forward branch transitions until, at some branch, the input bytes do not match the path labels of any full-length path. Then the B-NFA instance terminates and the H-DFA is reactivated at (f(r), t), as discussed in Section 4.1.

A cooperative B-NFA makes forward branch transitions in the same way as an independent B-NFA. In addition, at every branch a cooperative B-NFA also updates a local “grafting point” variable, gp, which specifies the best alternative branch (state) position in the B-NFA (H-DFA) where the string matching process can continue when no more forward transitions can be made at the current branch:

1. When the B-NFA instance is initiated at state position (r, t), r ∈ Q_R, gp ≜ (f(r), t).

2. Suppose branch β_σ stemming from σ ∈ Q_B is reached after consuming input characters [c_{t+1}, ..., c_{t'}]; let γ_B be the grafting function:

   a. If γ_B(σ) = ∅, then gp is unchanged.
   b. If γ_B(σ) ≠ ∅, then gp ≜ (γ_B(σ), t').

   Note that either γ_B(σ) ∈ Q_B or γ_B(σ) ∈ Q_H \ Q_R.

3. Suppose (σ', t') is the latest gp, and no forward branch transition can be made in the B-NFA:

   a. If σ' ∈ Q_H \ Q_R, then the B-NFA instance terminates and the H-DFA is reactivated at (σ', t').
   b. If σ' ∈ Q_B, then the B-NFA transitions to branch β_σ' to match the input at offset t' + 1.

TABLE 1
Data Structure of a B-NFA Branch

type (b): One of the following: {1x256, 1x32, 1x16, 2x16, 2x8, 4x8, 4x4, 8x4, 8x2, 16x2, 16x1, 32x1}.
height (h): Maximum path stride in the branch: {1, 2, 4, 8, 16, 32}.
nfouts (n): Number of fan-out states leading to a valid child branch.
next_state: State number of the first fan-out leading to a valid child branch.
graft_state: State number of the grafting function target; -1 if ∅.
mat_msk: Bit-mask of the match states; one bit for each byte in the path_labels vector.
input_msk: Byte-mask of the valid input characters to be compared with the path_labels vector.
path_labels: Up to 32 bytes of path labels in the branch.

2. Our HBFA prototype was implemented on the x86-64 instruction set with the Streaming SIMD Extensions (SSEx); similar extensions can be found in other instruction sets as well.

TABLE 2Dictionary Statistics

By following the grafting function, the terminating B-NFA instance does not always need to “roll back” to the original state position (f(r), t) where the H-DFA was “frozen.” Instead, the grafting point (σ', t') is used either to transition the B-NFA to β_σ' (if σ' ∈ Q_B) or to reactivate the H-DFA at σ' (if σ' ∈ Q_H \ Q_R). Since in both cases t' > t, the overall string matching progress is advanced even when no forward path is found in the B-NFA.

Note that the independent B-NFA can be seen as a special case of the cooperative B-NFA in which all branches have grafting targets set to ∅; the grafting point, once initialized at (f(r), t), then remains (f(r), t) till the end of the B-NFA instance.
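The following toy rendering of the cooperative loop shows when gp is updated and what is handed back to the H-DFA. It simplifies heavily: each branch stores a single path, grafting targets point only into the H-DFA, and case 3b (resuming at another branch) is omitted; all names and the example data are illustrative.

#include <cstddef>
#include <cstdio>
#include <cstring>

// Toy branch: a single stored path per branch (real branches hold up to 32 states;
// see Table 1), a pointer to the child branch, and an H-DFA grafting target (-1 for ∅).
struct Branch {
    const char*   path;
    const Branch* child;
    int           graftState;
    bool          fanoutIsMatch;
};

struct HeadPos { int state; std::size_t offset; };

// gp starts at (f(r), t) and is advanced whenever a visited branch has a non-empty
// grafting target; on failure, matching resumes in the H-DFA at gp (case 3a only).
HeadPos runBodyNfa(const Branch* b, int failOfRoot, const char* in, std::size_t t, std::size_t len) {
    HeadPos gp{failOfRoot, t};
    std::size_t pos = t;
    while (b) {
        if (b->graftState >= 0) gp = HeadPos{b->graftState, pos};      // step 2b
        std::size_t stride = std::strlen(b->path);
        if (pos + stride > len || std::memcmp(in + pos, b->path, stride) != 0)
            break;                                                     // no forward transition
        pos += stride;
        if (b->fanoutIsMatch) std::printf("match ending at offset %zu\n", pos);
        b = b->child;
    }
    return gp;
}

int main() {
    Branch leaf{"rld", nullptr, 7, true};
    Branch root{"wo", &leaf, -1, false};
    const char* input = "helloworld";
    HeadPos hp = runBodyNfa(&root, 3, input, 5, std::strlen(input));
    std::printf("resume H-DFA at state %d, input offset %zu\n", hp.state, hp.offset);
}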

5 PERFORMANCE EVALUATION

5.1 Environments and Data Sets

5.1.1 Platform and Implementation

We implemented HBFA for x86-64 with the 128-bit SSE2/SSE3 extensions. The source is written in C/C++/SIMD assembly, compiled with GCC 4.2 and the “-O2” flag, and run on 64-bit Linux with kernel version 2.6. We measured the performance of our program on a dual-socket server with two AMD Opteron 2382 processors (quad-core “Shanghai” at 2.6 GHz) and 16 GB DDR2-667 main memory; we also measured scalability in the number of threads on a 32-core Intel Manycore Testing Lab machine based on Intel Xeon X7560 processors (8-core “Nehalem” at 2.26 GHz).³

We implemented three versions of the SPM finite automaton: a fully populated AC-DFA, an HBFA without head-body cooperation, and an HBFA with head-body cooperation. The same data structure is used by the two HBFA versions; their only difference is in whether the grafting point is updated and utilized in the B-NFA. For fair evaluation, all dictionary matches found by the finite automaton are stored in an output queue, which grows dynamically to accommodate a large number of matches. On the other hand, we did not apply any hashing or compression to the AC-DFA, for the following reasons:

1. Many hashing and compression techniques on AC-DFA improve the throughput performance only for specific types of dictionaries or input streams.

2. Any technique that could improve the performance of AC-DFA can also be applied to the H-DFA and improve the overall performance of HBFA.

A fair comparison of various STT hashing and compression techniques requires a detailed study of their performance characteristics, which is out of the scope of this paper.

5.1.2 Dictionaries

We use three dictionaries from popular intrusion detection (Snort) and virus scanning (ClamAV) applications. Table 2 shows the statistics of the dictionaries. Snort-0906 represents a dictionary with many relatively short strings, whereas ClamAV type-1 and type-3 represent dictionaries with long strings. All three dictionaries are “binary,” i.e., they utilize the full 8-bit alphabet.

5.1.3 Input Match Ratio

The input match ratio is controlled by embedding proper prefixes of the dictionary strings randomly into an “innocent” data stream. For the Snort dictionary, we use the plain text from an HTML-formatted King James Bible as the innocent data stream; for the ClamAV dictionaries, we use the concatenation of all files under /usr/sbin in a typical Linux server installation as the innocent data stream.

Our experiments show that the throughput of AC-DFA and HBFA against the innocent data streams is about the same as that against a “clean” plain-text network trace collected on a typical university workstation. This is expected, since the (clean) transfers on such a workstation consist mostly of files of the same nature as our innocent data streams. In contrast, matching a data stream of random 8-bit characters results in slightly (~5 percent) higher throughput with AC-DFA than with HBFA. When a larger or more complex dictionary is used, the matching throughput against the innocent data stream can drop significantly (see the comparison between Fig. 4 and Fig. 5), but the matching throughput against the random data stream remains high. We use the innocent data stream for performance evaluation because it better captures the realistic behaviors of DPI.

To construct data streams with different levels of attack activity, we embed various amounts of significant prefixes from each dictionary into the innocent data stream. The level of attack activity is described by the input match ratio, as defined in Definition 1 in Section 2.1. A 1 percent (0.01) match ratio could represent an occasional occurrence of an intrusion-like pattern in the network traffic. A 10 percent or higher match ratio could represent a scenario where 10 percent or more of the network packets are sent by the attacker to slow down the system. Since only proper prefixes of the dictionary strings are embedded in the data streams, no match output is generated and the DPI system detects no abnormality other than the lowered matching throughput.
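A sketch of this embedding procedure is shown below. The function name, the 80 percent prefix length, and the uniformly random insertion positions are our reading of Definition 1 rather than the exact generator used in the experiments.

#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Embed random significant prefixes (roughly 80 percent of each string) of the dictionary
// into an innocent stream until about `ratio` of the output bytes come from the dictionary.
std::string makeAttackStream(std::string innocent, const std::vector<std::string>& dict,
                             double ratio, unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pickStr(0, dict.size() - 1);
    std::size_t embedded = 0;
    while (embedded < ratio * innocent.size()) {
        const std::string& s = dict[pickStr(rng)];
        if (s.size() < 2) continue;                          // need a proper prefix
        std::string prefix = s.substr(0, s.size() * 4 / 5);  // significant, but not a full match
        std::uniform_int_distribution<std::size_t> pickPos(0, innocent.size());
        innocent.insert(pickPos(rng), prefix);
        embedded += prefix.size();
    }
    return innocent;
}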

5.1.4 Parallelism on Multicore

To utilize all available cores in the multicore platforms, our program runs multiple threads (up to the total number of processor cores in the system), each with one instance of HBFA or AC-DFA, matching the same dictionary. To generate meaningful and realistic results, each thread accesses its own input stream in a separate memory location. All threads share the same dictionary data structure, although each thread may (and usually will) access a different state due to the different input streams they process. The aggregated throughput over all parallel threads is used to measure the overall system performance.

Future multicore processors are expected to have an increasing number of cores or hardware threads. This parallel processing approach allows us to scale the number of software threads easily to the number of cores in a shared-memory multicore system. In some scenarios, it may be preferable to match different dictionaries on different cores in the system. We note that doing so would reduce the effectiveness of the on-chip shared cache and give the memory-efficient HBFA an even greater performance advantage over AC-DFA.

3. Intel “Nehalem” and later processors feature Hyper-Threading and Turbo Boost technologies, which can speed up both HBFA and AC-DFA. The speedup may be subject to additional variables such as L1D contention and core temperature. A fair assessment of these technologies is out of the scope of this paper.
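The threading model can be sketched as follows; Automaton stands for the shared, read-only HBFA or AC-DFA tables, and scan is a placeholder for the per-thread matching routine.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-ins for the shared, read-only automaton and its per-thread matching routine.
struct Automaton { /* shared STT, branch array, ... */ };
std::size_t scan(const Automaton&, const std::vector<uint8_t>& input) {
    return input.size();   // placeholder; a real scan returns the number of matches
}

// One software thread per input stream; all threads read the same dictionary structure.
void runParallel(const Automaton& shared, const std::vector<std::vector<uint8_t>>& inputs) {
    std::vector<std::thread> pool;
    std::vector<std::size_t> matches(inputs.size(), 0);
    for (std::size_t i = 0; i < inputs.size(); ++i)
        pool.emplace_back([&, i] { matches[i] = scan(shared, inputs[i]); });
    for (std::thread& t : pool) t.join();
    std::size_t total = 0;
    for (std::size_t m : matches) total += m;
    std::printf("aggregate: %zu matches over %zu threads\n", total, inputs.size());
}

int main() {
    Automaton a;
    std::vector<std::vector<uint8_t>> inputs(4, std::vector<uint8_t>(1 << 20, 'x'));
    runParallel(a, inputs);
}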

5.2 Memory Efficiency and Construction Time

For each dictionary, we constructed the AC-DFA as well as three HBFAs with various head sizes for performance evaluation. As shown in Table 3, HBFA effectively compacts every dictionary into a much smaller memory footprint than AC-DFA.

5.2.1 Memory Efficiency

State-of-the-art multicore processors can execute instructions much faster than they can access high-level caches and off-chip memory. It is advantageous to store the most frequently accessed part of the dictionary in the lowest-level cache private to the processing core. Thus, memory efficiency is critical to the throughput scalability (in both the number of cores and the dictionary size) of the SPM solution.

The memory footprint of an H-DFA is directly proportional to the number of states in the H-DFA. Because by design an H-DFA can have at most 65,536 states, the state numbers of an H-DFA occupy only 2 bytes each; this makes the H-DFA twice as memory efficient as an AC-DFA, whose state numbers occupy 4 bytes each. Note that any AC-DFA compression technique is by definition applicable to the H-DFA; in all cases, keeping the total number of DFA states below 64K always improves memory efficiency.

The B-NFA compacts up to 32 body tree nodes into a 64-byte branch, resulting in a memory footprint as small as 2 B/state (bytes per state). In practice, the memory footprints of the B-NFA vary from 2.4 B/state to 5.2 B/state for the dictionaries evaluated in this study.

5.2.2 Construction Time

The construction time is another important performance metric of SPM, especially for DPI. Most dictionaries of computer patterns (protocols, viruses, field values, etc.) are dynamic. An optimized SPM solution usually requires reconstruction for every dictionary update. A more efficient construction allows the SPM to sustain a higher update rate, reduce update downtime, and require fewer resources for the update engine.

As shown in Table 3, HBFA takes 1/3 to 1/10 the time of AC-DFA to construct. To fully populate an AC-DFA, all state transitions must be traversed at least once, a process which can be slow due to the irregular memory accesses over a relatively large memory space. The HBFA construction, in contrast, compacts the dictionary tree in a depth-first manner while constructing the B-NFA. Although potentially more operations are needed, the operations are performed with much higher spatial and temporal locality over a much smaller memory space than populating the AC-DFA.

5.3 Throughput

Figs. 3, 4, and 5 show the matching throughput of HBFA for the Snort, ClamAV type-1, and type-3 dictionaries, respectively, on the dual quad-core Opteron 2382 platform. For each dictionary, three head sizes (as in Table 3) are used for the HBFA construction to show the effect of head size on input resilience and matching throughput. With each head size, the “HBFA_coop” line plots the throughput of HBFA with branch grafting; the “HBFA_ind” line plots the throughput without branch grafting. The “AC-DFA” line in every subfigure also plots the (same) AC-DFA throughput.

Compared with AC-DFA, the matching throughput of HBFA is much more resilient to a high match ratio in the input stream. The head size of HBFA offers a tradeoff between input resilience and best-case throughput: a smaller H-DFA makes the HBFA more resilient to high match ratios, whereas a larger H-DFA slightly increases the best-case throughput at low match ratios. In practice, the head size that achieves the “optimal” tradeoff will be selected for implementation. For all three dictionaries examined, a good tradeoff is obtained when the H-DFA consists of approximately the lowest six levels of the AC-DFA.


Fig. 3. Throughput of HBFA for Snort with 6,204, 21,564 and 41,706 H-DFA states (three, six, and 10 levels).

TABLE 3Memory Efficiency and Construction Time

This corresponds to 21,564 H-DFA states for the Snort dictionary (Fig. 3, middle), 21,905 for ClamAV type-1 (Fig. 4, middle), and 6,347 for ClamAV type-3 (Fig. 5, left).

HBFAs with and without branch grafting have similar match ratio resilience. Branch grafting allows the B-NFA to work cooperatively with the H-DFA in advancing the matching progress. As a result, an HBFA with branch grafting always outperforms one without branch grafting when matching the same dictionary. The difference is slightly greater for the ClamAV dictionaries, where (on average) longer string suffixes are matched in the B-NFA.

Fig. 6 shows the throughput speedup of HBFA with branch grafting over AC-DFA for all three dictionaries. With a properly chosen head size, HBFA with branch grafting achieves 1.2x to 1.6x the throughput of AC-DFA when matching inputs with a 1 percent match ratio, 1.6x to 3x with a 4 percent match ratio, and 3x to 8x with a 16 percent match ratio. At zero match ratio, where AC-DFA runs optimally with few cache misses, the HBFA achieves over 99 percent of the throughput of AC-DFA for the Snort and ClamAV type-1 dictionaries, and 85 percent for the ClamAV type-3 dictionary. We notice that the dictionary tree constructed with the ClamAV type-3 dictionary has much fewer nodes in the lowest tree levels. Therefore, when the input has zero match ratio (hence containing no significant match to any dictionary string), the AC-DFA stays mostly within a small set of states near the root (initial) state, resulting in good cache performance. However, the throughput of AC-DFA drops quickly below HBFA with only a 1 percent match ratio, where the AC-DFA transitions occasionally to a state farther away from the root state.

Fig. 7 shows the aggregated throughput on the quad octo-core Xeon X7560 system, scaling from 1 to 28 threads⁴ while matching the largest dictionary (ClamAV type-1). Both HBFA and AC-DFA scale well for all input match ratios from 0.01 up to 0.16. This implies that the throughput degradation of AC-DFA at high input match ratios is not due to insufficient memory bandwidth but rather to long memory access latency. The good throughput scaling of all HBFA configurations also shows that the performance of HBFA is not sensitive to the (effective) on-chip cache size, since in the experiment system all 28 threads compete for the 24 MB on-chip (level-3) cache.


Fig. 5. Throughput of HBFA for ClamAV type-3, with 6,347, 22,859, and 41,276 H-DFA states (six, 14, and 22 levels).

Fig. 6. Speedup of HBFA with branch grafting over AC-DFA for the three dictionaries.

Fig. 4. Throughput of HBFA for ClamAV type-1, with 7,799, 21,905, and 41,403 H-DFA states (three, six, and 10 levels).

4. The system is accessed remotely and shared by a large community. We instantiated only 28 threads in the 32-core system to mitigate the effect of an occasional shared access that we do not control.

6 CONCLUSION AND FUTURE WORK

We proposed HBFA, a robust and scalable SPM solution. Unlike AC-DFA, HBFA does not suffer from significant performance degradation when processing inputs with high match ratios. For large dictionaries and a moderate (16 percent) match ratio, HBFA achieves 3x to 8x the matching throughput of an AC-DFA, while requiring <1/20 of the runtime memory and <1/5 of the construction time.

Currently, head-body partitioning is performed statically, where the head part is bounded by a predefined depth value. In general, the body roots in the H-DFA do not all need to have the same depth. One future direction is to design an algorithm that efficiently constructs an H-DFA with body roots at varying depths, for example, to include longer prefixes for “hot” dictionary strings. The B-NFA processing may be further accelerated by the upcoming AVX and XOP x86-64 extensions or by OpenCL/CUDA-capable GPGPUs.

ACKNOWLEDGMENTS

This work was supported by the US National Science Foundation (NSF) under grant CCF-1018801.

REFERENCES

[1] “Clam AntiVirus,” http://www.clamav.net/, 2013.
[2] “SNORT,” http://www.snort.org/, 2013.
[3] A.V. Aho and M.J. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[4] S. Antonatos, K.G. Anagnostakis, E.P. Markatos, and M. Polychronakis, “Performance Analysis of Content Matching Intrusion Detection Systems,” Proc. Int’l Symp. Applications and the Internet, Jan. 2004.
[5] S. Antonatos, K.G. Anagnostakis, and E.P. Markatos, “Generating Realistic Workloads for Network Intrusion Detection Systems,” SIGSOFT Software Eng. Notes, vol. 29, no. 1, pp. 207-215, 2004.
[6] M. Becchi, C. Wiseman, and P. Crowley, “Evaluating Regular Expression Matching Engines on Network and General Purpose Processors,” Proc. ACM/IEEE Symp. Architecture for Networking and Comm. Systems (ANCS ’09), 2009.
[7] W. Jiang, Y.-H.E. Yang, and V.K. Prasanna, “Scalable Multi-Pipeline Architecture for High Performance Multi-Pattern String Matching,” Proc. IEEE Int’l Parallel and Distributed Processing Symp. (IPDPS), Aug. 2010.
[8] D.P. Scarpazza, O. Villa, and F. Petrini, “High-Speed String Searching against Large Dictionaries on the Cell/B.E. Processor,” Proc. IEEE Int’l Parallel and Distributed Processing Symp., 2008.
[9] L. Tan and T. Sherwood, “Architectures for Bit-Split String Scanning in Intrusion Detection,” IEEE Micro, vol. 26, no. 1, pp. 110-118, Jan./Feb. 2006.
[10] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, “Deterministic Memory-Efficient String Matching Algorithms for Intrusion Detection,” Proc. IEEE INFOCOM, vol. 4, pp. 2628-2639, Mar. 2004.
[11] J. van Lunteren, “High-Performance Pattern-Matching for Intrusion Detection,” Proc. IEEE INFOCOM, 2006.
[12] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis, “Gnort: High Performance Network Intrusion Detection Using Graphics Processors,” Proc. 11th Int’l Symp. Recent Advances in Intrusion Detection, vol. 5230, pp. 116-134, 2008.
[13] G. Vasiliadis and S. Ioannidis, “GrAVity: A Massively Parallel Antivirus Engine,” Proc. 13th Int’l Symp. Recent Advances in Intrusion Detection, 2010.
[14] O. Villa, D. Chavarria, and K. Maschhoff, “Input-Independent, Scalable and Fast String Matching on the Cray XMT,” Proc. IEEE Int’l Parallel and Distributed Processing Symp., 2009.

Yi-Hua E. Yang received the BSc degree in electrical engineering from the National Taiwan University, the MSc degree in electrical and computer engineering from the University of Maryland, College Park, and the PhD degree in electrical engineering from the University of Southern California. He is a staff research engineer at Xilinx Research Labs, San Jose, CA. He was a research assistant at the Ming Hsieh Department of Electrical Engineering, University of Southern California, from 2008 to 2011. His research interests include algorithmic optimization on parallel architectures, high-performance architecture designs for network processing, security, and cryptography.

Viktor K. Prasanna (http://ceng.usc.edu/~prasanna) received the BS degree in electronics engineering from Bangalore University, the MS degree from the School of Automation, Indian Institute of Science, and the PhD degree in computer science from the Pennsylvania State University. He is the Charles Lee Powell chair in engineering in the Ming Hsieh Department of Electrical Engineering and professor of computer science at the University of Southern California. His research interests include high-performance computing, parallel and distributed systems, reconfigurable computing, and embedded systems. He is the executive director of the USC-Infosys Center for Advanced Software Technologies (CAST) and an associate director of the USC-Chevron Center of Excellence for Research and Academic Training on Interactive Smart Oilfield Technologies (Cisoft). He also serves as the director of the Center for Energy Informatics at USC. He served as the editor-in-chief of the IEEE Transactions on Computers during 2003-2006. He is currently the editor-in-chief of the Journal of Parallel and Distributed Computing. He was the founding chair of the IEEE Computer Society Technical Committee on Parallel Processing. He is the steering co-chair of the IEEE International Parallel and Distributed Processing Symposium (IPDPS) and the steering chair of the IEEE International Conference on High Performance Computing (HiPC). He is a recipient of the 2009 Outstanding Engineering Alumnus Award from the Pennsylvania State University. He is a fellow of the IEEE, the ACM, and the American Association for the Advancement of Science (AAAS).


Fig. 7. Throughput scaling of HBFA (lines with various hsize) and AC-DFA (lines with size = 484,460) in the number of threads on the 4-socket Intel Xeon X7560 (up to 28 threads in a total of 32 cores).