  • 982 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 22, NO. 3, JUNE 2014

High-Throughput and Memory-Efficient Multimatch Packet Classification Based on Distributed and Pipelined Hash Tables

Yang Xu, Member, IEEE, Zhaobo Liu, Zhuoyuan Zhang, and H. Jonathan Chao, Fellow, IEEE

Abstract: The emergence of new network applications, such as the network intrusion detection system and packet-level accounting, requires packet classification to report all matched rules instead of only the best matched rule. Although several schemes have been proposed recently to address the multimatch packet classification problem, most of them require either huge memory or expensive ternary content addressable memory (TCAM) to store the intermediate data structure, or they suffer from steep performance degradation under certain types of classifiers. In this paper, we decompose the operation of multimatch packet classification from the complicated multidimensional search to several single-dimensional searches, and present an asynchronous pipeline architecture based on a signature tree structure to combine the intermediate results returned from single-dimensional searches. By spreading edges of the signature tree across multiple hash tables at different stages, the pipeline can achieve a high throughput via the interstage parallel access to hash tables. To further exploit intrastage parallelism, two edge-grouping algorithms are designed to evenly divide the edges associated with each stage into multiple work-conserving hash tables. To avoid the collisions involved in hash table lookup, a hybrid perfect hash table construction scheme is proposed. Extensive simulation using realistic classifiers and traffic traces shows that the proposed pipeline architecture outperforms the HyperCuts and B2PC schemes in classification speed by at least one order of magnitude, while having a similar storage requirement. In particular, with different types of classifiers of 4K rules, the proposed pipeline architecture is able to achieve a throughput between 26.8 and 93.1 Gb/s using perfect hash tables.

Index Terms: Hash table, packet classification, signature tree, ternary content addressable memory (TCAM).

    I. INTRODUCTION

AS THE Internet continues to grow rapidly, packet classification has become a major bottleneck of high-speed routers. Most traditional network applications require packet

Manuscript received April 23, 2012; revised March 09, 2013; accepted May 21, 2013; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor P. Crowley. Date of publication July 22, 2013; date of current version June 12, 2014. This work was presented in part at the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), Princeton, NJ, USA, October 19-20, 2009.

Y. Xu and H. J. Chao are with the Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY 11201 USA (e-mail: [email protected]; [email protected]).

Z. Liu is with VMware, Inc., Palo Alto, CA 94304 USA (e-mail: [email protected]).

Z. Zhang is with Teza Technologies, New York, NY 10036 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNET.2013.2270441

classification to return the best or highest-priority matched rule. However, with the emergence of new network applications like the Network Intrusion Detection System (NIDS), packet-level accounting [1], and load balancing, packet classification is required to report all matched rules, not only the best matched rule. Packet classification with this capability is called multimatch packet classification [1]-[5] to distinguish it from the conventional best-match packet classification.

Typical NIDS systems, like SNORT [6], use multimatch

packet classification as a preprocessing module to filter out benign network traffic and thereby reduce the rate of suspect traffic arriving at the content matching module [7], [8], which is more complicated than packet classification and usually cannot run at the line rate in the worst-case situation. As a preprocessing module, packet classification has to check every incoming packet by comparing fields of the packet header against the rules defined in a classifier. To avoid slowing down the NIDS system, packet classification should run at the line rate regardless of the classifiers and traffic patterns.

Many schemes have been proposed in the literature aiming at

optimizing the performance of packet classification in terms of classification speed and storage cost. However, most of them focus only on best-match packet classification [9]-[11]. Although some of them could also be used for multimatch packet classification, they suffer from either a huge memory requirement or steep performance degradation under certain types of classifiers [12], [13]. Ternary content addressable memory (TCAM) is well known for its parallel search capability and constant processing speed, and it is widely used in IP route lookup and best-match packet classification. Due to the limitation of its native circuit structure, TCAM can only return the first matching entry, and therefore cannot be directly used in multimatch packet classification. To enable multimatch packet classification on TCAM, some research works published recently [2]-[5] propose to add redundant intersection rules in TCAM. However, the introduction of redundant intersection rules further increases the already high implementation cost of the TCAM system.

The objective of this paper is to design a high-throughput

and memory-efficient multimatch packet classification scheme without using TCAMs. Given the fact that a single-dimensional search is much simpler and has already been well studied, we decompose the complex multimatch packet classification into two steps. In the first step, single-dimensional searches are performed in parallel to return the matched fields in each dimension. In

1063-6692 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

  • XU et al.: HIGH-THROUGHPUT AND MEMORY-EFFICIENT MULTIMATCH PACKET CLASSIFICATION 983

the second step, a well-designed pipeline architecture combines the results from the single-dimensional searches to find all matched rules. Simulation results show that the proposed pipeline architecture performs very well under all tested classifiers and is able to classify one packet within every 2-10 time-slots, where one time-slot is defined as the time for one memory access. Our main contributions in this paper are summarized as follows.

1) We model the multimatch packet classification as a concatenated multistring matching problem, which can be solved by traversing a signature tree structure.

2) We propose an asynchronous pipeline architecture to accelerate the traversal of the signature tree. By distributing edges of the signature tree into hash tables at different stages, the proposed pipeline can achieve a very high throughput.

3) We propose two edge-grouping algorithms to partition the hash table at each stage of the pipeline into multiple work-conserving hash tables, so that the intrastage parallelism can be exploited. By taking advantage of the properties of the signature tree, the proposed edge-grouping algorithms perform well in solving the location problem, the overhead minimization problem, and the balancing problem involved in the process of hash table partition.

4) We propose a hybrid perfect hash table construction scheme, which can build perfect hash tables for each stage of the pipeline structure, leading to improved performance in both classification speed and storage complexity.

The rest of the paper is organized as follows. Section II reviews the related work. Section III formally defines the multimatch packet classification problem and presents terms to be used in the paper. Section IV introduces the concept of the signature tree, based on which Section V proposes an asynchronous pipeline architecture. Section VI presents two edge-grouping algorithms that are used to exploit intrastage parallel query. Section VII presents the hybrid perfect hashing scheme. In Section VIII, we discuss implementation issues and present the experimental results. Finally, Section IX concludes the paper.

II. RELATED WORK

Many schemes have been proposed in the literature to address

the best-match packet classification problem, such as Trie-based schemes [9], [14], decision tree-based schemes [11], [12], [15], TCAM-based schemes [16]-[24], and two-stage schemes [13], [14], [19], [25].

Due to the high power consumption of TCAM, some

schemes have been proposed to reduce it using one of the following two ideas: 1) reducing the number of TCAM entries required to represent a classifier by using range encoding [19]-[21] or logic optimization [22], [23]; or 2) selectively activating part of the TCAM blocks when performing a classification [24]. Although these schemes reduce the power consumption of TCAM, they cannot be directly applied to multimatch packet classification. To enable multimatch packet classification on TCAM, [1] proposes several schemes that allow TCAM to return all matched entries by searching the TCAM multiple times after adding a discriminator field in TCAM. Consequently, the power consumption and processing time increase linearly with the number of entries matching a packet. According to

    Fig. 1. Segment encoding versus range encoding.

the results in [1] based on 112 access control lists (ACLs) and the SNORT rule set, the average number of entries that one packet can potentially match is about 4-5. Thus, the power consumption and processing time of multimatch packet classification can be 5 times higher than those of best-match packet classification. Some research works published recently [2]-[5] propose to add redundant intersection rules in TCAM to support multimatch packet classification. However, the introduction of redundant intersection rules further increases the already high implementation cost of the TCAM system.

In this paper, we focus on the two-stage schemes, in which the

multidimensional search of packet classification is first decomposed into several single-dimensional searches, and then the intermediate results of the single-dimensional searches are combined to get the final matched rules. To facilitate the combination operation, each field of the rules in the two-stage schemes is usually encoded as either a range ID or several segment IDs. Consider the classifier shown in Fig. 1, which has three two-dimensional rules, each represented by a rectangle. Ranges are defined as the projections of the rectangles along a certain axis. For example, the projections of rules R1, R2, and R3 along axis X form three ranges denoted by X_RG1, X_RG3, and X_RG2, respectively. In contrast, segments are the intervals divided by the boundaries of the projections.

With the segment encoding method, each rule is represented by multiple segment ID combinations, which may cause a serious storage explosion problem [13], [14]. Several schemes [19], [25] have been proposed to address the storage explosion problem by using TCAM and specially designed encoding schemes. However, the use of TCAM increases the power consumption and implementation cost, and more importantly, these schemes can only be used for best-match packet classification.

With the range encoding method, the representation of each rule requires only one range ID combination, and therefore the storage explosion problem involved in segment encoding is avoided. The low storage requirement comes at the price of slow


TABLE I
CLASSIFIER WITH SEVEN RULES

TABLE II
CLASSIFIER AFTER RANGE ENCODING

query speed, which prevents the range encoding method from being used in practical systems. To the best of our knowledge, the only published two-stage classification scheme using range encoding is B2PC [26], which uses multiple Bloom filters to accelerate the validation of range ID combinations. To avoid the slow exhaustive validation, B2PC examines range ID combinations according to a predetermined sequence and returns only the first matched range ID combination, which may not always correspond to the highest-priority matched rule due to the inherent limitation of the B2PC scheme. Furthermore, B2PC cannot support multimatch packet classification.
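The storage cost gap between the two encodings can be made concrete with a small sketch. The projections below are illustrative values, not those of Fig. 1: on one axis, each rule's projection is a single range, while segments are the elementary intervals cut by all projection boundaries, so one rule may need many segment IDs but only one range ID.

```python
# Sketch (illustrative, not the paper's code): count how many elementary
# segments a rule's projection spans on one axis. With range encoding the
# same rule needs exactly one ID, which is why segment encoding explodes.

def segments_covered(rules, rule):
    """rules: list of (lo, hi) projections on one axis (hi exclusive).
    Returns how many elementary segments the given rule spans."""
    bounds = sorted({b for lo, hi in rules for b in (lo, hi)})
    segs = list(zip(bounds, bounds[1:]))          # elementary segments
    lo, hi = rule
    return sum(1 for s_lo, s_hi in segs if lo <= s_lo and s_hi <= hi)

rules_x = [(0, 16), (4, 8), (10, 14)]             # projections on axis X
# Rule (0, 16) is one range, yet it spans every elementary segment.
```

Here the widest rule spans all five elementary segments, so segment encoding would list five ID combinations where range encoding lists one.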

III. PROBLEM STATEMENT

A classifier is a set of rules, sorted in descending order

of priorities. The priorities of rules are usually defined by their rule IDs, where a smaller rule ID means a higher priority. Each rule includes d fields, each of which represents a range on a certain dimension. From a geometric point of view, each rule represents a hyper-rectangle in the d-dimensional space. Since each packet header corresponds to a point P in the d-dimensional space, the problem of conventional best-match packet classification is equivalent to finding the highest-priority hyper-rectangle enclosing point P, while the problem of multimatch packet classification is equivalent to finding all hyper-rectangles enclosing point P.

In order to perform the multimatch packet classification efficiently, given a classifier, we convert it to an encoded counterpart by assigning each distinct range a unique ID on each dimension. Given the classifier in Table I, its encoded counterpart is shown in Table II, in which each entry is the ID of a unique range appearing on the corresponding dimension of the classifier. Given a packet header and an encoded classifier with d dimensions, the multimatch packet classification scheme

proposed in this paper consists of two steps. In the first step, the relevant fields of the packet header are each sent to a single-dimensional search engine, where either prefix-based matching or range-based matching is performed to return all matched

TABLE III
PACKET TO BE CLASSIFIED

TABLE IV
RANGE IDS RETURNED BY SINGLE-DIMENSIONAL SEARCHES

range IDs. Consider the packet header given in Table III: the range IDs returned from the five single-dimensional search engines are shown in Table IV and can form 36 different range ID combinations. Since we have no idea in advance which combinations among the 36 appear in the encoded classifier, we have to examine all 36 combinations, without exception, in the second step to return all valid combinations. Since the single-dimensional search problem has been well addressed in the literature [27], in this paper we focus only on the second step. In the remainder of the paper, packet classification will specifically refer to this second step unless special notation is given.

If we view each range ID as a character, the multimatch

packet classification problem can be modeled as a concatenated multistring matching problem. In this problem, the encoded classifier can be regarded as a set of strings of d characters each. From the encoded classifier, we can obtain d universal character sets, each of which includes the characters

in one column of the encoded classifier. The set of range IDs returned by each single-dimensional search engine is called a matching character set, which is a subset of the corresponding universal character set. The concatenated multistring matching problem is to identify all strings in the encoded classifier that can be constructed by concatenating one character from each of the d matching character sets. The main challenge of the concatenated multistring matching problem is to examine a large number of concatenated strings at an extremely high speed to meet the requirement of high-speed routers.
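The problem statement above admits a direct brute-force rendering, sketched below with hypothetical rule strings and character names (Python is used purely for illustration): every concatenation of one character per matching set is tested against the encoded rule set, at a cost equal to the product of the set sizes.

```python
from itertools import product

# Sketch of concatenated multistring matching: the encoded classifier is
# a set of d-character strings; report the rule ID of every string that
# can be built by taking one character from each matching character set.

def concat_multistring_match(encoded_rules, matching_sets):
    rule_ids = {rule: rid for rid, rule in enumerate(encoded_rules, 1)}
    matches = []
    for combo in product(*matching_sets):     # all concatenations
        if combo in rule_ids:
            matches.append(rule_ids[combo])
    return sorted(matches)

rules = [("F11", "F21"), ("F11", "F22"), ("F12", "F21")]
sets = [{"F11"}, {"F21", "F22"}]              # 1 x 2 = 2 combinations
```

The exhaustive product over the matching sets is exactly what the signature tree and pipeline of Sections IV and V are designed to avoid.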

IV. SIGNATURE TREE

To facilitate the operation of concatenated multistring

matching, we present a data structure named the signature tree to store the strings in the encoded classifier. Fig. 2 shows a signature tree corresponding to the encoded classifier in Table II. Each edge of the tree represents a character, and each node represents a prefix of the strings in the encoded classifier. The ID of each leaf node represents the ID of a rule in the encoded classifier. The concatenated multistring matching can be performed by traversing the signature tree according to the inputs of the d matching character sets. If any of the matching character sets is empty, the result will be NULL. Otherwise, the matching is performed as follows. At the beginning, only the root node is active. The outgoing edges of the root node are examined against the characters in the first matching character


    Fig. 2. Example of signature tree.

set. Each time a match is found, the corresponding node (at level one) pointed to by the matched edge is activated. After the examination, the root node is deactivated, and one or multiple nodes (if at least one matching edge is found) at level one become active. Then, the active nodes at level one are examined one by one against the characters in the second matching character set. A similar procedure repeats for the characters in the remaining matching character sets. In this procedure, active nodes move from low levels to high levels of the signature tree, and eventually the IDs of the active leaf nodes represent the matched rules.

The traversal complexity of the signature tree depends on

many factors, including the size of each matching character set, the number of active nodes at each level while the signature tree is being traversed, and the implementation method of the signature tree. One way of implementing the signature tree is to store each node as a whole data structure and connect parent and child nodes together by pointers in the parent nodes. However, our analysis of real classifiers shows that the numbers of outgoing edges of nodes have a very large deviation (due to the inherent nonuniform distribution of field values in classifiers), which makes the design of a compact node structure a very challenging task. Even if we came up with a compact node structure using the pointer compressing scheme [28], incremental updates and query operations on the signature tree would become extremely difficult. Therefore, in this paper, rather than storing each node as a whole structure, we break up the node and store its edges directly in a hash table. More specifically, each edge of the signature tree takes one entry of the hash table in the form of <source node ID : character, destined node ID>. Here, source

node ID : character means the concatenation of the source node ID and the character in binary mode and works as the key of the hash function, while the destined node ID is the result we hope to get from the hash table access.

Given the fact that the memory access speed is much slower

than logic, the processing speed of a networking algorithm is usually determined by the number of memory accesses (references) required for processing a packet [29]. The memory

Fig. 3. Pipelining architecture for packet classification.

accesses that occur during the signature tree traversal are caused by the hash table lookup. Therefore, given a certain hash table implementation, the number of times that the hash table must be accessed to classify a packet is a good indicator of the processing speed of signature tree-based packet classification. In the following sections, we divide the hash table into multiple partitions to exploit parallel hash table access and thus improve the performance. Here, we introduce two properties of the universal character set and the signature tree, which will be used later.

Property 1: Characters in each universal character set can be

encoded as any bit strings as long as there are no two characters with the same encoding.

Property 2: Nodes on the signature tree can be given any IDs,

as long as there are no two nodes with the same ID at the same level.
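The edge-as-hash-entry representation and the level-by-level traversal described above can be sketched as follows. The node IDs, characters, and rule IDs are illustrative, a Python dict stands in for the hash table, and (per Property 1) the first-level characters are encoded directly as level-1 node IDs:

```python
# Edges stored directly in a hash table keyed by (source node ID,
# character); the value is the destined node ID. Because level-1
# characters are encoded as the level-1 node IDs themselves, the first
# matching set directly yields the active nodes at level 1.

edges = {
    (1, "F21"): 4, (1, "F22"): 5,   # level-1 -> level-2 edges
    (2, "F21"): 6,
    (4, "F31"): 7, (5, "F31"): 8,   # level-2 -> leaves (rule IDs 7, 8)
    (6, "F32"): 9,
}

def traverse(matching_sets):
    """matching_sets[0] holds level-1 node IDs; the later sets hold
    characters. Returns the IDs of reached leaves (matched rules)."""
    if any(not s for s in matching_sets):
        return []                                  # NULL result
    active = list(matching_sets[0])
    for chars in matching_sets[1:]:                # one level per set
        active = [edges[(n, c)] for n in active for c in chars
                  if (n, c) in edges]
    return sorted(active)
```

Each `(n, c) in edges` check corresponds to one hash table access, which is the cost measure used in the rest of the paper.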

V. ASYNCHRONOUS PIPELINE ARCHITECTURE

To improve the traversal speed of the signature tree, we separate the signature tree into d-1 partitions and store the edges of each partition in an individual hash table. More specifically, the outgoing edges of the level-j nodes (j = 1, ..., d-1) are stored in hash table j. An example of the signature tree after partition is shown in Fig. 2, which has four partitions. The corresponding hash tables of these four partitions are shown in Fig. 3. It is worth noting that the outgoing edges of the root node are not stored. This is because the root node is the only node at level 0, and each of its outgoing edges corresponds to exactly one character in the first universal character set. According to Property 1, we can encode each character of the first universal character set using the ID of the corresponding destined level-1 node; for instance, in Fig. 2, each first-level character can simply be given the ID of the level-1 node it leads to. Hence, given the first matching character set, we can immediately get the IDs of the active nodes at level 1.

For a d-dimensional packet classification application, we propose an asynchronous pipeline architecture with d-1 stages. Fig. 3 gives an example of the proposed pipeline architecture with d = 5. It includes d-1 processing modules (PMs). Each PM is attached to an input Character FIFO (CFIFO), an input Active node FIFO (AFIFO), an output AFIFO, and a hash table.
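Entries from different packets share each CFIFO/AFIFO, so every entry carries two ownership bits, S (start) and E (end), as Fig. 4 and the following paragraphs describe. A sketch of the tagging and regrouping, with illustrative tuples standing in for the bit fields:

```python
# Sketch: S and E bits delimit the matching characters / active node IDs
# of one packet inside a shared FIFO. An entry with S = E = 1 is the
# packet's only entry.

def tag(items):
    """Tag one packet's items with (S, E) bits before pushing them."""
    last = len(items) - 1
    return [(int(i == 0), int(i == last), x) for i, x in enumerate(items)]

def regroup(fifo):
    """Split a flat FIFO of (S, E, item) entries back into packets."""
    packets, cur = [], []
    for s, e, x in fifo:
        if s:
            cur = []                  # S bit: a new packet begins
        cur.append(x)
        if e:
            packets.append(cur)       # E bit: the packet is complete
    return packets

fifo = tag(["a1", "a2"]) + tag(["b1"])    # two packets share one FIFO
```

This per-packet delimiting is what lets a PM know when it has consumed all active node IDs of one packet before moving to the next.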


    Fig. 4. Format of entries in CFIFO/AFIFO.

Each CFIFO supplies the connected PM with a set of matching characters returned by the single-dimensional search engine. Each AFIFO delivers active node IDs between adjacent PMs. Each hash table stores the edges in a certain partition of the signature tree.

Since each packet may have multiple matching characters/

active nodes at a stage of the pipeline, two bits in each entry of the CFIFO/AFIFO are used to indicate the ownership of matching characters/active nodes, as shown in Fig. 4. An S bit set to 1 means that the entry is the first matching character/active node of a packet, while an E bit set to 1 means that the entry is the last matching character/active node of a packet. If both the S and E bits are set to 1, the entry is the only matching character/active node of a packet.

When a packet is going to be classified, the relevant fields

of the packet header are first sent to the single-dimensional search engines. Each search engine returns a set of matching characters, representing the matched ranges on the corresponding dimension, to the attached CFIFO (the first search engine returns its matching characters to AFIFO 1). If no matching character is found, a NULL character encoded as all 0's is returned.

In the pipeline, all PMs work in exactly the same way; we therefore focus on a certain PM j and consider the procedure by which a packet is processed at PM j. Suppose that packet P has n active node IDs in AFIFO j, denoted by a1, ..., an, and m matching characters in CFIFO j, denoted by c1, ..., cm. The processing of packet P at PM j consists of the processing of the n active node IDs. In the processing of each active node ID, say ai, PM j takes out the m matching characters from the attached CFIFO and concatenates each of them (if the character is not NULL) to ai to form m hash keys, accessing the attached hash table m times. Results from the hash table indicate the IDs of ai's child nodes, which are pushed into the output AFIFO when it is not full. If the output AFIFO is currently full, the push-in operation, along with the operation of PM j, is suspended until one slot of the output AFIFO becomes available.

During the processing of packet P, if PM j cannot find any

match in the hash table, or if the matching character of the packet is NULL, PM j will push a NULL node ID encoded as all 0's into the output AFIFO to notify the downstream PMs that the packet does not match any rule.

The number of hash table accesses required by PM j to

process packet P is equal to the product of the active node number and the matching character number of packet P, i.e., n × m in this case, if we omit the overhead caused by hash collisions.
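One PM's work for one packet, as just described, can be sketched as follows; the dict-based hash table and the use of `None` for NULL are illustrative conventions, and collision overhead is ignored:

```python
# Sketch: a PM combines every active node ID with every non-NULL
# matching character to form hash keys; hits become the next stage's
# active node IDs, and a NULL node ID (None here) tells downstream PMs
# the packet matches nothing. Accesses = (#active IDs) x (#characters).

def process_at_pm(hash_table, active_ids, chars):
    next_active, accesses = [], 0
    for a in active_ids:
        if a is None:                 # packet already failed upstream
            continue
        for c in chars:
            if c is None:             # NULL character: nothing to query
                continue
            accesses += 1             # one hash table access per key
            child = hash_table.get((a, c))
            if child is not None:
                next_active.append(child)
    return (next_active or [None]), accesses

table = {(4, "F31"): 7, (5, "F31"): 8}
nxt, cost = process_at_pm(table, [4, 5], ["F31"])   # 2 x 1 accesses
```

The returned access count matches the n × m bound stated above, which is what the intrastage partition of Section VI sets out to reduce.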

VI. INTRASTAGE PARALLEL QUERY

The asynchronous pipeline architecture introduced above deploys one hash table at each stage. The processing of each active node ID at each PM may involve multiple accesses to the hash table. To accelerate the processing of each active node ID, we further partition the hash table at each stage to exploit intrastage parallelism. After the intrastage partition, each PM might be associated with multiple hash tables, which can be accessed in parallel. For easy control and to avoid packet out-of-sequence, each PM processes active node IDs in a strictly serial way. That is, if an active node ID is currently being processed (and some hash tables are therefore occupied), the processing of the next active node ID cannot be started, even if there are hash tables available to use.

Before introducing schemes for the intrastage hash table partition, we present several concepts, among which the concept of an independent range set is similar to, but not exactly the same as, the concept of an independent rule set proposed by Sun et al. in [30].

Definition 1: Independent Ranges: Let R1 and R2 be two ranges on a dimension. R1 is called independent of R2 if R1 and R2 do not overlap.

Definition 2: Independent Range Set: Let S be a set of ranges. S is called an independent range set if any two ranges in S are independent.

Definition 3: Independent Characters: Let c1 and c2 be two characters associated with ranges R1 and R2, respectively. c1 is called independent of c2 if R1 is independent of R2.

Definition 4: Independent Character Set: Let S be a set of characters. S is called an independent character set if any two characters in S are independent.

Definition 5: Independent Edges: Suppose e1: <u1 : c1, v1> and e2: <u2 : c2, v2> are two edges in a certain partition of the signature tree. e1 is called dependent on e2 if u1 = u2 and c1 is dependent on c2; otherwise, e1 is called independent of e2.

Definition 6: Independent Edge Set: Let E be a set of edges in a certain partition of the signature tree. E is called an independent edge set if any two edges in E are independent.

Definition 7: Work-Conserving Hash Tables: Suppose we have H hash tables associated with PM j of the pipeline, where an active node ID, say a, is being processed. We say these hash tables are work-conserving for processing a if no hash table is left idle when there are matching characters associated with a waiting for query; in other words, we can always find a free hash table in which an unqueried edge1 of a is stored if not all hash tables are occupied. Hash tables associated with PM j are called work-conserving hash tables if they are work-conserving for processing any active node ID.

A. Edge Grouping

The main objective of the intrastage hash table partition is to guarantee the work-conserving property of the partitioned hash tables, so that the processing throughput of each PM can be

1An unqueried edge of an active node ID could be either a real edge or an unreal edge on the signature tree. The query for an unreal edge returns a search failure.


maximized and made more predictable. Given K work-conserving hash tables and m matching characters, the processing of each active node ID can be finished within ⌈m/K⌉ parallel hash accesses.

Suppose we want to partition the original hash table associated with PM i into K work-conserving hash tables. The most straightforward way is to divide the edges of the original hash table into K independent edge sets, and then store each of them in an individual hash table. This way, we can guarantee the work-conserving property of the partitioned hash tables, because edges to be queried for an active node ID must be dependent on each other and are therefore stored in different hash tables.

However, since K is a user-specified parameter, K hash tables may not be sufficient to avoid dependency among all edges. Therefore, instead of dividing the edges of the original hash table into K independent sets, we divide them into K + 1 sets, denoted by E1, E2, ..., EK, E0, among which the first K sets are all independent edge sets, and the last set E0 is a residual edge set, which stores edges that do not fit into the first K sets. The above action is called edge-grouping. We call edges in the independent edge sets regular edges, and edges in the residual edge set residual edges.

Given the K + 1 edge sets after the edge-grouping, we can store the edges of each independent edge set in an individual hash table, while duplicating the edges of the residual edge set into all K hash tables. When an active node ID is being processed, we first query its regular edges, and then its residual edges. It is easily seen that no hash table is left idle while there is an unqueried edge. Therefore, the work-conserving property of the partitioned hash tables is guaranteed.

Actually, the problem of edge-grouping itself is not difficult.

The main challenge comes from the following three questions.
1) Given an edge (real or unreal), how can we locate the partitioned hash table in which the edge is stored?
2) How can we minimize the overhead caused by the redundancy of residual edges?
3) How can we balance the sizes of the partitioned hash tables?
We name these three problems, respectively, the location problem, the overhead minimization problem, and the balance problem, and present two edge-grouping schemes to deal with them.

B. Character-Based Edge-Grouping

The first edge-grouping scheme is named character-based edge-grouping (CB_EG). Its basic idea is to divide edges according to their associated characters, and then embed the grouping information in the encodings of the characters. More specifically, we reserve the first ⌈log2(K + 1)⌉ bits of each character for the locating prefix, whose value is between 0 and K. If the locating prefix of a character is 0, edges labeled with the character are residual edges, and thus can be found in any partitioned hash table. Otherwise, edges labeled with the character are regular edges and can only be found in the partitioned hash table indexed by the locating prefix. The location problem of edge-grouping is thereby solved.

To address the overhead minimization problem and the balance problem, we model the CB_EG scheme as a weighted character grouping (WCG) problem.

Algorithm 1: Greedy Algorithm for WCG Problem

1: Input:
2: U: the universal character set;
3: K: the number of independent character sets;
4: w(c): the weight of character c;
5: Output:
6: independent character sets G1, G2, ..., GK;
7: residual character set G0;
8: G0 ← ∅;
9: Gj ← ∅ (1 ≤ j ≤ K);
10: W(Gj) ← 0 (1 ≤ j ≤ K); // initialize the weight of each set
11: Sort U in decreasing order of the character weight.
12: If U is empty, return (G0, G1, ..., GK);
13: From U select the character c with the largest weight;
14: Select the set Gj with the smallest weight among sets whose characters are all independent of c. If there is more than one such set, select the first one. If no such set is found, put c into set G0, remove c from set U, and go to step 12;
15: Put c into set Gj; remove c from set U; W(Gj) ← W(Gj) + w(c); Go to step 12.
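A compact Python rendering of the greedy algorithm above (function and variable names are ours):

```python
def greedy_wcg(universe, weight, dependent, K):
    """Greedy heuristic for the weighted character grouping (WCG) problem.

    universe:  the universal character set U (a list);
    weight:    maps a character c to w(c), the number of edges labeled c;
    dependent: predicate, True if two characters' ranges overlap;
    returns K independent character sets G[0..K-1] and the residual G0.
    """
    G = [set() for _ in range(K)]
    W = [0] * K              # running weight of each set
    G0 = set()               # residual character set
    # consider characters in decreasing order of weight (step 11)
    for c in sorted(universe, key=weight, reverse=True):
        # sets whose members are all independent of c (step 14)
        candidates = [j for j in range(K)
                      if all(not dependent(c, x) for x in G[j])]
        if not candidates:
            G0.add(c)        # no set can take c: it becomes residual
            continue
        j = min(candidates, key=lambda k: W[k])  # smallest-weight set
        G[j].add(c)          # step 15
        W[j] += weight(c)
    return G, G0
```

On an input with weights 5, 1, 1 where the third character overlaps the other two, this yields two singleton independent sets and a one-character residual set, as in the hash table partition example discussed in this section.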

Let U be the universal character set associated with PM i. Let G0, G1, ..., GK be K + 1 nonoverlapping character sets divided from U.

Let c be an arbitrary character in U, and let w(c) be its weight function, meaning the number of edges labeled with c in the original hash table.

Let W(Gj) = Σ c∈Gj w(c) be the weight of character set Gj.

Let I(c1, c2) be the dependence indicator: for any c1, c2 ∈ U, if c1 is dependent on c2, I(c1, c2) = 1; otherwise, I(c1, c2) = 0.

The WCG problem is formally described in Table V. The WCG problem is to find a valid configuration of G0, G1, ..., GK that achieves the given objective. We have proved that the WCG problem is NP-hard; the proof is provided in the Appendix. Thus, we propose the greedy algorithm in Algorithm 1 to solve the WCG problem.

According to the K + 1 character sets returned by the greedy

WCG algorithm, we assign each character a locating prefix and divide the edges of the original hash table into K + 1 edge sets. The principle is that for each character c in Gj (1 ≤ j ≤ K), we let j be its locating prefix and allocate its associated edges to edge set Ej; for each character in G0, we let 0 be its locating prefix and allocate its associated edges to edge set E0. After that, we can get K partitioned hash tables by allocating the edges of Ej to hash table j and duplicating the edges of E0 in every hash table.

Let us consider an example, which partitions the last hash table of Fig. 3 into two work-conserving hash tables using the CB_EG scheme. First of all, we get the universal character set associated with PM 4. It has three characters, denoted here as c1, c2, and c3, whose weights are 5, 1, and 1, respectively. With the greedy


    TABLE VWEIGHTED CHARACTER GROUPING PROBLEM

Fig. 5. Two partitioned hash tables from the last hash table in Fig. 3. (a) CB_EG scheme. (b) NCB_EG scheme.

WCG algorithm, the three characters are divided into two independent character sets (G1 and G2) and one residual character set G0, among which G1 contains c1, G2 contains c2, and G0 contains c3. Therefore, edges labeled with characters c1 and c2 are allocated to the first partitioned hash table and the second partitioned hash table, respectively, while edges labeled with character c3 are duplicated in both partitioned hash tables. The final partition result is shown in Fig. 5(a).

We use η to denote the partition efficiency of a hash table partition and define it as

η = e / (K · maxj |Tj|),    (8)

where e is the number of edges in the original hash table, K is the number of partitioned hash tables, and |Tj| is the number of edges stored in the j-th partitioned hash table (every table must be provisioned to the size of the largest one).

In the example above, the partition efficiency is only 7/(2 × 6) = 58.3%, for two reasons: First is the redundancy caused by the residual edges labeled with c3; second is the extreme imbalance between the two partitioned hash tables. This imbalance is caused by an inherent property of the CB_EG scheme, which has to allocate edges labeled with the same character to the same hash table. In the last hash table of Fig. 3, five out of seven edges are labeled with the same character c1. According to the CB_EG scheme, these five edges have to be allocated to the same hash table, which results in the imbalance between the partitioned hash tables.

As a matter of fact, in real classifiers, the second reason degrades the partition efficiency more severely than the first. This is because many fields of real classifiers have very imbalanced distributions of field values. For instance, the transport-layer protocol field of real classifiers is restricted to a small set of field values, such as TCP, UDP, ICMP, etc. Most entries, say 80%, of real classifiers are associated with the TCP protocol. With the CB_EG scheme, edges labeled with TCP have to be allocated to the same hash table, which may cause an extreme imbalance of hash tables, and thus result in low partition efficiency.
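For concreteness, the efficiency figures quoted in this section can be checked with a few lines (the function name is ours; the formula follows the definition of η):

```python
def partition_efficiency(num_original_edges, table_sizes):
    # Partition efficiency: every partitioned table must be provisioned
    # to the size of the largest one, so the useful fraction of slots is
    # e / (K * max table size).
    K = len(table_sizes)
    return num_original_edges / (K * max(table_sizes))
```

Seven original edges split into tables of sizes 6 and 2 give 58.3%, while a balanced split into two tables of four edges each gives 87.5%.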

    C. Node-Character-Based Edge-Grouping

The second edge-grouping scheme is named node-character-based edge-grouping (NCB_EG), which divides edges not only based on their labeled characters, but also based on the IDs of their source nodes. According to Property 2, the ID of each node on the signature tree can be assigned any value as long as no two nodes at the same level are assigned the same ID. With this property, the NCB_EG scheme stores the grouping information of each edge in both the encoding of the edge's associated character and the ID of the edge's source node. More specifically, the NCB_EG scheme reserves the first ⌈log2(K + 1)⌉ bits of each character for the locating prefix, and the first ⌈log2 K⌉ bits of each node ID for the shifting prefix. Given an arbitrary edge e: ⟨u, c⟩, suppose the locating prefix of c is p, and the shifting prefix of u is s. If p equals 0, the edge is a residual edge and can be found in any partitioned hash table. Otherwise, the edge is a regular edge and can only be found in the partitioned hash table indexed by ((p + s − 1) mod K) + 1.

In order to locate the partitioned hash table in which a given

edge is stored using the above principle, we have to divide edges into different edge sets following the same principle. In contrast to CB_EG, the NCB_EG scheme solves the overhead minimization problem and the balance problem in two different steps: the locating prefix assignment (LPA) and the shift prefix assignment (SPA).

The overhead minimization problem is solved in the LPA step, in which the universal character set associated with PM i is divided into K independent character sets and one residual character set. Each character is assigned a locating prefix ranging from 0 to K according to the character set to which it is allocated. The LPA step can also be described by the WCG problem given in Table V, with only the objective changed from (7) to (9), which minimizes the total weight W(G0) of the residual character set.

We cannot find a polynomial-time optimal algorithm to solve the WCG problem with the objective in (9); therefore, we use the greedy WCG algorithm given in Algorithm 1 to solve it.

The purpose of the SPA step is to balance the sizes of the independent edge sets. This is achieved by assigning a shift prefix to each node to adjust the edge sets to which the outgoing edges of the node are allocated. A heuristic algorithm for the shift prefix assignment is given in Algorithm 2.

Consider using the NCB_EG scheme to partition the last hash table of Fig. 3 into two work-conserving hash tables. The LPA


Algorithm 2: Shift Prefix Assignment Algorithm

1: Input:
2: K: the number of independent character sets;
3: independent character sets G1, G2, ..., GK;
4: residual character set G0;
5: Vl: the set of nodes at level l of the signature tree;
6: the set of outgoing edges of nodes in Vl;
7: Output:
8: shift prefixes of nodes in Vl;
9: independent edge sets E1, E2, ..., EK;
10: residual edge set E0;
11: Ej ← ∅ (1 ≤ j ≤ K);
12: E0 ← ∅;
13: Sort nodes of Vl in decreasing order of the number of outgoing edges;
14: for each node u in Vl do
15: Divide the outgoing edges of u into K + 1 sets S0, S1, ..., SK. The principle is that for characters in Gj (0 ≤ j ≤ K), put their associated outgoing edges into Sj;
16: Select the largest edge set Sa among S1, S2, ..., SK; if there are multiple largest edge sets, select the first one;
17: Select the smallest edge set Eb among E1, E2, ..., EK; if there are multiple smallest edge sets, select the first one;
18: Let s ← (b − a) mod K, and set the shift prefix of u as s; // align Sa to Eb to achieve balance among E1, ..., EK
19: for each set Sj (1 ≤ j ≤ K) do
20: Move edges from set Sj to set E((j+s−1) mod K)+1;
21: end for
22: Move edges from set S0 to set E0;
23: end for
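The following sketch implements the heuristic of Algorithm 2; the data layout and the shift-prefix formula s = (b − a) mod K are our reconstruction of the garbled listing.

```python
def shift_prefix_assignment(out_edges, char_group, K):
    """Heuristic shift prefix assignment (Algorithm 2) sketch.

    out_edges:  {node: [character, ...]}, outgoing edges per level-l node;
    char_group: character -> 0 for the residual set G0, or j in 1..K
                for independent character set Gj;
    returns (shift prefix per node, edge sets E, with E[0] the residual).
    """
    E = {j: [] for j in range(K + 1)}       # E[0] is the residual edge set
    shift = {}
    # step 13: nodes in decreasing order of out-degree (stable on ties)
    for node in sorted(out_edges, key=lambda n: -len(out_edges[n])):
        # step 15: split this node's edges by character group
        S = {j: [] for j in range(K + 1)}
        for c in out_edges[node]:
            S[char_group[c]].append((node, c))
        # steps 16-17: largest S_a, smallest E_b (first one on ties)
        a = max(range(1, K + 1), key=lambda j: (len(S[j]), -j))
        b = min(range(1, K + 1), key=lambda j: (len(E[j]), j))
        s = (b - a) % K                     # step 18: align S_a with E_b
        shift[node] = s
        for j in range(1, K + 1):           # steps 19-21: shifted move
            E[((j + s - 1) % K) + 1].extend(S[j])
        E[0].extend(S[0])                   # step 22: residual edges
    return shift, E
```

On a reconstruction of this section's example (six nodes, one character of weight 5), the two independent edge sets come out with three edges each; with the single residual edge duplicated into both tables, each table holds four edges, giving the 87.5% efficiency reported below.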

step of NCB_EG is the same as that for CB_EG. With the greedy WCG algorithm, we get two independent character sets (G1 and G2) and one residual character set G0, among which G1 contains c1, G2 contains c2, and G0 contains c3. Therefore, the locating prefixes of c1, c2, and c3 are 1, 2, and 0, respectively. Then, the SPA algorithm is used to assign each level-4 node of the signature tree a shift prefix. Since node n1 has two outgoing edges, while the other nodes have only one, it is the first to be assigned a shift prefix. Since all independent edge sets (E1 and E2) are empty at the beginning, we assign a shift prefix of 0 to n1. Based on the shift prefix of n1 and the locating prefixes of the characters, the regular edge ⟨n1, c1⟩ is allocated to E1, and the residual edge ⟨n1, c3⟩ is allocated to E0. n2 is the second node to be assigned a shift prefix. In order to balance the sizes of E1 and E2, the shift prefix of n2 is set to 1, so that the edge ⟨n2, c1⟩ is allocated to E2 according to the locating prefix of c1. Similarly, n3, n4, n5, and n6 are each assigned a shift prefix, and their outgoing edges are allocated to the corresponding edge sets. The final partitioned hash tables, after the edge-grouping, are shown in Fig. 5(b), where the residual edge ⟨n1, c3⟩ is duplicated in both hash tables.

In this example, the hash table partition efficiency is 7/(2 × 4) = 87.5%, which is higher than that with the CB_EG scheme. The main reason for this improved partition efficiency is that the NCB_EG scheme is capable of spreading edges labeled with character c1 into different hash tables, so that a better balance among the partitioned hash tables is achieved.

    VII. PERFECT HASH TABLE CONSTRUCTION

One major problem involved in hash table design is hash collision, which may increase the number of memory accesses involved in each hash table lookup and slow down the packet classification speed. Therefore, a perfect hash table with the guarantee of no hash collision is desirable to support high-speed packet classification. Although there are many perfect hashing and alternative algorithms available in the literature, most of them require a complicated procedure to generate the hash index (e.g., traversing a tree-like structure) [31], or need more than one memory access in the worst case to find the correct location storing the hash key among multiple potential locations [32], [33].

In this section, we present a hybrid perfect hash table construction scheme, which uses an arbitrary uniform hash function and guarantees that the searched hash key can be located with one memory access. For a given level of the signature tree, the proposed perfect hash table construction scheme can be used to create a single hash table (as in Fig. 3) or multiple partitioned hash tables (as in Fig. 5). For simplicity, we consider the case in which each level of the signature tree has one single hash table. The main idea behind the scheme is that if a hash collision occurs when placement of a new key into the hash table is attempted, the collision might be avoided if the value of the key can be changed.

The hash key in the hash tables in Fig. 3 is the concatenation of the source node ID and the character. According to Properties 1 and 2 summarized in Section IV, we can rename either the source node or the character to change the value of a hash key and achieve a collision-free placement. When a source node at a stage is renamed to avoid collision, we also need to update the destination node ID of its pointing edge at the previous stage. Since a change of the destination node ID field does not affect the hash entry location, this process requires only a minor modification of the hash table at the previous stage.

The main challenge involved in this process is that when a node or character is renamed, some existing hash keys in the hash table might have to be replaced if they are associated with the renamed node or character. We should avoid fluctuation of the hash table, i.e., hash keys constantly being added to and removed from the hash table.

The hybrid perfect hash table construction consists of three steps: In the first step, we convert the edges in the partition into an equivalent bipartite graph model; in the second step, we decompose the bipartite graph into groups of edges and associate each group with either a node vertex or a character vertex; in the third step, we place these edges into the perfect hash table. Edges in the same group are placed into the hash table as a whole.

  • 990 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 22, NO. 3, JUNE 2014

Fig. 6. Bipartite graph model associated with partition 2 of Fig. 2. In (b), the number close to each vertex indicates the sequence in which the corresponding vertex is carved from the graph. (a) Bipartite graph model. (b) Decomposition of the bipartite graph.

1) Bipartite Graph Model: Given a certain partition of the signature tree (for example, partition 2 in Fig. 2), we build a bipartite graph [as shown in Fig. 6(a)] in which each vertex on the left side represents a node in the partition (called a node vertex), each vertex on the right side represents a character in the partition (called a character vertex), and each edge represents an edge in the partition.

2) Bipartite Graph Decomposition: On the bipartite graph, we perform a least-connected-vertex-first traversal. In each iteration, we select the least connected vertex (randomly choosing one if there are several) and carve the vertex, as well as its connected edges, out of the graph. The edges that are carved together are regarded as being in the same group, and they are associated with the vertex that is carved together with them. This recursive process repeats until the bipartite graph becomes empty.

Given the bipartite graph in Fig. 6(a), the process of decomposition is shown in Fig. 6(b). The first vertex carved from the bipartite graph is a character vertex, since it is one of the vertices with the fewest connected edges; the edge connected to it is carved out of the bipartite graph together with it. After the first carving, a node vertex becomes the vertex with the fewest connected edges and is carved next. The edge group associated with this node vertex is empty, since it has no edge connected at the moment it is carved. The above process continues, and after the entire decomposition, each vertex is associated with a group of edges (some groups can be empty).

3) Perfect Hash Table Construction: Then, we start to add

carved edges into the hash table in reverse order of the decomposition. In other words, the edges removed from the bipartite graph first are the last to be placed into the hash table. The edges that were carved out of the bipartite graph together are placed into the hash table as a group. Any collision occurring during the placement of a group causes the renaming of the associated dimension (either the source node ID or the character) and the replacement of the entire group. The reason we place the groups of edges in reverse order of the decomposition is that the edge groups removed early from the bipartite graph are all very small (most of them have only one edge). Hence, placing them into the hash table at the end increases the success probability when the hash table becomes full.

During the placement of a group of edges, the name of their associated vertex is also decided. Once a group of edges is successfully stored in the hash table, the name of its associated vertex is settled and will not be changed again.

It should be noted that the first group placed into the hash table is always empty. Consider the decomposition result of Fig. 6(b). The group associated with the last carved vertex is the first placed into the hash table; it is empty, so that vertex can be labeled with any value. The second group to be placed has only one edge. Since it is the first edge placed into the hash table, there is no collision, and the group's vertex can be assigned any value. The third group to be placed also has only one edge. If there is a collision when this edge is placed, we rename the vertex associated with the group to resolve the collision. Note that the other dimension of the edge has its name already settled prior to this placement and will not be renamed to resolve the collision. The above process repeats until all edge groups are placed into the hash table without collision.

It is easy to see that with the hybrid perfect hash table construction solution, all edges are broken into many small independent groups. A collision occurring during the placement of a group does not affect the groups already placed into the hash table, so the process converges, and a perfect hash table can be constructed. It is worth mentioning that the hash function used in the hybrid perfect hash table construction does not have to be invertible. An arbitrary universal hash function can be used here.

VIII. IMPLEMENTATION ISSUES AND PERFORMANCE EVALUATION

    A. Scalability and Incremental Update

The proposed pipeline architecture supports an arbitrary number of dimensions. To add/delete a dimension, we only need to add/remove a PM along with its associated single-dimensional search engine, CFIFO, AFIFO, and hash tables.

The pipeline architecture also supports incremental updates of rules. To add/remove a rule, we traverse the signature tree along the path representing the rule and add/remove the corresponding edges in the hash tables. Considering the overhead involved in perfect hash table updates, we suggest using conventional hashing schemes to build the hash tables if the classifier is expected to be updated. Since the complexity of an insertion/removal operation in a hash table is O(1), the pipeline architecture has a very low complexity for incremental updates.

    B. Storage Complexity

Suppose the maximum number of rules supported by the proposed pipeline architecture is N, the maximum number of hash tables used at each stage is K, and the number of total dimensions is d. The storage requirement of the pipeline architecture mainly comes from two parts: 1) the single-dimensional search engines; 2) the hash tables and AFIFOs/CFIFOs in the pipeline.

  • XU et al.: HIGH-THROUGHPUT AND MEMORY-EFFICIENT MULTIMATCH PACKET CLASSIFICATION 991

The storage requirement of each single-dimensional search engine depends greatly on the type of field associated with the dimension. For instance, if the type of field is transport-layer protocol, a 256-entry table could be used as the search engine, which requires no more than 1 kB of memory. If the type of field is source/destination IP address, an IP lookup engine must be used as the single-dimensional search engine, which might require tens of kilobytes of memory [27].

For hash tables in the pipeline, we use two different hash

schemes. The first is the conventional hashing scheme, which uses a simple linear probing scheme to resolve hash collisions [34]. In order to achieve a low hash collision rate, the conventional hash scheme uses a relatively low load factor (0.5 in our evaluation). The second is the hybrid perfect hashing scheme presented in Section VII, which can easily build a perfect hash table with a high load factor (0.8 in our evaluation). The storage requirement for the hash tables at each stage is determined by the number of edges associated with that stage, e; the number of bits used to represent each edge, b; the load factor of the hash tables, γ; and the partition efficiency, η, when multiple hash tables are used. The storage requirement M (in bits) can be represented by

M = e · b / (γ · η).    (10)

Since each edge is represented by ⟨s, c, t⟩, where s is the ID of the source node of the edge, c is the character labeled on the edge, and t is the ID of the destination node of the edge, the number of bits required to represent each edge is equal to the sum of the numbers of bits used for representing s, c, and t. It is easily seen that the number of nodes at each level of the signature tree is no more than N; therefore, s and t can each be represented by ⌈log2 K⌉ + ⌈log2 N⌉ bits, where the first ⌈log2 K⌉ bits are the shift prefix, and the last ⌈log2 N⌉ bits are used to uniquely identify the node at each level of the signature tree. The number of characters in the universal character set of each dimension is equal to the number of unique ranges on that dimension. It is easy to see that the number of unique ranges on each dimension is no more than 2N. Therefore, c could be encoded as ⌈log2(K + 1)⌉ + ⌈log2 2N⌉ bits, where the first ⌈log2(K + 1)⌉ bits are the locating prefix, and the last ⌈log2 2N⌉ bits are used to uniquely identify the character (range) on the dimension. To sum up, the number of bits used for representing each edge can be obtained as

b = 2(⌈log2 K⌉ + ⌈log2 N⌉) + ⌈log2(K + 1)⌉ + ⌈log2 2N⌉.    (11)

The number of edges to be stored at each stage of the pipeline is bounded by N; therefore, e ≤ N. If we assume the hash table partition efficiency η is 1 (shortly, we will show that the partition efficiency of the NCB_EG scheme is close to 1), and substitute it along with e ≤ N and (11) into (10), we can get the total storage requirement of the hash tables at each stage as

M ≤ N[2(⌈log2 K⌉ + ⌈log2 N⌉) + ⌈log2(K + 1)⌉ + ⌈log2 2N⌉]/γ.    (12)

Since there are d stages in the pipeline, the total memory requirement is bounded by d · M/8 bytes. The actual memory requirement is much less than this number, since the number of edges in the first few stages is much less than N. In addition, by using the perfect hash table with a high load factor, the memory requirement can be further reduced.

Regarding AFIFOs and CFIFOs, we will later show that each AFIFO/CFIFO needs only a small piece of memory, say eight entries, to achieve good enough performance. If d = 5, the total storage requirement for the five AFIFOs and four CFIFOs is less than 200 B, which is insignificant compared to the storage requirement of the hash tables.

As a whole, the total storage requirement of the pipeline architecture, excluding the single-dimensional search engines, is bounded by d · M/8 bytes. If we substitute the parameters used in our evaluation (N = 4K rules), the storage requirement is bounded by 192 kB, which makes our pipeline architecture among the most compact packet classification schemes proposed in the literature, even if we account for the memory required by the single-dimensional search engines. The actual memory requirement of the proposed pipeline will be presented in Section VIII-C using classifiers created by ClassBench [35].

    C. Performance Evaluation

Three packet classification schemes are used in the evaluation: the proposed pipeline architecture, B2PC [26], and HyperCuts [12]. We implemented the first two schemes in C++ and obtained the HyperCuts implementation from [36]. Since the processing speed of a networking algorithm is usually determined by the number of memory accesses required for processing a packet, we count and compare the average number of memory accesses required to process a packet under the different schemes. In this paper, we define one time-slot as the time required for one memory access. To evaluate the performance of the proposed pipeline architecture, we use the ClassBench tool suite developed by Taylor to generate classifiers and traffic traces [35]. Three types of classifiers are used in the evaluation: ACL, Firewalls (FW), and IP Chains (IPC). We generate two classifiers for each type using the provided filter seed files and name them ACL1, ACL2, FW1, FW2, IPC1, and IPC2, each of which has five dimensions and about 4K rules.2

We first evaluate the partition efficiency. Table VI shows the

partition efficiencies of CB_EG and NCB_EG under classifiers ACL1, FW1, and IPC1 with different numbers of partitioned hash tables (K) at each stage. Apparently, NCB_EG always outperforms CB_EG and can achieve a partition efficiency higher than 90% in most situations. The only exception is the Destination IP field, where the partition efficiency achieved by NCB_EG ranges from 67% to 96%. The reason for this relatively low percentage is that level 1 of the signature tree, which is associated with the Destination IP field, has fewer nodes than the other levels. These few level-1 nodes, however, each have a relatively large fan-out, which lowers the efficiency of the SPA algorithm and increases the imbalance between partitioned hash tables. Fortunately, the number of edges associated with the first stage of the pipeline is far less than that associated with the other stages. Therefore, the relatively low partition efficiency does not increase the storage requirement too much.

2The generated rules are slightly fewer than 4K because of the existence of redundant rules [35].


TABLE VI
PARTITION EFFICIENCY OF CB_EG AND NCB_EG AT DIFFERENT STAGES OF THE PIPELINE WITH DIFFERENT CLASSIFIERS

Fig. 7. Time-slots for generating one result versus AFIFO size, when the conventional hash scheme is used.

In the remainder of this section, all simulations are conducted with the NCB_EG scheme.

In the proposed pipeline, when an AFIFO becomes full, the backpressure prevents the upstream PM from processing new active node IDs; therefore, the size of an AFIFO might affect the throughput of the pipeline to a certain extent. Fig. 7 shows the relationship between AFIFO size and the average number of time-slots needed for exporting one classification result when the conventional hash scheme is used. The curves in the figure show that the throughput of the pipeline is not sensitive to AFIFO size. When the AFIFO size is larger than 8, the pipeline achieves stable throughput regardless of the classifier type or the value of K. Further increasing the AFIFO size does not lead to significant throughput improvement. Therefore, in the remainder of our simulations, the sizes of AFIFOs are all set to eight entries.

In Fig. 8, we compare the proposed pipeline architecture to

HyperCuts and the B2PC scheme in terms of the average time-slots required for each classification operation. In the comparison, we use two different hash schemes to implement the distributed hash tables: the conventional hash scheme with a load factor of 0.5, and the proposed hybrid perfect hash scheme with a load factor of 0.8. Since the original B2PC scheme was designed to return only the most specific rule, we changed it to return all matched rules. The bucket size of HyperCuts is set to 16, and its space factor is set to 4 (optimized for speed). We suppose that each memory access of HyperCuts reads 64 bits. Based on the results in [27], we assume the single-dimensional search engines are able to return a search result every 2.2 memory accesses (time-slots).

Fig. 8 shows that for IPC2, the proposed pipeline can complete one classification operation every 3.79 time-slots even when there is only one (conventional) hash table at each stage. If we use perfect hash tables to replace the conventional hash tables, the performance improves to 2.87 time-slots per classification operation. The reason the pipeline with perfect hash tables performs better is that conventional hash tables may introduce hash collisions during the lookup operations, thereby increasing the number of memory accesses for each operation. For IPC2, there is just a slight performance improvement (from 3.79 to 2.22 time-slots/operation) when K increases from 2 to 4. This is because the packet classification speed has already reached the speed limit of the single-dimensional search engines. In contrast, for ACL1 the proposed pipeline architecture needs about 20 time-slots to finish a packet classification when K = 1. The performance gets significantly better when K increases to 4, where the proposed pipeline can export one classification result every 10.52 time-slots with conventional hash tables and 6.27 time-slots with perfect hash tables.

The proposed pipeline architecture is very robust. It significantly outperforms the HyperCuts and B2PC schemes for all tested classifiers. Although part of the performance improvement is gained from the parallelism of the pipeline (in fact, the B2PC scheme also employs many parallel Bloom filters to accelerate its classification speed), the use of parallelism does not increase the overall storage cost, thanks to the high partition efficiency provided by the NCB_EG scheme. For ACL2, FW1, and IPC1, the HyperCuts scheme requires more than 200 time-slots on average to perform each packet classification. The reason for this slow processing speed is that these classifiers have many overlapping ranges at the source/destination IP address fields and the source/destination port fields. The large number of overlapping ranges can cause a large number of rules to be replicated in leaves [12], which leads to a steep increase in the number of memory accesses. Although


    Fig. 8. Average time-slots required for each classification operation under different schemes.

TABLE VII
HASH COLLISION RATE WHEN CONVENTIONAL HASH SCHEME IS USED

the authors in [12] claimed that the performance of HyperCuts could be improved by using a pipeline, it is unclear yet what performance the pipelined version of HyperCuts would achieve, since the last stage of the pipelined version still needs to search a large number of replicated rules in leaf nodes.

From Fig. 8, it is easy to see that using the perfect hash tables can significantly improve the classification speed (by 30%–50% under most classifiers and settings). Table VII shows the hash table collision rate under different classifiers and settings when the conventional hash scheme is used to build the hash tables.

In Table VIII, we compare the storage cost of the proposed pipeline with those of HyperCuts and B2PC. The storage cost of the proposed pipeline consists of two major parts: 1) single-dimensional search engines, and 2) hash tables. According to the single-dimensional search proposed in [27], the storage requirement for a single-dimensional search engine is 78 kB, or 312 kB for four engines (the storage requirement of the single-dimensional search engine for the transport-layer protocol field is omitted here). The calculation of the storage cost of hash tables is based on (10) and (11), where the load factors of conventional hash tables and perfect hash tables are configured as 0.5 and 0.8, respectively.

The small storage requirement allows the proposed pipeline

to fit into a commodity FPGA, in which the hash tables could be implemented by on-chip SRAM. Suppose the on-chip SRAM access frequency is 400 MHz, and the smallest size of an IP packet is 64 B. From Fig. 8, the proposed pipeline without perfect hash tables achieves the best performance under IPC2

TABLE VIII
STORAGE REQUIREMENT OF DIFFERENT SCHEMES

(with 2.25 time-slots/operation) and the worst performance under ACL1 (with 10.52 time-slots/operation), which leads to a throughput between 19.5 and 91 Gb/s. With the perfect hash tables, the proposed pipeline achieves the best performance under IPC2 and FW2 (with 2.2 time-slots/operation) and the worst performance under ACL2 (with 7.65 time-slots/operation), leading to an improved throughput of between 26.8 and 93.1 Gb/s.
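The throughput figures above follow directly from the time-slot counts: each time-slot is one on-chip SRAM access at 400 MHz, and throughput is quoted for minimum-size 64-B packets. A short sketch of the arithmetic (constant and function names are ours, not from the paper):

```python
# One time-slot = one on-chip SRAM access at 400 MHz (per the text).
SRAM_FREQ_HZ = 400e6
# Minimum IP packet of 64 B = 512 bits classified per operation.
MIN_PKT_BITS = 64 * 8

def throughput_gbps(slots_per_op: float) -> float:
    """Line rate in Gb/s for a given average time-slots/operation."""
    ops_per_sec = SRAM_FREQ_HZ / slots_per_op
    return ops_per_sec * MIN_PKT_BITS / 1e9

# Conventional hash tables: best case (IPC2) and worst case (ACL1).
print(round(throughput_gbps(2.25), 1))   # ~91.0 Gb/s
print(round(throughput_gbps(10.52), 1))  # ~19.5 Gb/s
# Perfect hash tables: best case (IPC2/FW2) and worst case (ACL2).
print(round(throughput_gbps(2.2), 1))    # ~93.1 Gb/s
print(round(throughput_gbps(7.65), 1))   # ~26.8 Gb/s
```

The printed values reproduce the 19.5–91 Gb/s and 26.8–93.1 Gb/s ranges reported above.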

    IX. CONCLUSION

In this paper, we model multimatch packet classification as a concatenated multistring matching problem, which can be solved by traversing a flat signature tree. To speed up the traversal of the signature tree, the edges of the signature tree are divided into different hash tables in both vertical and horizontal directions. These hash tables are then connected together by a pipeline architecture, and they work in parallel when packet classification operations are performed. A perfect hash table construction is also presented, which guarantees that each hash table lookup can be finished in exactly one memory access. Because of the large degree of parallelism and the elaborately designed edge partition scheme, the proposed pipeline architecture is able to achieve an ultra-high packet classification speed with a very low storage requirement. Simulation results show that the proposed pipeline architecture outperforms the HyperCuts and B2PC schemes in classification speed by at least one order of magnitude, with a storage requirement similar to that of those schemes.


APPENDIX
PROOF: THE WCG PROBLEM IS AN NP-HARD PROBLEM

We first introduce an Average-Division Problem, which has been proved in [37] to be an NP-complete problem.

Given a finite set A and a weight function w: A → R+, we ask whether there is a subset A' ⊆ A satisfying

    Σ_{a∈A'} w(a) = (1/2) Σ_{a∈A} w(a)    (13)

The Average-Division Problem can be defined as a language:

AVG_DIV = { ⟨A, w⟩ : w is a function from A to R+, and there exists a subset A' ⊆ A such that (13) holds }

To prove that the WCG problem is NP-hard, we first introduce the decision problem of WCG and then define it as a language. We will show that the language of AVG_DIV is reducible to the language of WCG in polynomial time.

The decision problem of the WCG problem is formally defined as follows: Given a field set, a set number, a dependence indicator, a weight function, and a real number, we ask whether there exists a character grouping scheme that satisfies the constraints (1)–(6) defined in the WCG problem, as well as the new constraint

(14)

We define the WCG problem as a language:

WCG = { ⟨F, k, w, t⟩ :
F: a set of fields, each defined in the form of a range [l, h], where l and h are two real numbers satisfying l ≤ h;
k: the number of sets;
w: a weight function from F to R+;
t: a real number;
and there exists a character grouping scheme satisfying the constraints above }

Now we will show that the AVG_DIV language can be reduced to the WCG language in polynomial time. Let ⟨A, w⟩ be an instance of AVG_DIV. We construct an instance of WCG as follows.

Let the fields be independent; thus constraints (4) and (5) in Table V can be eliminated. For each element of A, introduce a field carrying the same weight. Let the set number be 2, and the bound be half of the total weight. Now the constraints can be reconstructed as follows:

    (15)

    (16)

(17)
(18)

    (19)

It is easy to know that (19) is equivalent to

(20)

Next, we will show that the AVG_DIV instance is a yes-instance if and only if the constructed WCG instance is a yes-instance.

Suppose there exists an allocation scheme satisfying (15)–(20). By (18) and (20), the elements allocated to one of the two sets carry exactly half of the total weight, witnessing membership in AVG_DIV. Conversely, suppose there exists a subset A' satisfying (13). Allocating the fields corresponding to A' to one set and the remaining fields to the other, (15)–(20) are all satisfied.

As we have seen, AVG_DIV can be reduced to WCG in polynomial time; thus, WCG is an NP-hard problem.

REFERENCES

[1] K. Lakshminarayanan, A. Rangarajan, and S. Venkatachary, "Algorithms for advanced packet classification with ternary CAMs," in Proc. ACM SIGCOMM, New York, NY, USA, 2005, pp. 193–204.

[2] M. Faezipour and M. Nourani, "Wire-speed TCAM-based architectures for multimatch packet classification," IEEE Trans. Comput., vol. 58, no. 1, pp. 5–17, Jan. 2009.

[3] M. Faezipour and M. Nourani, "CAM011: A customized TCAM architecture for multi-match packet classification," in Proc. IEEE GLOBECOM, Dec. 2006, pp. 1–5.

[4] F. Yu, R. H. Katz, and T. V. Lakshman, "Efficient multimatch packet classification and lookup with TCAM," IEEE Micro, vol. 25, no. 1, pp. 50–59, Jan. 2005.

[5] F. Yu, T. V. Lakshman, M. A. Motoyama, and R. H. Katz, "SSA: A power and memory efficient scheme to multi-match packet classification," in Proc. ACM ANCS, New York, NY, USA, 2005, pp. 105–113.

[6] Snort, "A free lightweight network intrusion detection system for UNIX and Windows," 2013 [Online]. Available: http://www.snort.org

[7] S. Kumar, J. Turner, and J. Williams, "Advanced algorithms for fast and scalable deep packet inspection," in Proc. ACM/IEEE ANCS, New York, NY, USA, 2006, pp. 81–92.

[8] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner, "Algorithms to accelerate multiple regular expressions matching for deep packet inspection," in Proc. ACM SIGCOMM, New York, NY, USA, 2006, pp. 339–350.

[9] P. Gupta and N. McKeown, "Algorithms for packet classification," IEEE Netw., vol. 15, no. 2, pp. 24–32, Mar.–Apr. 2001.

[10] H. J. Chao and B. Liu, High Performance Switches and Routers. Hoboken, NJ, USA: Wiley-IEEE Press, 2007.

[11] P. Gupta and N. McKeown, "Classifying packets with hierarchical intelligent cuttings," IEEE Micro, vol. 20, no. 1, pp. 34–41, Jan.–Feb. 2000.

[12] S. Singh, F. Baboescu, G. Varghese, and J. Wang, "Packet classification using multidimensional cutting," in Proc. ACM SIGCOMM, New York, NY, USA, 2003, pp. 213–224.

[13] T. V. Lakshman and D. Stiliadis, "High-speed policy-based packet forwarding using efficient multi-dimensional range matching," in Proc. ACM SIGCOMM, New York, NY, USA, 1998, pp. 203–214.

[14] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, "Fast and scalable layer four switching," Comput. Commun. Rev., vol. 28, pp. 191–202, Oct. 1998.

[15] B. Vamanan, G. Voskuilen, and T. N. Vijaykumar, "EffiCuts: Optimizing packet classification for memory and throughput," in Proc. ACM SIGCOMM, New York, NY, USA, 2010, pp. 207–218.

[16] H. Liu, "Efficient mapping of range classifier into ternary-CAM," in Proc. 10th HOTI, Washington, DC, USA, 2002, pp. 95–100.

[17] A. Liu, C. Meiners, and Y. Zhou, "All-match based complete redundancy removal for packet classifiers in TCAMs," in Proc. 27th IEEE INFOCOM, Apr. 2008, pp. 111–115.

[18] B. Vamanan and T. N. Vijaykumar, "TreeCAM: Decoupling updates and lookups in packet classification," in Proc. 7th ACM CoNEXT, New York, NY, USA, 2011, pp. 27:1–27:12.

[19] J. v. Lunteren and T. Engbersen, "Fast and scalable packet classification," IEEE J. Sel. Areas Commun., vol. 21, no. 4, pp. 560–571, May 2003.

[20] Y.-K. Chang, C.-I. Lee, and C.-C. Su, "Multi-field range encoding for packet classification in TCAM," in Proc. IEEE INFOCOM, Apr. 2011, pp. 196–200.

[21] A. Bremler-Barr, D. Hay, and D. Hendler, "Layered interval codes for TCAM-based classification," in Proc. IEEE INFOCOM, Apr. 2009, pp. 1305–1313.

[22] R. McGeer and P. Yalagandula, "Minimizing rulesets for TCAM implementation," in Proc. IEEE INFOCOM, Apr. 2009, pp. 1314–1322.

[23] R. Wei, Y. Xu, and H. Chao, "Block permutations in Boolean space to minimize TCAM for packet classification," in Proc. IEEE INFOCOM, Mar. 2012, pp. 2561–2565.

[24] Y. Ma and S. Banerjee, "A smart pre-classifier to reduce power consumption of TCAMs for multi-dimensional packet classification," Comput. Commun. Rev., vol. 42, no. 4, pp. 335–346, Aug. 2012.

[25] D. Pao, Y. K. Li, and P. Zhou, "An encoding scheme for TCAM-based packet classification," in Proc. 8th ICACT, Feb. 2006, vol. 1, pp. 470–475.

[26] I. Papaefstathiou and V. Papaefstathiou, "Memory-efficient 5D packet classification at 40 Gbps," in Proc. 26th IEEE INFOCOM, May 2007, pp. 1370–1378.

[27] I. Papaefstathiou and V. Papaefstathiou, "An innovative low-cost classification scheme for combined multi-gigabit IP and Ethernet networks," in Proc. IEEE ICC, Jun. 2006, vol. 1, pp. 211–216.

[28] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink, "Small forwarding tables for fast routing lookups," Comput. Commun. Rev., vol. 27, pp. 3–14, Oct. 1997.

[29] G. Varghese, Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices. San Francisco, CA, USA: Morgan Kaufmann, 2004.

[30] X. Sun, S. K. Sahni, and Y. Q. Zhao, "Packet classification consuming small amount of memory," IEEE/ACM Trans. Netw., vol. 13, no. 5, pp. 1135–1145, Oct. 2005.

[31] N. S. Artan and H. Chao, "TriBiCa: Trie bitmap content analyzer for high-speed network intrusion detection," in Proc. IEEE INFOCOM, May 2007, pp. 125–133.

[32] R. Pagh and F. Rodler, "Cuckoo hashing," in Proc. ESA, 2001, pp. 121–133.

[33] S. Kumar, J. Turner, and P. Crowley, "Peacock hashing: Deterministic and updatable hashing for high performance networking," in Proc. IEEE INFOCOM, Apr. 2008, pp. 101–105.

[34] G. L. Heileman and W. Luo, "How caching affects hashing," in Proc. 7th ALENEX, 2005, pp. 141–154.

[35] D. E. Taylor and J. S. Turner, "ClassBench: A packet classification benchmark," IEEE/ACM Trans. Netw., vol. 15, no. 3, pp. 499–511, Jun. 2007.

[36] H. Song, "Multidimensional cuttings (HyperCuts)," 2006 [Online]. Available: http://www.arl.wustl.edu/~hs1/project/hypercuts.htm

[37] K. Zheng, C. Hu, H. Lu, and B. Liu, "A TCAM-based distributed parallel IP lookup scheme and performance analysis," IEEE/ACM Trans. Netw., vol. 14, no. 4, pp. 863–875, Aug. 2006.

Yang Xu (S'05–M'07) received the Bachelor of Engineering degree from Beijing University of Posts and Telecommunications, Beijing, China, in 2001, and the Master of Science and Ph.D. degrees in computer science and technology from Tsinghua University, Beijing, China, in 2003 and 2007, respectively.

He is a Research Assistant Professor with the Department of Electrical and Computer Engineering, Polytechnic Institute of New York University, Brooklyn, NY, USA, where he was a Visiting Assistant Professor from 2007 to 2008. His research interests include data center networks, network on chip, and high-speed network security.

Zhaobo Liu received the B.S. and M.S. degrees in electronics engineering from Beijing Institute of Technology, Beijing, China, in 2004 and 2006, respectively, and the M.S. degree in telecommunication networks from the Polytechnic Institute of New York University, Brooklyn, NY, USA, in 2009.

He is currently working as a Software Engineer with Vmware, Inc., Palo Alto, CA, USA.

Zhuoyuan Zhang received the B.S. degree in electrical engineering from the Polytechnic Institute of New York University, Brooklyn, NY, USA, in 2010.

He has been with the Cloud Drive Team, Amazon, Inc., Seattle, WA, USA, and is currently with Teza Technologies, New York, NY, USA.

H. Jonathan Chao (S'82–M'83–SM'95–F'01) received the Ph.D. degree in electrical engineering from The Ohio State University, Columbus, OH, USA, in 1985.

He is the Department Head and a Professor of electrical and computer engineering with the Polytechnic Institute of New York University, Brooklyn, NY, USA, which he joined in 1992. During 2000 to 2001, he was Co-Founder and CTO of Coree Networks, Tinton Falls, NJ, USA, where he led a team to implement a multiterabit multi-protocol label switching (MPLS) switch router with carrier-class reliability. From 1985 to 1992, he was a Member of Technical Staff with Telcordia, Piscataway, NJ, USA. He holds 45 patents and has published over 200 journal and conference papers. He has also served as a consultant for various companies, such as Huawei, Lucent, NEC, and Telcordia. He coauthored three networking books, Broadband Packet Switching Technologies (Wiley, 2001), Quality of Service Control in High-Speed Networks (Wiley, 2001), and High-Performance Switches and Routers (Wiley, 2007). He has been doing research in the areas of network designs in data centers, terabit switches/routers, network security, network on the chip, and biomedical devices.

Prof. Chao is a Fellow of the IEEE for his contributions to the architecture and application of VLSI circuits in high-speed packet networks. He received the Telcordia Excellence Award in 1987.