
Design and Implementation of a Consolidated Middlebox Architecture

Vyas Sekar∗, Norbert Egi††, Sylvia Ratnasamy†, Michael K. Reiter?, Guangyu Shi††

∗ Intel Labs, † UC Berkeley, ? UNC Chapel Hill, †† Huawei

Abstract

Network deployments handle changing application, workload, and policy requirements via the deployment of specialized network appliances or “middleboxes”. Today, however, middlebox platforms are expensive and closed systems, with little or no hooks for extensibility. Furthermore, they are acquired from independent vendors and deployed as standalone devices with little cohesiveness in how the ensemble of middleboxes is managed. As network requirements continue to grow in both scale and variety, this bottom-up approach puts middlebox deployments on a trajectory of growing device sprawl with corresponding escalation in capital and management costs.

To address this challenge, we present CoMb, a new architecture for middlebox deployments that systematically explores opportunities for consolidation, both at the level of building individual middleboxes and in managing a network of middleboxes. This paper addresses key resource management and implementation challenges that arise in exploiting the benefits of consolidation in middlebox deployments. Using a prototype implementation in Click, we show that CoMb reduces the network provisioning cost 1.8–2.5× and reduces the load imbalance in a network by 2–25×.

1 Introduction

Network appliances or “middleboxes” such as WAN optimizers, proxies, intrusion detection and prevention systems, network- and application-level firewalls, caches and load-balancers have found widespread adoption in modern networks. Several studies report on the rapid growth of this market; the market for network security appliances alone was estimated to be 6 billion dollars in 2010 and expected to rise to 10 billion in 2016 [9]. In other words, middleboxes are a critical part of today’s networks and it is reasonable to expect that they will remain so for the foreseeable future.

Somewhat surprisingly then, there has been relatively little research on how middleboxes are built and deployed. Today’s middlebox infrastructure has developed in a largely uncoordinated manner—a new form of middlebox typically emerging as a one-off solution to a specific need, “patched” into the infrastructure through ad-hoc and often manual techniques.

This bottom-up approach leads to two serious forms of inefficiency. The first is inefficiency in the use of infrastructure hardware resources. Middlebox applications are typically resource intensive and each middlebox is independently provisioned for peak load. Today, because each middlebox is deployed as a separate device, these resources cannot be amortized across applications even though their workloads offer natural opportunities to do so. (We elaborate on this in §3.) Second, a bottom-up approach leads to inefficiencies in management. Today, each type of middlebox application has its own custom configuration interface, with no hooks or tools that offer network administrators a unified view by which to manage middleboxes across the network.

As middlebox deployments continue to grow in both scale and variety, these inefficiencies are increasingly problematic—middlebox infrastructure is on a trajectory of growing device sprawl with corresponding escalation in capital and management costs. In §2, we present measured and anecdotal evidence that highlights these concerns in a real-world enterprise environment.

This paper presents CoMb,¹ a top-down design for middlebox infrastructure that aims to tackle the above inefficiencies. The key observation in CoMb is that the above inefficiencies arise because middleboxes are built and managed as standalone devices. To address this, we turn to the age-old idea of consolidation and systematically re-architect middlebox infrastructure to exploit opportunities for consolidation. Corresponding to the inefficiencies, CoMb targets consolidation at two levels:
1. Individual middleboxes: In contrast to standalone, specialized middleboxes, CoMb decouples the hardware and software, and thus enables software-based implementations of middlebox applications to run on a consolidated hardware platform.²

2. Managing an ensemble of middleboxes: CoMb consolidates the management of different middlebox applications/devices into a single (logically) centralized controller that takes a unified, network-wide view—generating configurations and accounting for policy requirements across all traffic, all applications, and all network locations. This is in contrast to today’s approach where each middlebox application and/or device is managed independently.

In a general context, the above strategies are not new. There is a growing literature on centralized network management (e.g., [21, 31]), and consolidation is commonly used in data centers. To our knowledge, however, there has been no work on quantifying the benefits of consolidation for middlebox infrastructure, nor any in-depth attempt to re-architect middleboxes (at both the device- and network-level) to exploit consolidation.

¹ The name CoMb captures our goal of Consolidating Middleboxes.
² As we discuss in §4, this hardware platform can comprise both general-purpose and specialized components.

Consolidation effectively “de-specializes” middlebox infrastructure since it forces greater modularity and extensibility. Typically, moving from a specialized architecture to one that is more general results in less, not more, efficient resource utilization. We show, however, that consolidation creates new opportunities for efficient use of hardware resources. For example, within an individual box, we can reduce resource requirements by leveraging previously unexploitable opportunities to multiplex hardware resources and reuse processing modules across different applications. Similarly, consolidating middlebox management into a network-wide view exposes the option of spatially distributing middlebox processing to use resources at different locations.

However, the benefits of consolidation come with challenges. The primary challenge is that of resource management, since middlebox hardware resources are now shared across multiple heterogeneous applications and across the network. We thus need a resource management solution that matches demands (i.e., what subset of traffic needs to be processed by each application, what resources are required by different applications) to resource availability (e.g., CPU cycles and memory at various network locations). In §4 and §5, we develop a hierarchical strategy that operates at two levels—network-wide and within an individual box—to ensure the network’s traffic processing demands are met while minimizing resource consumption.

We prototype a CoMb network controller leveraging off-the-shelf optimization solvers. We build a prototype CoMb middlebox platform using Click [30] running on general-purpose server hardware. As test applications we use: (i) existing software implementations of middlebox applications (that we use with minimal modification) and (ii) applications that we implement using a modular datapath. (The latter were developed to capture the benefits of processing reuse.) Using our prototype and trace-driven evaluations, we show that:
• At a network-wide level, CoMb reduces aggregate resource consumption by a factor 1.8–2.5× and reduces the maximum per-box load by a factor 2–25×.

• Within an individual box, CoMb imposes minimal overhead for existing middlebox applications. In the worst case, we record a 0.7% performance drop relative to running the same applications independently on dedicated hardware.

Roadmap: In the rest of the paper, we begin with a motivating case study in §2. §3 highlights the new efficiency opportunities with CoMb and §4 describes the design of the network controller. We describe the design of each CoMb box in §5 and our prototype implementation in §6. We evaluate the potential benefits and overheads with CoMb in §7. We discuss concerns about isolation and deployment in §8. We present related work in §9 before concluding in §10.

Appliance type                 Number
Firewalls                      166
NIDS                           127
Conferencing/Media gateways    110
Load balancers                 67
Proxy caches                   66
VPN devices                    45
WAN optimizers                 44
Voice gateways                 11
Middleboxes total              636
Routers                        ≈ 900

Table 1: Devices in the enterprise network

2 Motivation

We begin with anecdotal evidence in support of our claim that middlebox deployments constitute a vital component in modern networks and the challenges that arise therein. Our observations are based on a study of middlebox deployment in a large enterprise network and discussions with the enterprise’s administrators. The enterprise spans tens of sites and serves more than 80K users [36].

Table 1 summarizes the types and numbers of different middleboxes in the enterprise and shows that the total number of middleboxes is comparable to the number of routers! Middleboxes are thus a vital portion of the enterprise’s network infrastructure. We further see a large diversity in the type of middleboxes; other studies suggest similar diversity in ISPs and datacenters as well [13, 26].

The administrators indicated that middleboxes represent a significant fraction of their (network) capital expenses and expressed the belief that processing complexity contributes to high capital costs. They expressed further concern over anticipated mounting costs. Two nuggets emerged from their concerns. First, they revealed that each class of middleboxes is currently managed by a dedicated team of administrators. This is in part because the enterprise uses different vendors for each application in Table 1; the understanding required to manage and configure each class of middlebox leads to inefficient use of administrator expertise and significant operational expenses. The lack of high-level configuration interfaces further exacerbates the problem. For example, significant effort was required to manually tune what subset of traffic should be directed to the WAN optimizers to balance the tradeoff between the bandwidth savings and appliance load. The second nugget of interest was their concern that the advent of consumer devices (e.g., smartphones, tablets) is likely to increase the need for in-network capabilities [9]. The lack of extensibility in middleboxes today inevitably leads to further appliance sprawl, with associated increases in capital and operating expenses.

Despite these concerns, administrators reiterated the value they find in such appliances, particularly in supporting new applications (e.g., teleconferencing), increasing security (e.g., IDS), and improving performance (e.g., WAN optimizers).

3 CoMb: Overview and Opportunities

The previous discussion shows that even though middleboxes are a critical part of the network infrastructure, they remain expensive, closed platforms that are difficult to extend, and difficult to manage. This motivates us to rethink how middleboxes are designed and managed. We envision an alternative architecture, called CoMb, wherein software-centric implementations of middlebox applications are consolidated to run on a shared hardware platform, and managed in a logically centralized manner.

The qualitative benefits of this proposed architecture are easy to see. Software-based solutions reduce the cost and development cycles to build and deploy new middlebox applications (as independently argued in parallel work [12]). Consolidating multiple applications on the same physical platform reduces device sprawl; we already see early commercial offerings in this regard (e.g., [3]). Finally, the use of centralization to simplify network management is also well known [31, 21, 15].

While the qualitative appeal is evident, there are practical concerns with respect to efficiency. Typically, moving from a specialized architecture to one that is more general and extensible results in less efficient resource utilization. However, as we show next, CoMb introduces new efficiency opportunities that do not arise with today’s middlebox deployments.

3.1 Application multiplexing

Consider a WAN optimizer and IDS running at an enterprise site. The former optimizes file transfers between two enterprise sites and may see peak load at night when system backups are run. In contrast, the IDS may see peak load during the day because it monitors users’ web traffic. Suppose the volumes of traffic processed by the WAN optimizer and IDS at two time instants t1, t2 are 10, 50 packets and 50, 10 packets respectively. Today each hardware device must be provisioned to handle its peak load, resulting in a total provisioning cost corresponding to 2 × max{10, 50} = 100 packets. A CoMb box running both a WAN optimizer and the IDS on the same hardware can flexibly allocate resources as the load varies. Thus, it needs to be provisioned to handle the peak total load of 60 packets, or 40% fewer resources.

Figure 1: Middlebox utilization peaks at different times (normalized utilization over time for a WAN optimizer, proxy, load balancer, and firewall).

Figure 2: Reusing lower-layer software modules (session management, protocol parsers) across middlebox applications (VPN, WAN optimizer, IDS, proxy, firewall).

Figure 1 shows a time series of the utilization of four middleboxes at an enterprise site, each normalized by its maximum observed value. Let NormUtil^t_app denote the normalized utilization of the device app at time t. Now, to quantify the benefits of multiplexing, we compare the sum of the peak utilizations Σ_app max_t{NormUtil^t_app} = 4 and the peak total utilization max_t{Σ_app NormUtil^t_app} = 2.86. For the workload shown in Figure 1, multiplexing requires (4 − 2.86)/4 = 28% fewer resources.
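To make the comparison concrete, the short sketch below recomputes the sum-of-peaks versus peak-of-sums numbers; the utilization values are the hypothetical WAN optimizer/IDS loads from the example at the start of this subsection, not measured data.

```python
# Sketch of the multiplexing comparison in Section 3.1 (hypothetical load values):
# standalone provisioning pays the sum of per-application peaks, while a consolidated
# box only needs the peak of the summed load.
utilization = {
    "wan_optimizer": [10, 50],   # load at epochs t1, t2 (e.g., packets)
    "ids":           [50, 10],
}

sum_of_peaks = sum(max(series) for series in utilization.values())
num_epochs = len(next(iter(utilization.values())))
peak_of_sums = max(sum(series[t] for series in utilization.values())
                   for t in range(num_epochs))

savings = (sum_of_peaks - peak_of_sums) / sum_of_peaks
print(f"standalone: {sum_of_peaks}, consolidated: {peak_of_sums}, savings: {savings:.0%}")
# -> standalone: 100, consolidated: 60, savings: 40%
```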

3.2 Reusing software elements

Each middlebox typically needs low-level modules for packet capture, parsing headers, reconstructing session state, parsing application-layer protocols and so on. If the same traffic is processed by many applications—e.g., HTTP traffic is processed by an IDS, proxy, and an application firewall—each appliance has to repeat these common actions for every packet. When these applications run on a consolidated platform, we can potentially reuse these basic modules (Figure 2).

Consider an IDS and proxy. Both need to reconstruct session- and application-layer state before running higher-level actions. Suppose each device needs 1 unit of processing per packet. For the purposes of this example, let us assume that these common tasks contribute 50% of the overall processing cost. Both appliances process HTTP traffic, but may also process traffic unique to each context; e.g., IDS processes UDP traffic which the proxy ignores. Suppose there are 10 UDP packets and 45 HTTP packets. The total resource requirement is (IDS = 10 + 45) + (Proxy = 45) = 100 units. The setup in Figure 2 with reusable modules avoids duplicating the common tasks for HTTP traffic and needs 45 × 0.5 = 22.5 fewer resources.
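The same arithmetic, written out as a sketch with the example’s assumed numbers (10 UDP and 45 HTTP packets, common modules at 50% of the per-packet cost):

```python
# Sketch of the reuse computation in Section 3.2 (numbers from the worked example).
common_fraction = 0.5                 # share of per-packet cost spent in reusable modules
ids_load   = {"udp": 10, "http": 45}  # packets the IDS must process
proxy_load = {"http": 45}             # packets the proxy must process

standalone = sum(ids_load.values()) + sum(proxy_load.values())   # 100 units
overlap = min(ids_load["http"], proxy_load["http"])              # traffic seen by both
with_reuse = standalone - overlap * common_fraction              # 100 - 22.5
print(f"standalone: {standalone}, with module reuse: {with_reuse}")
```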

As this example shows, this reduction depends on the traffic overlap across applications and the contribution of the reusable modules. To measure the overlap, we obtain configurations for Bro [32] and Snort³ and the configuration for a WAN optimizer. Then, using flow-level traces from Internet2, we find that the traffic overlap between applications is typically 64–99% [36]. Our benchmarks in §7.1 show that common modules contribute 26–88% across applications.

3.3 Spatial distribution

Consider the topology in Figure 3 with three nodes N1–N3 and three end-to-end paths P1–P3. The traffic on these paths peaks at 30 packets at different times as shown. Suppose we want all traffic to be monitored by IDSes. Today’s default deployment is an IDS at each ingress N1, N2, and N3 for monitoring traffic on P1, P2, and P3 respectively. Each such IDS needs to be provisioned to handle the peak volume of 30 units, with a total network-wide cost of 90 units.

Figure 3: Spatial distribution as traffic changes. Traffic on paths P1 (N1→N2), P2 (N2→N3), and P3 (N3→N1) peaks at 30 packets at different times; the per-node assignment tables show how N1–N3 each handle 20 units of processing in every epoch.

With a centralized network-wide view, however, we can spatially distribute the IDS responsibilities. That is, each IDS at N1–N3 processes a fraction of the traffic on the paths traversing the node (e.g., [37]). For example, at time T1, N1 uses 15 units for P1 and 5 for P3; N2 uses 15 units for P2 and 5 for P3; and N3 devotes all 20 units to P3. We can generate similar configurations for the other times as shown in Figure 3. Thus, distribution reduces the total provisioning cost by (90 − 60)/90 = 33% compared to an ingress-only deployment. Note that this is orthogonal to application multiplexing and software reuse.

Using time-varying traffic matrices from Internet2 and the Enterprise network, we find that spatial distribution can provide 33–55% savings in practice.

3.4 CoMb Overview

Building on these opportunities, we envision the architecture in Figure 4. Each CoMb box runs multiple software-based applications (e.g., IDS, proxy). These applications can be obtained from independent vendors and could differ in their software architectures (e.g., standalone vs. modular). CoMb’s network controller assigns processing responsibilities across the network. Each CoMb middlebox receives this configuration and allocates hardware resources to the different applications.

³ www.snort.org

Figure 4: The network controller takes the AppSpec, BoxSpec, and NetworkSpec as input and assigns processing responsibilities to each CoMb box.

4 CoMb Network Controller

In this section, we describe the design of CoMb’s network controller and the management problem it solves to assign network-wide middlebox responsibilities.

4.1 Input Parameters

We begin by describing the three high-level inputs that the network controller needs.
• AppSpec: For each application m (e.g., IDS, proxy, firewall), the AppSpec specifies: (1) T_m, the traffic that m needs to run on (e.g., what ports and prefixes), and (2) policy constraints that the administrator wants to enforce across different middlebox instances. These constraints specify the processing order for each packet [25]. For example, all web traffic goes through a firewall, then an IDS, and finally a web proxy. Most middlebox applications today operate at a session-level granularity and we assume each m operates at this granularity.⁴

• NetworkSpec: This has two components: (1) a description of end-to-end routing paths and the location of the middlebox nodes on each path, and (2) a partition T = {T_c}_c of all traffic into classes. Each class T_c can be specified with a high-level description of the form “port-80 sessions initiated by hosts at ingress A to servers in egress B” or described by more precise traffic filters defined on the IP 5-tuple (e.g., srcIP=10.1.*.*, dstIP=10.2.*.*, dstport=80, srcport=*). For brevity, we assume each class T_c has a single end-to-end path with the forward and reverse flows within a session following the same path (in opposite directions). Each application m subscribes to one or more of these traffic classes; i.e., T_m ∈ 2^T.

⁴ It is easy to extend to applications that operate at per-packet or per-flow granularity; we do not discuss this for brevity.

• BoxSpec: This captures the capabilities of the middlebox hardware: Prov_{n,r} is the amount of resource r (e.g., CPU, memory) that node n is provisioned with, in units suitable for that resource. Each platform may optionally support specialized accelerators (e.g., GPU units or crypto co-processors).

Given the hardware configurations, we also need the (expected) per-session resource footprint, on the resource r, of running the application m. Each m may have some affinity for hardware accelerators; e.g., some IDSes use hardware-based DPI. These requirements may be strict (i.e., the application only works with hardware support) or opportunistic (i.e., offload for better performance). Now, the middlebox hardware at each node n may or may not have such accelerators. Thus, we use generalized resource footprints F_{m,r,n} that depend on the specific middlebox node to account for the presence or absence of such hardware accelerators. Specifically, the footprint will be higher on a node without an optional hardware accelerator, where the application needs to emulate this feature in software.⁵

In practice, these inputs are already available or easy to obtain. The NetworkSpec for routing and traffic information is collected for other network management applications [20]. The traffic classes and policy constraints in AppSpec and the hardware capacities Prov_{n,r} are known; we simply require that these be made available to the network controller. The only component that imposes new effort is the set of F_{m,r,n} values in BoxSpec. These can be obtained by running offline benchmarks similar to §7; even this effort is required infrequently (e.g., only after hardware upgrades).
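For concreteness, one plausible in-memory encoding of these inputs is sketched below; the class and field names (AppSpec, NetworkSpec, BoxSpec, paths, volumes, prov, footprint) are our own illustrative choices, not interfaces defined by CoMb.

```python
# Sketch (assumed field names) of the controller inputs described in Section 4.1.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class AppSpec:
    name: str                         # application m, e.g. "ids" or "proxy"
    traffic_classes: List[str]        # T_m: the classes m subscribes to
    policy_order: List[str]           # processing-order constraint involving m

@dataclass
class NetworkSpec:
    paths: Dict[str, List[str]]       # traffic class c -> ordered nodes on its path
    volumes: Dict[str, float]         # traffic class c -> |T_c|

@dataclass
class BoxSpec:
    prov: Dict[Tuple[str, str], float]            # (node n, resource r) -> Prov_{n,r}
    footprint: Dict[Tuple[str, str, str], float]  # (app m, resource r, node n) -> F_{m,r,n}
```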

4.2 Problem Formulation

Given these inputs, the controller’s goal is to assign processing responsibilities to middleboxes across the network. There are three high-level types of constraints that this assignment should satisfy:

1. Processing coverage: We need to ensure that each session of interest to middlebox application m will be processed by an instance of m along that session’s routing path.

2. Policy dependencies: For each session, we have to respect the policy ordering constraints (e.g., firewall before proxy) across middlebox applications that need to process this session.

3. Reuse dependencies: We need to model the potential for reusing common actions across middlebox applications (e.g., session reassembly in Figure 2).

⁵ To capture strict requirements, where some application cannot run without a hardware accelerator, we set the F values for nodes without this accelerator to ∞ or some large value.

Figure 5: Each hyperapp is a single logical task whose footprint is equivalent to taking the logical union of its constituent actions. In the example, HyperApp1 = m1 & m2 (HTTP traffic), HyperApp2 = m1 (UDP), and HyperApp3 = m2 (NFS); the common action’s cost is folded into each hyperapp’s per-resource footprint.

Given these constraints, we can consider different management objectives: (1) minimizing the cost to provision the network, min Σ_{n,r} Prov_{n,r}, to handle a given set of traffic patterns, or (2) load balancing to minimize the maximum load across the network, min max_{n,r}{load_{n,r}}, for the current workload and provisioning regime.

Unfortunately, the reuse and policy dependencies make this optimization problem intractable.⁶ So, we consider a practical, but constrained, operating model that eliminates the need to explicitly capture these dependencies. The main idea is that all applications pertaining to a given session run on the same node. That is, if some session i needs to be processed by applications m1 and m2 (and nothing else), then we force both m1 and m2 to process session i on the same node. As an example, let us consider two applications: m1 (say IDS) processes HTTP and UDP traffic and m2 (say WAN optimizer) processes HTTP and NFS traffic. Now, consider an HTTP session i. In theory, we could run m1 on node n1 and m2 on node n2 for this session i. Our model constrains both m1 and m2 for session i to run on n1. Note that we can still assign different sessions to other nodes. That is, for a different HTTP session i′, m1 and m2 could run on n2.

Having chosen this operational model, for each traffic class c we identify the exact sequence of applications that need to process sessions belonging to c. We call each such sequence a hyperapp. Formally, if h_c is the hyperapp for the traffic class c, then ∀m: T_c ∈ T_m ⇔ m ∈ h_c. (Note that different classes could have the same hyperapp.) Figure 5 shows the three hyperapps for the previous example: one for HTTP traffic (processed by both m1 and m2), and one each for UDP and NFS traffic, processed by either m1 or m2 but not both. Each hyperapp also statically defines the policy order across its constituent applications.
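The hyperapp construction can be sketched as a small pre-processing routine; the inputs here (a per-application set of traffic classes and a single global policy order) are simplifying assumptions for illustration.

```python
# Sketch (assumed inputs): derive the hyperapp h_c for each traffic class c as the
# policy-ordered sequence of applications subscribed to that class.
from typing import Dict, List, Set, Tuple

def build_hyperapps(app_classes: Dict[str, Set[str]],
                    policy_order: List[str]) -> Dict[str, Tuple[str, ...]]:
    """app_classes maps application m -> T_m; policy_order is a global ordering."""
    all_classes = set().union(*app_classes.values())
    return {c: tuple(m for m in policy_order if c in app_classes[m])
            for c in all_classes}

# Example from Figure 5: m1 handles HTTP and UDP, m2 handles HTTP and NFS.
print(build_hyperapps({"m1": {"http", "udp"}, "m2": {"http", "nfs"}}, ["m1", "m2"]))
# e.g. {'http': ('m1', 'm2'), 'udp': ('m1',), 'nfs': ('m2',)}  (dict order may vary)
```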

This model has three practical advantages. First, it eliminates the need to capture the individual actions within a middlebox application and their reuse dependencies. Similar to the per-session resource footprint F_{m,r,n} of the middlebox application m on resource r, we can define the per-session hyperapp footprint of the hyperapp h on resource r as F_{h,r,n}. This implicitly accounts for the common actions across applications within h. Note that the right-hand side of Figure 5 does not show the common action; instead, we include the costs of the common action when computing the F values for each hyperapp. Identifying the hyperapps and their F values requires a pre-processing step that takes exponential time as a function of the number of applications. Fortunately, this is a one-time task and there are only a handful (< 10) of applications in practice.

⁶ We show the precise formulation in a technical report [35].

Second, it obviates the need to explicitly model the ordering constraints across applications. Because all applications relevant to a session run on the same node, enforcing policy ordering can be implemented as a local scheduling decision on each CoMb box (§5).

Third, it simplifies the traffic model. Instead of considering the coverage on a per-session basis, we consider the total volume of traffic in each class. Thus, we can consider the management problem in terms of deciding the fraction of traffic belonging to the class c that each node n has to process (i.e., run the hyperapp h_c). Let d_{c,n} denote this fraction and let |T_c| denote the volume of traffic for class c.

The optimization problem can be expressed by the linear program shown in Eq (1)–Eq (4). (For brevity, we show only the load balancing objective.) Eq (2) models the stress or load on each resource at each node in terms of the aggregate processing costs (i.e., product of the traffic volume and the footprints) assigned to this node. Here, n ∈ path_c denotes that node n is on the routing path for the traffic in T_c. Eq (3) simply specifies a coverage constraint so that the fractional responsibilities across the nodes on the path for each class c add up to 1.

Minimize  max_{n,r} {load_{n,r}},  subject to:                                    (1)

∀ n,r:  load_{n,r} = Σ_{c: n ∈ path_c}  d_{c,n} · |T_c| · F_{h_c,r,n} / Prov_{n,r}   (2)

∀ c:  Σ_{n ∈ path_c}  d_{c,n} = 1                                                 (3)

∀ c,n:  0 ≤ d_{c,n} ≤ 1                                                           (4)

The controller solves this optimization to find the optimal set of d_{c,n} values specifying the per-class responsibilities of each middlebox node. Then it maps these values into device-level configurations per middlebox. We defer a discussion of the mapping step to §6.
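A minimal sketch of Eq (1)–(4) is shown below. The prototype controller uses CPLEX (§6.1); here we substitute scipy.optimize.linprog as an off-the-shelf stand-in, and the function name and input dictionaries (solve_controller, paths, volumes, footprint, prov) are our own illustrative choices.

```python
# Sketch of Eq(1)-(4): variables are the fractions d_{c,n} for every (class, node-on-path)
# pair, plus one auxiliary variable z for the maximum load being minimized.
import numpy as np
from scipy.optimize import linprog

def solve_controller(paths, volumes, footprint, prov, resources):
    """paths: {c: [nodes on path_c]}, volumes: {c: |T_c|},
    footprint: {(c, r, n): F_{h_c,r,n}}, prov: {(n, r): Prov_{n,r}}."""
    pairs = [(c, n) for c, nodes in paths.items() for n in nodes]
    idx = {p: i for i, p in enumerate(pairs)}
    z = len(pairs)                                 # index of the max-load variable
    nvars = len(pairs) + 1

    all_nodes = sorted({n for nodes in paths.values() for n in nodes})
    A_ub, b_ub = [], []                            # Eq(2) folded into Eq(1): load_{n,r} <= z
    for n in all_nodes:
        for r in resources:
            row = np.zeros(nvars)
            for c, nodes in paths.items():
                if n in nodes:
                    row[idx[(c, n)]] = volumes[c] * footprint[(c, r, n)] / prov[(n, r)]
            row[z] = -1.0
            A_ub.append(row); b_ub.append(0.0)

    A_eq, b_eq = [], []                            # Eq(3): coverage for each class
    for c, nodes in paths.items():
        row = np.zeros(nvars)
        for n in nodes:
            row[idx[(c, n)]] = 1.0
        A_eq.append(row); b_eq.append(1.0)

    cost = np.zeros(nvars); cost[z] = 1.0          # Eq(1): minimize z
    bounds = [(0.0, 1.0)] * len(pairs) + [(0.0, None)]   # Eq(4), and z >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return {p: res.x[i] for p, i in idx.items()}, res.x[z]
```

The returned d_{c,n} values correspond to the per-class responsibilities that the controller subsequently maps to device-level configurations (§6.1).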

5 CoMb Single-box Design

We now turn to the design of a single CoMb box. As described in the previous section, the output of the network controller is an assignment of processing responsibilities to each CoMb box. This assignment specifies:
• a set of (traffic class, fraction) pairs {(T_c, d_{c,n})} that describes what traffic (type and volume) each CoMb box needs to process.
• the hyperapp h_c associated with each traffic class T_c, where each hyperapp is an ordered set of one or more middlebox applications.

We start with our overall system architecture and then describe how we parallelize this architecture over a CoMb box’s hardware resources.

Figure 6: Logical view of a CoMb box with three layers: classification, policy enforcement, and middlebox applications.

Figure 7: An example with two hyperapps: m1 followed by m2, and m2 followed by m3. The hyperapp-per-core architecture clones m2.

5.1 System Architecture

At a high level, packet processing within a CoMb box comprises three logical stages as shown in Figure 6. An incoming packet must first be classified, to identify what traffic class T_c it belongs to. Next, the packet is handed to a policy enforcement layer responsible for steering the packet between the different applications corresponding to the packet’s traffic class, in the appropriate order. Finally, the packet is processed by the appropriate middlebox application(s). Of these, classification and policy enforcement are a consequence of our consolidated design and hence we aim to make these as lightweight as possible. We elaborate on the role and design options for each stage next.

Classification: The CoMb box receives a stream of undifferentiated packets. Since different packets may be processed by different applications, we must first identify what traffic class a packet belongs to. There are two broad design options here. The first is to do the classification in hardware. Many commercial appliances rely on custom NICs for sophisticated high-speed classification and even commodity server NICs today support such capabilities [4]. A common feature across these NICs is that they support a large number of hardware queues (on the NIC itself) and can be configured to triage incoming packets into these queues using certain functions (typically exact-, prefix- and range-matches) defined on the packet headers. The second option is software-based classification—incoming packets are classified entirely in software and placed into one of multiple software queues.

The tradeoff between the two options is one of efficiency vs. flexibility. Software classification is fully general and programmable but consumes significant processing resources; e.g., Ma et al. report general software-based classification at 15 Gbps (comparable to a commodity NIC) on an 8-core Intel Xeon X5550 server [28].

Our current implementation assumes hardware classification. From an architectural standpoint, however, the two options are equivalent in the abstraction they expose to the higher layers: multiple (hardware or software) queues, with packets from a traffic class T_c mapped to a dedicated queue.

We assume that the classifier has at least as many queues as there are hyperapps. This is reasonable since existing commodity NICs already have 128/256 queues per interface, specialized NICs even more, and software-based classification can define as many as needed. For example, with 6 applications, the worst-case number of hyperapps and virtual queues is 2^6 = 64, which today’s commodity NICs can support.

Policy Enforcer: The job of the policy enforcement layer is to ‘steer’ a packet in the correct order between the different applications associated with the packet’s hyperapp. We need such a layer because the applications on a CoMb box could come from independent vendors and we want to run applications such that they are oblivious to our consolidation. Hence, for a hyperapp comprised of (say) IDS followed by Proxy, the IDS application would not know to send the packet to the Proxy for further processing. Since we do not want to modify applications, we introduce a lightweight policy shim (pshim) layer.

We leverage the above classification architecture to design a very lightweight policy enforcement layer. We simply associate a separate instance of a pshim with each output queue of the classifier. Since each queue only receives packets for a single hyperapp, the associated pshim knows that all the packets it receives are to be “routed” through the identical sequence of applications.

Thus, beyond retaining the sequence of applications for its associated hyperapp/traffic class, the pshim does not require any complex annotation of packets or any per-session state. In fact, if the hyperapp consists of a single application, the pshim is essentially a NOP.

Applications: Our design supports two application software architectures: (1) standalone software processes (that run with little or no modification) and (2) applications built atop an ‘enhanced’ network stack with reusable software modules for common tasks such as session reconstruction and protocol parsing. We currently assume that applications using custom accelerators access these using their own libraries.
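To make the pshim’s role concrete, the sketch below is an illustrative user-level rendering of the steering logic (the actual pshim is a kernel-mode Click element, §6.2); the PShim/AppFn names and the callback interface are assumptions we made for the example.

```python
# Sketch (hypothetical interfaces): one pshim instance per NIC queue. Because a queue
# receives a single hyperapp's traffic, the shim only walks a fixed, ordered list of
# application callbacks and keeps no per-session state.
from typing import Callable, List, Optional

Packet = bytes
AppFn = Callable[[Packet], Optional[Packet]]   # returns None if the packet is dropped

class PShim:
    def __init__(self, hyperapp: List[AppFn]):
        self.hyperapp = hyperapp               # policy order for this queue's traffic class

    def process(self, pkt: Packet) -> Optional[Packet]:
        for app in self.hyperapp:
            pkt = app(pkt)
            if pkt is None:                    # e.g. dropped by a firewall/IDS stage
                return None
        return pkt

# Usage (hypothetical application handles): the queue carrying HTTP sessions runs
# firewall -> ids -> proxy in order.
#   shim = PShim([firewall.handle, ids.handle, proxy.handle]); shim.process(raw_packet)
```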

5.2 Parallelization on a CoMb box

We assume a CoMb box offers a number of parallel computation cores—such parallelism exists in general-purpose servers (e.g., our prototype server uses 8 Intel Xeon ‘Westmere’ cores) and is even more prevalent in specialized networking hardware (e.g., Cisco’s QuantumFlow packet processor offers 40 Tensilica cores). We now describe how we parallelize the functional layers described earlier on this underlying hardware.

Parallelizing the classifier: Since we assumed hardware classification, our classifier runs on the NIC and does not require parallelization across cores. We refer the reader to [28] for a discussion of how a software-based classifier might run on a multi-core system.

Parallelizing a single hyperapp: Recall that a hyperapp is a sequence of middlebox applications that need to process a packet. There are two options to map a logical hyperapp to the parallel hardware (Figure 7):
1. App-per-core: Each application in the hyperapp runs on a separate core and the pshim steers each packet between cores.
2. Hyperapp-per-core: All applications belonging to the hyperapp run on the same core. Hence, a given application module is cloned with as many instances as the number of hyperapps in which it appears.

The advantage of the hyperapp-per-core approach is that a packet is processed in its entirety on a single core, avoiding the overhead of inter-core communication and cache invalidations that may arise as shared state is accessed by multiple cores. (This overhead occurs more frequently for applications built to reuse processing modules in a common stack.) The disadvantage of the hyperapp-per-core relative to the app-per-core is that it could incur overhead due to context switches and potential contention over shared resources (e.g., data and instruction caches) on a single core. Which way the scale tips depends on the overheads associated with inter-core communication, context switches, etc., which vary across hardware platforms.

We ran several tests across different hyperapp scenarios on our prototype server (§7) and found that the hyperapp-per-core approach offered superior or comparable performance [35]. These results are also consistent with independent measurements on software routers [17]. In light of these results, we choose the hyperapp-per-core model because it simplifies how we parallelize the pshim (see below) and ensures core-local access to reusable modules and data structures.

Figure 8: CoMb box: putting the pieces together. NIC hardware queues Q1–Q5 feed per-core pshim instances; HyperApp1–HyperApp4 are assigned to cores, with HyperApp3 instantiated on two cores.

Parallelizing the pshim layer: Recall that we have a separate instance of a pshim for each hyperapp. Given the hyperapp-per-core approach, parallelizing the pshim layer is easy. We simply run the pshim instance on the same core as its associated hyperapp.

Parallelizing multiple hyperapps: We are left with one outstanding question. Given multiple hyperapps, how many cores, or what fraction of a core, should we assign to each hyperapp? For example, the total workload for some hyperapp might exceed the processing capacity of a single core. At the same time, we also want to avoid a skewed allocation across cores. This hyperapp-to-core mapping problem can be expressed as a simple linear program that assigns a fraction of the traffic for each hyperapp h to each core. (We do not show it for brevity; please refer to our technical report [35].) In practice, this calculation need not occur at the CoMb box, as the controller can also run this optimization and push the resulting configuration.

5.3 Recap and Discussion

Combining the previous design decisions brings us to the design in Figure 8. We see that:
• Incoming packets are classified at the NIC and placed into one of multiple NIC queues; each traffic class is assigned to one or more queues and different traffic classes are mapped to different queues.
• All applications within a hyperapp run on the same core. Hyperapps whose load exceeds a single core’s capacity are instantiated on multiple cores (e.g., HyperApp3 in Figure 8). Each core may be assigned one or more hyperapps.
• Each hyperapp instance has a corresponding pshim instance running on the same core and each pshim reads packets from a dedicated virtual NIC queue. For example, HyperApp3 in Figure 8 runs on Core2 and Core3 and has two separate pshims.⁷

The resulting design has several desirable properties conducive to achieving high performance:
• A packet is processed in its entirety on a single core (avoiding inter-core synchronization overheads).
• We introduce no shared data structures across cores (avoiding needless cache invalidations).
• There is no contention for access to NIC queues (avoiding the overhead of locking).
• Policy enforcement is lightweight (stateless and requiring no marking or modification of packets).

⁷ The traffic split between the two instances of HyperApp3 also occurs in the NIC using filters as in §6.1.

6 Implementation

Next, we describe how we prototype the different components of the CoMb architecture.

6.1 CoMb Controller

We implement the controller’s algorithms using an off-the-shelf solver (CPLEX). The controller runs a pre-processing step to generate the hyperapps and their effective resource footprints, taking into account the affinity of applications for specific accelerators. The controller periodically runs the optimization step that takes as inputs the current per-application-port traffic matrix (i.e., per ingress-egress pair), the traffic of interest to each application, the cross-application policy ordering constraints, and the resource footprints per middlebox module.

After running the optimization, it maps the d_{c,n} values to device-level configurations as follows. If the CoMb box supports in-hardware classification (like our prototype server) and has a sufficient number of filter entries, the controller maps the d_{c,n} values into a set of non-overlapping traffic filters. As a simple example, suppose c denotes all traffic from sources in 10.1.0.0/16 to destinations in 10.2.0.0/16, and d_{c,n1} = d_{c,n2} = 0.5. Then the filters are n1: 〈10.1.0.0/17, 10.2.0.0/16〉 and n2: 〈10.1.128.0/17, 10.2.0.0/16〉.⁸ One subtle issue is that the controller also installs filters corresponding to traffic in the reverse direction. Note that if each CoMb box is off-path, these filters can be pushed to the upstream router or switch.
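A sketch of this fraction-to-filter mapping, assuming (as footnote 8 notes) that traffic is uniform across sub-prefixes; unlike the /17 example above it carves the source prefix into fixed-size chunks, and the function name and chunking granularity are arbitrary choices for illustration.

```python
# Sketch (assumed semantics): turn d_{c,n} fractions for one class into non-overlapping
# source-prefix filters, assuming traffic is uniform across sub-prefixes.
import ipaddress

def split_by_fraction(src_prefix, fractions):
    """fractions: {node: d_{c,n}} summing to 1; returns {node: [sub-prefixes]}."""
    chunks = list(ipaddress.ip_network(src_prefix).subnets(prefixlen_diff=4))  # 16 chunks
    out, i = {}, 0
    for node, frac in fractions.items():
        take = round(frac * len(chunks))
        out[node] = chunks[i:i + take]
        i += take
    return out

print(split_by_fraction("10.1.0.0/16", {"n1": 0.5, "n2": 0.5}))
# n1 gets 10.1.0.0/20 ... 10.1.112.0/20, n2 gets 10.1.128.0/20 ... 10.1.240.0/20
```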

If the NIC does not support such expressive filters, or has a limited number of filter entries (e.g., if the number of prefix pairs is very high in a large network), the controller falls back to a hash-based configuration [37]. In this case, the basic classification to identify the required hyperapp (say, based on the port numbers) still happens at the NIC. The subsequent decision on whether this node is responsible for processing this session happens in the software pshim. Each device’s pshim does a fixed-length (/16) prefix lookup, computes a direction-invariant hash of the IP 5-tuple [39], and checks if this hash falls in its assigned range. For the above example, the configuration will be n1: 〈10.1.0.0/16, 10.2.0.0/16, hash ∈ [0, 0.5]〉 and n2: 〈10.1.0.0/16, 10.2.0.0/16, hash ∈ [0.5, 1]〉.

⁸ This simple example assumes a uniform distribution of traffic per prefix block. In practice, the prefixes can be weighted by expected traffic volumes inferred from past measurements.
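For the hash-based fallback, the pshim’s check can be sketched as follows; the direction-invariant hash below (sorting the endpoints and hashing with SHA-1) is a stand-in for the scheme cited from [39], and the function names are ours.

```python
# Sketch (stand-in hash): the pshim's hash-based check. A direction-invariant hash of the
# 5-tuple is mapped to [0, 1) and compared against this node's assigned range.
import hashlib

def direction_invariant_hash(src_ip, dst_ip, src_port, dst_port, proto):
    # Sort the endpoints so both directions of a session hash to the same value.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a}|{b}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return digest / 2**64

def should_process(pkt_tuple, assigned_range):
    lo, hi = assigned_range                  # e.g. (0.0, 0.5) for node n1
    return lo <= direction_invariant_hash(*pkt_tuple) < hi

print(should_process(("10.1.2.3", "10.2.4.5", 12345, 80, "tcp"), (0.0, 0.5)))
```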

Figure 9: Our modular middlebox implementation. Applications (flow monitor, signature IDS, cache, load balancer) are built on reusable modules for packet capture, session reconstruction, and HTTP/TFTP/NFS protocol parsing.

6.2 CoMb box prototype

We prototype a CoMb box on a general-purpose server (without accelerators) with two Intel(R) Xeon(R) ‘Westmere’ CPUs, each with four cores at 3.47 GHz (X5677), and 48 GB memory, configured with four Intel(R) 82599 10 GigE NIC ports [4], each capable of supporting up to 128 queues, running Linux (kernel v2.6.24.7).

Classification: We leverage the classification capabilities on the NIC. The NIC demultiplexes packets into separate in-hardware queues per hyperapp based on the filters from the controller. The 82599 NIC supports 32K classification entries over src/dst IP addresses, src/dst ports, protocol, VLAN header, and a flexible 2-byte tuple anywhere in the first 64 bytes of the packet [4]. We use the address and port fields to create filter entries.

Policy Enforcer: We implement the pshim in kernel-mode SMP-Click [30] following the design in §5. In addition to the policy routing, the pshim implements two additional functions: (1) creating interfaces for the application-level processes to receive and send packets (see below) and (2) the optional hash-based check to decide whether to process a specific packet.

6.3 CoMb applications

Our prototype supports two application architectures: modular middlebox applications written in Click and standalone middlebox processes (e.g., Snort, Squid).

Modular middlebox applications: As a proof-of-concept prototype, we implement several canonical middlebox applications: signature-based intrusion detection, flow-level monitoring, a caching proxy, and a load balancer, as (user-level) modules in Click, as shown in Figure 9. As such, our focus is to demonstrate the feasibility of building modular middlebox applications and establish the potential for reuse. We leave it to future work to explore the choice of an ideal software architecture and an optimal set of reusable modules.

To implement these applications, we port the logic for session reconstruction and protocol parsers (e.g., HTTP and NFS) from Bro [32]. We implement a custom flow monitoring system. Our signature-based IDS uses Bro’s signature matching module. We also built a custom Click module for parsing TFTP traffic. The load balancer is a layer-7 application that assigns HTTP requests to different backend servers by rewriting packets. The cache mimics actions in a caching proxy (i.e., storing and looking up requests in cache), but does not rewrite packets.

Application      Dependency chain    Contribution (%)
Flowmon          Session             73
Signature        Session             26
Load Balancer    HTTP, Session       88
Cache            HTTP, Session       54
Cache            NFS, Session        50
Cache            TFTP, Session       36

Table 2: Contribution of reusable modules

While Bro’s modular design made it a very useful starting point, its intended use is as a standalone IDS, while CoMb envisions reusing modules across multiple applications from different vendors. This leads to one key architectural difference. Modules in Bro are tightly integrated; lower layers are aware of the higher layers using them and “push” data to them. We avoid this tight coupling between the modules and instead implement a “pull” model where lower layers expose well-defined interfaces through which higher-layer functions obtain relevant data structures.
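The sketch below illustrates the “pull” interface in miniature: the session layer exposes an accessor that any higher-layer module can call, rather than pushing data to modules it knows about. The class names and methods are hypothetical, not the actual Click module interfaces.

```python
# Sketch (hypothetical classes): lower layers expose accessors ("pull") instead of pushing
# data to a fixed set of higher layers, so modules from different vendors can share them.
class SessionLayer:
    def __init__(self):
        self._sessions = {}                  # 5-tuple -> reassembled byte stream

    def update(self, five_tuple, payload):   # called once per packet by the shim
        self._sessions.setdefault(five_tuple, bytearray()).extend(payload)

    def stream(self, five_tuple):            # pulled by any higher-layer module
        return bytes(self._sessions.get(five_tuple, b""))

class HttpParser:
    def __init__(self, session_layer):
        self.sessions = session_layer

    def request_line(self, five_tuple):
        # Both the cache and the load balancer can pull from the same parser instance.
        data = self.sessions.stream(five_tuple)
        return data.split(b"\r\n", 1)[0] if data else None
```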

Supporting standalone applications: Last, we focus on how a CoMb box supports standalone middlebox applications (e.g., Snort, Squid). We run standalone applications as separate processes that can access packets in one of two modes. If we have access to the application source, we modify the packet capture routines; e.g., in Snort we replace libpcap calls with a memory read from a shared memory region into which the pshim copies packets. For applications where we do not have access to the source, we simply create virtual network interfaces and the pshim writes to these interfaces. The former approach is more efficient but requires source modifications; the latter is less efficient but allows us to run legacy software with no modifications.

7 Evaluation

Our evaluation addresses the following high-level questions regarding the benefits and overheads of CoMb:
• Single-box benefits: What reuse benefits can consolidation provide? (§7.1)
• Single-box overhead: Does consolidating applications affect performance and extensibility? (§7.2)
• Network-wide benefits: What benefits can network administrators realize using CoMb? (§7.3)
• Network-wide overhead: How practical and efficient is CoMb’s controller? (§7.4)

7.1 Potential for reuse

First, we measure the potential for processing reuse achievable by refactoring middlebox applications. As §3.2 showed, the savings from reuse depend both on the processing footprints of reusable modules and the expected amount of traffic overlap. Here, we focus only on the former and defer the combined effect to the network-wide evaluation (§7.3). We use real packet traces with full payloads for these benchmarks.⁹ Because we are only interested in the relative contribution, we run these benchmarks with a single user-level thread in Click. We use PAPI¹⁰ to measure the number of CPU cycles per packet that each module uses. Note that an application like Cache uses different processing chains (e.g., Cache-HTTP-Session vs. Cache-NFS-Session); the relative contribution depends on the exact sequence. Table 2 shows that the reusable modules contribute 26–88% of the overall processing across the different applications.

7.2 CoMb single-box performance

We tackle three concerns in this section: (1) What overhead does CoMb add for running individual applications? (2) Does CoMb scale well as traffic rates increase? and (3) Does application performance suffer when administrators want to add new functionality?

For the following experiments, we report throughput measurements using the same full-payload packet traces from §7.1 on our prototype CoMb server with two Intel Westmere CPUs, each with four cores at 3.47 GHz (X5677), and 48 GB memory. The results are consistent with other synthetic traces as well.

7.2.1 Shim Overhead

Recall from §6 that CoMb supports two types of middlebox software: (1) standalone applications (e.g., Snort), and (2) modular applications in Click. Table 3 shows the overhead of running a representative middlebox application from each class on a single core in our platform. We show two scenarios: one where all classification occurs in hardware (labeled shim-simple) and one where the pshim runs an additional hash-based check as discussed in §6 (labeled shim-hash). For middlebox modules in Click, shim-simple imposes zero overhead. Interestingly, the throughput for Snort is better than its native performance. The reason is that Click’s packet capture routines are more efficient than Snort’s native ones (libpcap or daq). We also see that shim-hash adds only a small overhead over shim-simple. This result confirms that running applications in CoMb imposes minimal overhead.

⁹ From https://domex.nps.edu/corp/scenarios/2009-m57/net/; we are not aware of other traces with full payloads.
¹⁰ http://icl.cs.utk.edu/papi/

Application architecture (instance)    Overhead (%)
                                       Shim-simple    Shim-hash
Standalone (Snort)                     -61            -58
Modular (IPSec)                        0              0.73
Modular (RE [11])                      0              0.62

Table 3: Performance overhead of the shim layer for different middlebox applications

Figure 10: Throughput (Mbps) vs. number of cores available, comparing CoMb and the VM-based setup.

7.2.2 Performance under consolidation

Next, we study the effect of adding more cores and adding more applications. For brevity, we only show results for shim-simple. For these experiments, we use a standalone application process using the Snort IDS. To emulate adding new functionality, we create duplicate instances of Snort. (We find similar results with heterogeneous applications too.) At a high level, we find that consolidation in CoMb does not introduce contention bottlenecks across applications.

As a point of comparison, we also evaluate a virtual appliance architecture [22], where each Snort instance runs in a separate VM on top of the Xen hypervisor. To provide high I/O throughput to the VM setup, we use the SR-IOV capability in the hardware [8] and the vSwitching capability of the NIC to transfer packets between application instances [4]. We confirmed that I/O was not a bottleneck; we were able to achieve a throughput of around 7.8 Gbps on a single VM with a single CPU core, which is consistent with state-of-the-art VM I/O benchmarks [40]. As in §5.2, we need to decide between the app-per-core vs. hyperapp-per-core design for the VM setup. We saw that app-per-core is roughly 2× better for the VM case because context switches between VMs are expensive and packet switching between VMs is in hardware (i.e., vSwitching). Thus, we conservatively use the app-per-core design for the VM setup.

Figure 10 shows the effect of adding more cores to the platform with a fixed hyperapp consisting of two Snort processes in sequence. We make three main observations. First, CoMb’s throughput with this real IDS/IPS is ≈ 10 Gbps on an 8-core platform, which is comparable to vendor datasheets [2]. Second, CoMb exhibits a reasonable scaling property similar to prior results on multi-core platforms [14]. This suggests that adapting CoMb to higher traffic rates simply requires more processing cores and does not need significant re-engineering. Finally, CoMb’s throughput is 5× better than the VM case. This performance gap arises out of a combination of three factors. First, Snort atop the pshim runs significantly faster than native Snort because Click offers more efficient packet capture, as we saw in Table 3. Second, running Snort inside a VM has roughly 30% lower throughput than native Snort. Third, our hyperapp model amortizes the fixed cost of copying packets into the application layer whereas the VM-based setup incurs multiple copies. While the performance of virtual network appliances is under active research, these results are consistent with benchmarks for virtual appliances [1].

We also evaluated the impact of running more applications per packet. The ideal degradation when we run k applications is a 1/k curve because running k applications needs k times as much work. We found that both CoMb and the VM-based setup have a near-ideal throughput degradation (not shown). This confirms that CoMb allows administrators to easily add new middlebox functionality in response to policy or workload changes.

7.3 Network-wide Benefits

Setup: Next, we evaluate the network-wide benefits that CoMb offers via reuse, multiplexing, and spatial distribution. For this evaluation, we use real topologies from educational backbones and the Enterprise network, and PoP-level AS topologies from Rocketfuel [38]. To obtain realistic time-varying traffic patterns, we use the following approach. We use traffic matrices for Internet2¹¹ to compute empirical variability distributions for each element in a traffic matrix; e.g., the probability that the volume is between 0.6 and 0.8 of the mean. Then, using these empirical distributions, we generate time-varying traffic matrices for the remaining AS-level topologies using a gravity model to capture the mean volume [34]. For the Enterprise network, we replay real traffic matrices.

In the following results, we report the benefits that CoMb provides relative to today’s standalone middlebox deployments with the four applications from Table 2: flow monitoring, load balancer, IDS, and cache. To emulate current deployments, we use the same applications but without reusing modules. For each application, we use public configurations to identify the application ports of the traffic they process. To capture changes in per-port volume over time, we replay the empirical variability based on flow-level traces from Internet2. We begin with a scenario where all four applications can be spatially distributed and then consider a scenario where two of the applications are topologically constrained.

Provisioning: We consider a provisioning exercise to minimize the resources needed to handle the time-varying traffic patterns generated as described earlier across 200 epochs. The metric of interest here is the relative savings that CoMb provides vs. today’s deployments where all applications run as independent devices only at the ingress: Cost_{standalone,ingress}/Cost_{CoMb}.

11 http://www.cs.utexas.edu/~yzhang/research/AbileneTM


Figure 11: Reduction in provisioning cost with CoMb


Figure 12: Impact of spatial distribution on CoMb’s reduction in provisioning cost

The Cost term here represents the total cost of provisioning the network to handle the given set of time-varying workloads (i.e., ∑_{n,r} Prov_{n,r} from §4).
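Written out, with Prov_{n,r} the amount of resource r provisioned at node n as in §4, the metric is:

\[ \text{Relative savings} = \frac{Cost_{\text{standalone,ingress}}}{Cost_{\text{CoMb}}}, \qquad Cost = \sum_{n}\sum_{r} Prov_{n,r}, \]

so a value of 2 means the consolidated deployment needs half the aggregate resources that an ingress-only, standalone deployment needs for the same time-varying workloads.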

We try two CoMb configurations: with and without reusable modules. In the latter case, the middlebox applications share the same hardware but not software. Figure 11 shows that across the different topologies, CoMb with reuse provides 1.8–2.5× savings relative to today’s deployment strategies. For the Enterprise setting, CoMb even without reuse provides close to 1.8× savings.

Figure 12 studies the impact of spatial distribution by comparing three strategies for distributing middlebox responsibilities: the full path (labeled Path), either the ingress or the egress (labeled Edge), or only the ingress (labeled Ingress). Interestingly, Edge is very close to Path. To explore this further, we also tried a strategy of picking a random second node for each path. We found that this is again very close to Path (not shown). In other words, for Edge the egress is not special; the key is having one more node over which to distribute the load. We conjecture that this is akin to the “power of two random choices” observation [29] and plan to explore this in future work.
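A toy balls-into-bins simulation, shown below, illustrates the intuition behind this conjecture. It is an analogy rather than our network model: the node count, flow count, and uniform flow sizes are arbitrary assumptions made for the example.

    import random

    def relative_max_load(num_nodes=50, num_flows=5000, choices=1, seed=0):
        # Assign each flow to the least-loaded of `choices` randomly sampled
        # nodes; return the maximum load normalized by the mean load.
        rng = random.Random(seed)
        load = [0] * num_nodes
        for _ in range(num_flows):
            candidates = [rng.randrange(num_nodes) for _ in range(choices)]
            load[min(candidates, key=lambda n: load[n])] += 1
        return max(load) / (num_flows / num_nodes)

    if __name__ == "__main__":
        # With a single random choice, the maximum sits noticeably above the
        # mean; with two choices per flow, it is very close to the mean.
        print("one choice :", round(relative_max_load(choices=1), 3))
        print("two choices:", round(relative_max_load(choices=2), 3))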

Load balancing: CoMb also allows middlebox deployments to better adapt to changing traffic workloads under a fixed provisioning strategy. Here, our metric of interest is the maximum load across the network, and we measure the relative benefit as MaxLoad_{standalone,ingress}/MaxLoad_{CoMb}. We consider two network-wide provisioning strategies, where each location is provisioned either with the same resources (labeled Uniform) or proportionally to the average volume it sees (labeled Volume). For the standalone case, we assume resources are split between applications proportional to their workload. Note that the combination of volume- and workload-proportional provisioning likely reflects current practice.
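Concretely, for each epoch the metric can be written as follows, where load_n and cap_n are shorthand for the traffic-induced load and the provisioned capacity at node n (the full definition follows the formulation in §4):

\[ \text{MaxLoad} = \max_{n} \frac{load_n}{cap_n}, \qquad \text{relative benefit} = \frac{\text{MaxLoad}_{\text{standalone,ingress}}}{\text{MaxLoad}_{\text{CoMb}}}. \]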


Figure 13: Relative reduction in the maximum load

Topology     Unconstrained   Two-step   Ingress-only
Internet2    1.81            1.62       1.41
Geant        2.20            1.71       1.42
Enterprise   2.58            1.76       1.45
AS1221       2.17            1.69       1.41
AS3257       1.85            1.63       1.42
AS1239       2.11            1.69       1.43

Table 4: Relative savings in provisioning when Cache and Load balancer are spatially constrained

We also consider the Uniform case because it is unclear if the proportional allocation strategy is always better for today’s deployments; e.g., it could be better on average, but have worse “tail” performance as shown in Figure 13.

As before, we generate time-varying traffic patterns over 200 epochs and measure the above relative load metric per epoch. For each topology, Figure 13 shows the distribution of this metric (across epochs) using a box-and-whiskers plot with the 25th, 50th, and 75th percentiles, and the minimum and maximum values. The result shows that CoMb reduces the maximum load by > 2×, and the reduction can be as high as 25×, confirming that CoMb can better handle traffic variability compared to current middlebox deployments.

Topological constraints: Next, we consider a scenario where some applications cannot be spatially distributed. Here, we constrain Cache and the Load balancer to run at the ingress of each path. One option in this scenario is to pin all middlebox applications to the ingress (to exploit reuse) but ignore spatial distribution. While CoMb provides non-trivial savings (1.4×) even in this case, this ignores opportunities for further savings. To this end, we extend the formulation from §4.2 to perform a two-step optimization. In the first step, we assign the topologically constrained applications to their required locations. In the second, we assign the remaining applications, which can be distributed, with a slight twist. Specifically, we reduce the hyperapp footprints at locations where they can reuse modules with the constrained applications. For example, if we have the hyperapp Cache-IDS, with Cache pinned to the ingress, we reduce the IDS footprint at the ingress. Table 4 shows that this two-step procedure improves the savings by 20–30% compared to an ingress-only solution.
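The two-step assignment can be sketched as follows. This is schematic Python: solve_assignment, the footprint table, and reuse_savings stand in for the actual optimization and reuse model of §4.2, and are assumptions of the sketch rather than our implementation.

    def two_step_assignment(paths, apps, constrained, footprint,
                            reuse_savings, solve_assignment):
        """Step 1: pin topologically constrained apps (e.g., Cache and Load
        balancer) to their required location. Step 2: place the remaining
        apps, after discounting their footprint at locations where they can
        reuse modules with the pinned apps."""
        placement = {}

        # Step 1: constrained applications go to the ingress of each path.
        for path in paths:
            for app in constrained:
                placement[(path, app)] = path.ingress

        # Step 2: shrink footprints where reuse with a pinned app is possible,
        # then run the usual (unconstrained) assignment on the rest.
        adjusted = dict(footprint)  # footprint[(path, app, node)] -> demand
        for path in paths:
            for app in apps:
                if app in constrained:
                    continue
                for pinned in constrained:
                    key = (path, app, path.ingress)
                    adjusted[key] -= reuse_savings(app, pinned, path)

        free_apps = [a for a in apps if a not in constrained]
        placement.update(solve_assignment(paths, free_apps, adjusted))
        return placement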

Topology     Path   Edge   Ingress
Internet2    0.87   0.87   0.54
Geant        1.49   1.25   0.55
Enterprise   1.02   1.02   0.54
AS1221       1.33   1.33   0.54
AS3257       0.68   0.68   0.55
AS1239       1.26   1.26   0.55

Table 5: Relative size of the largest CoMb box. A higher value here means that the standalone case needs a larger box compared to CoMb

Topology     #PoPs   Strawman-LP time (s)   hyperapp time (s)
Internet2    11      687.68                 0.05
Geant        22      3455.28                0.24
Enterprise   23      2371.87                0.25
AS3257       41      1873.32                0.78
AS1221       44      3145.77                1.08
AS1239       52      9207.78                1.58

Table 6: Time to compute the optimal solution

This preliminary analysis suggests that CoMb can work even when some applications are topologically constrained. As future work, we plan to explore a detailed analysis of such topological constraints.

Does CoMb need bigger boxes? A final concern is that consolidation may require “beefier” boxes; e.g., in the network core. To address this concern, Table 5 compares the processing capacity of the largest standalone box needed across the network to that of the largest CoMb box: Largest_{standalone}/Largest_{CoMb}. We see that the largest standalone box is actually larger than the largest CoMb box for many topologies. Even without distribution, the largest CoMb box is only 1/0.55 ≈ 1.8× the size of the largest standalone box, which is quite manageable.

7.4 CoMb controller performance

Last, we focus on two key concerns surrounding the performance of the CoMb network controller: (1) Is the optimization fast enough to respond to traffic dynamics (on the order of a few minutes)? and (2) How close to the theoretical optimum is our formulation from §4.2?

Table 6 shows the time to run the optimization from Section 4 using the CPLEX LP solver on a single core of an Intel(R) Xeon(TM) 3.2 GHz CPU. To put our formulation in context, we also show the time to solve an LP relaxation of a precise model that captures reuse and policy dependencies on a per-session and per-action basis [35]. (The precise model is an intractable discrete optimization problem; we use its LP relaxation as a proxy.) The result shows that our formulation is almost four orders of magnitude faster than the precise model. Given that we expect to recompute configurations on the order of a few minutes [20], the compute times (1.58 s for a 52-node topology) are reasonable.
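For a sense of problem structure, the toy LP below (minimize total provisioned capacity so that, in every epoch, each path’s load can be split across the nodes on its path without overloading any node) solves in milliseconds with an off-the-shelf solver. It is a deliberately simplified stand-in for the §4.2 formulation, with no reuse or policy-dependency terms; the model and all names in it are assumptions for illustration.

    import numpy as np
    from scipy.optimize import linprog

    def toy_provisioning_lp(paths, loads):
        # paths: list of node-id lists; loads[e][p]: load of path p in epoch e.
        nodes = sorted({n for path in paths for n in path})
        f_idx = {(p, n): i for i, (p, n) in enumerate(
            (p, n) for p, path in enumerate(paths) for n in path)}
        num_f = len(f_idx)
        cap_idx = {n: num_f + i for i, n in enumerate(nodes)}
        num_vars = num_f + len(nodes)

        c = np.zeros(num_vars)
        c[num_f:] = 1.0  # objective: minimize total provisioned capacity

        # Each path's processing fractions sum to 1.
        A_eq = np.zeros((len(paths), num_vars))
        for p, path in enumerate(paths):
            for n in path:
                A_eq[p, f_idx[(p, n)]] = 1.0
        b_eq = np.ones(len(paths))

        # Per-node, per-epoch capacity constraints: load routed to n <= cap_n.
        A_ub, b_ub = [], []
        for epoch in loads:
            for n in nodes:
                row = np.zeros(num_vars)
                for p, path in enumerate(paths):
                    if n in path:
                        row[f_idx[(p, n)]] = epoch[p]
                row[cap_idx[n]] = -1.0
                A_ub.append(row)
                b_ub.append(0.0)

        res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * num_vars, method="highs")
        return res.fun  # total capacity to provision

    if __name__ == "__main__":
        paths = [[0, 1], [1, 2], [0, 2]]            # three paths over three nodes
        loads = [[3.0, 2.0, 1.0], [1.0, 3.0, 2.0]]  # two epochs
        print(toy_provisioning_lp(paths, loads))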

We also measured the optimality gap between the precise formulation [35] and our practical approach over a range of scenarios. Across all topologies, the optimality gap is ≤ 0.19% for the load balancing and ≤ 0.1% for the provisioning (not shown). Thus, our formulation provides a tractable, yet near-optimal, alternative.

8 Discussion

The consolidated middlebox architecture we envision raises two deployment concerns that we discuss next.

First, CoMb changes existing business and operating models for middlebox vendors, as it envisions vendors that decouple middlebox software and hardware and also those who refactor their applications to exploit reuse. We believe that the qualitative (i.e., extensibility, reduced sprawl, and simplified management) and quantitative (i.e., lower provisioning costs and better resource management) advantages that CoMb offers will motivate vendors to consider product offerings in this space. Furthermore, evidence suggests vendors are already rethinking the software-hardware coupling and starting to offer software-only “virtual appliances” (e.g., [7]). We also speculate that some middlebox vendors may already internally have modular middlebox stacks. CoMb simply requires one or more of these vendors to provide open APIs to these modules to encourage further innovation.

Second, running multiple middlebox applications on the same platform raises concerns about isolation with respect to performance (e.g., contention for resources), security (e.g., the NIDS/firewall must not be compromised), and fault tolerance (e.g., a faulty application should not crash the whole system). With respect to performance, concurrent work shows that contention has minimal impact on throughput on x86 hardware for the types of network processing workloads we envision [16]. In terms of fault tolerance and security, process boundaries already provide some degree of isolation, and techniques such as containers can give stronger properties [5]. There are two challenges with such sandboxing. The first is ensuring the context-switching overheads are low. Second, even though CoMb without reusing modules provides significant benefits, it would be useful to provide isolation without sacrificing the benefits of reuse. We also note that running applications in user space can further insulate misbehaving applications. In this light, recent results showing the feasibility of high-performance network I/O in user space are promising [33].

9 Related Work

Integrating middleboxes: Previous work discusses mechanisms to better expose middleboxes to administrators (e.g., [6]). Similarly, Joseph et al. describe a switching layer for integrating middleboxes in datacenters [26]. CoMb focuses on the orthogonal problem of consolidating middlebox deployments.

Middlebox measurements: Studies have measured the end-to-end impact of middleboxes [10] and interactions with transport protocols [24]. The measurements in §2 and high-level opportunities in §3 appeared in an earlier workshop paper [36]. This work goes beyond the motivation to demonstrate a practical design and implementation, and quantifies the single-box and network-wide benefits of a consolidated middlebox architecture.

General-purpose network elements: There are many efforts to build commodity routers and switches using x86 CPUs [18, 19, 22], GPUs [23], and merchant switch silicon [27]. CoMb can benefit from these advances as well. It is worth noting that the challenges we address in CoMb also apply to these efforts, if the extensibility they enable leads to diversity in traffic processing.

Rethinking middlebox design: CoMb shares the motivation of rethinking middlebox design with FlowStream [22] and xOMB [12]. FlowStream presents a high-level architecture using OpenFlow for policy routing and runs each middlebox as a VM [22]. Unlike CoMb, a VM approach precludes opportunities for reuse. Further, as §7.2 shows, today’s VM-based solutions have considerably lower throughput. xOMB presents a software model for extensible middleboxes [12]. CoMb addresses network-wide and platform-level resource management challenges that arise with consolidation, which neither FlowStream nor xOMB seeks to address. CoMb also provides a more general management framework to support both modular and standalone middlebox functions.

Network management: CoMb’s controller follows in the spirit of efforts showing the benefits of centralization in routing, access control, and monitoring (e.g., [21, 31, 15]). The use of optimization arises in other network management applications. However, the reuse and policy dependencies that arise in the context of consolidating middlebox management create new challenges for management and optimization unique to our context.

10 Conclusions

We presented a new middlebox architecture called CoMb, which systematically explores opportunities for consolidation, both in building individual appliances and in managing an ensemble of these across a network. In addition to the qualitative benefits with respect to extensibility, ease of management, and reduction in device sprawl, consolidation provides new opportunities for resource savings via application multiplexing, software reuse, and spatial distribution. We addressed key resource management and implementation challenges in order to leverage these benefits in practice. Using a prototype implementation in Click, we show that CoMb reduces the network provisioning cost by up to 2.5×, decreases the load skew by up to 25×, and imposes minimal overhead for running middlebox applications.

Acknowledgments

We thank Katerina Argyraki, Sangjin Han, and our shepherd Miguel Castro for their feedback on the paper. We also thank Mihai Dobrescu and Vitaly Luban for helping us with the experiment setup. This work was supported in part by ONR grant N000141010155, NSF grants 0831245 and 1040626, and the Intel Science and Technology Center for Secure Computing.

11 References

[1] IBM Network Intrusion Prevention System Virtual Appliance. http://www-01.ibm.com/software/tivoli/products/virtualized-network-security/requirements.html.
[2] Big-IP hardware datasheet. www.f5.com/pdf/products/big-ip-platforms-ds.pdf.
[3] Crossbeam network consolidation. http://www.crossbeam.com/why-crossbeam/network-consolidation/.
[4] Intel 82599 10 Gigabit Ethernet. http://download.intel.com/design/network/datashts/82599_datasheet.pdf.
[5] Linux containers. http://lxc.sourceforge.net/man/lxc.html.
[6] Middlebox Communications (MIDCOM) Protocol Semantics. RFC 3989.
[7] Silver Peak releases software-only WAN optimization. http://www.networkworld.com/news/2011/071311-silver-peak-releases-software-only-wan.html.
[8] SR-IOV. http://www.pcisig.com/specifications/iov/single_root/.
[9] World Enterprise Network and Data Security Markets. http://www.abiresearch.com/press/3591-Enterprise+Network+and+Data+Security+Spending+Shows+Remarkable+Resilience.
[10] M. Allman. On the Performance of Middleboxes. In Proc. IMC, 2003.
[11] A. Anand, A. Gupta, A. Akella, S. Seshan, and S. Shenker. Packet Caches on Routers: The Implications of Universal Redundant Traffic Elimination. In Proc. SIGCOMM, 2008.
[12] J. Anderson. Consistent Cloud Computing Storage as the Basis for Distributed Applications, chapter 7. University of California, San Diego, 2012.
[13] T. Benson, A. Akella, and A. Shaikh. Demystifying Configuration Challenges and Trade-offs in Network-based ISP Services. In Proc. SIGCOMM, 2011.
[14] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proc. OSDI, 2010.
[15] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and J. van der Merwe. Design and Implementation of a Routing Control Platform. In Proc. NSDI, 2005.
[16] M. Dobrescu, K. Argyraki, and S. Ratnasamy. Toward Predictable Performance in Software Packet-Processing Platforms. In Proc. NSDI, 2012.
[17] M. Dobrescu, K. Argyraki, M. Manesh, G. Iannaccone, and S. Ratnasamy. Controlling Parallelism in Multi-core Software Routers. In Proc. PRESTO, 2010.
[18] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: Exploiting Parallelism to Scale Software Routers. In Proc. SOSP, 2009.
[19] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy. Towards High Performance Virtual Routers on Commodity Hardware. In Proc. CoNEXT, 2008.
[20] A. Feldmann, A. G. Greenberg, C. Lund, N. Reingold, J. Rexford, and F. True. Deriving Traffic Demands for Operational IP Networks: Methodology and Experience. In Proc. ACM SIGCOMM, 2000.
[21] A. Greenberg, G. Hjalmtysson, D. A. Maltz, A. Meyers, J. Rexford, G. Xie, H. Yan, J. Zhan, and H. Zhang. A Clean Slate 4D Approach to Network Control and Management. ACM SIGCOMM CCR, 35(5), Oct. 2005.
[22] A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, L. Mathy, and P. Papadimitriou. Flow Processing and the Rise of Commodity Network Hardware. ACM SIGCOMM CCR, Apr. 2009.
[23] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: a GPU-Accelerated Software Router. In Proc. SIGCOMM, 2010.
[24] M. Honda, Y. Nishida, C. Raiciu, A. Greenhalgh, M. Handley, and H. Tokuda. Is It Still Possible to Extend TCP? In Proc. IMC, 2011.
[25] D. Joseph and I. Stoica. Modeling Middleboxes. IEEE Network, 2008.
[26] D. A. Joseph, A. Tavakoli, and I. Stoica. A Policy-aware Switching Layer for Data Centers. In Proc. SIGCOMM, 2008.
[27] G. Lu, C. Guo, Y. Li, Z. Zhou, T. Yuan, H. Wu, Y. Xiong, R. Gao, and Y. Zhang. ServerSwitch: A Programmable and High Performance Platform for Data Center Networks. In Proc. NSDI, 2011.
[28] Y. Ma, S. Banerjee, S. Lu, and C. Estan. Leveraging Parallelism for Multi-dimensional Packet Classification on Software Routers. In Proc. SIGMETRICS, 2010.
[29] M. Mitzenmacher, A. W. Richa, and R. Sitaraman. The Power of Two Random Choices: A Survey of Techniques and Results. Handbook of Randomized Computing, 2000.
[30] R. Morris, E. Kohler, J. Jannotti, and M. F. Kaashoek. The Click Modular Router. SIGOPS Operating Systems Review, 33(5):217–231, 1999.
[31] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.
[32] V. Paxson. Bro: A System for Detecting Network Intruders in Real-Time. In Proc. USENIX Security Symposium, 1998.
[33] L. Rizzo, M. Carbone, and G. Catalli. Transparent Acceleration of Software Packet Forwarding Using netmap. In Proc. INFOCOM, 2012.
[34] M. Roughan. Simplifying the Synthesis of Internet Traffic Matrices. ACM SIGCOMM CCR, 35(5), 2005.
[35] V. Sekar, N. Egi, S. Ratnasamy, M. K. Reiter, and G. Shi. Design and Implementation of a Consolidated Middlebox Architecture. Technical Report UCB/EECS-2011-110, EECS Department, University of California, Berkeley, Oct. 2011.
[36] V. Sekar, S. Ratnasamy, M. K. Reiter, N. Egi, and G. Shi. The Middlebox Manifesto: Enabling Innovation in Middlebox Deployments. In Proc. HotNets, 2011.
[37] V. Sekar, M. K. Reiter, W. Willinger, H. Zhang, R. Kompella, and D. G. Andersen. cSamp: A System for Network-Wide Flow Monitoring. In Proc. NSDI, 2008.
[38] N. Spring, R. Mahajan, and D. Wetherall. Measuring ISP Topologies with Rocketfuel. In Proc. ACM SIGCOMM, 2002.
[39] M. Vallentin, R. Sommer, J. Lee, C. Leres, V. Paxson, and B. Tierney. The NIDS Cluster: Scalable, Stateful Network Intrusion Detection on Commodity Hardware. In Proc. RAID, 2007.
[40] H. Zhiteng. I/O Virtualization Performance. http://xen.org/files/xensummit_intel09/xensummit2009_IOVirtPerf.pdf.