
Design and Implementation of a Consolidated

Middlebox Architecture

Vyas Sekar, Norbert Egi, Sylvia Ratnasamy, Michael Reiter, Guangyu Shi

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2011-110

http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-110.html

October 6, 2011


Copyright © 2011, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.


Design and Implementation of a Consolidated Middlebox Architecture

Vyas Sekar∗, Norbert Egi††, Sylvia Ratnasamy†, Michael K. Reiter⋆, Guangyu Shi††

∗ Intel Labs, † UC Berkeley, ⋆ UNC Chapel Hill, †† Huawei

Abstract

Most network deployments respond to changing application, workload, and policy requirements via the deployment of specialized network appliances or "middleboxes". Today, however, middlebox platforms are expensive and closed systems, with little/no hooks for extensibility. Furthermore, they are acquired from independent vendors and deployed as standalone devices with little cohesiveness in how the ensemble of middleboxes is managed. As network requirements continue to grow in both scale and variety, this bottom-up approach leads middlebox deployments on a trajectory of growing device sprawl with corresponding escalation in capital and management costs.

To address this challenge, we present CoMb, a new architecture for middlebox deployments that systematically applies the design principle of consolidation, both at the level of building individual middleboxes and in managing a network of middleboxes. This paper addresses key resource management and implementation challenges that arise in exploiting the benefits of consolidation in middlebox deployments. Using a prototype implementation in Click, we show that CoMb can reduce the network provisioning cost by up to 2.5× and reduce the load imbalance in a network by up to 25×.

1 Introduction

Network appliances or "middleboxes" such as WAN optimizers, proxies, intrusion detection and prevention systems, network- and application-level firewalls, caches, and load balancers have found widespread adoption in modern networks. Several studies report on the rapid growth of this market; the market for network security appliances alone was estimated to be 6 billion dollars in 2010 and is expected to rise to 10 billion in 2016 [10]. In other words, middleboxes are a critical part of today's networks and it is reasonable to expect that they will remain so for the foreseeable future.

Somewhat surprisingly then, there has been relatively little research on how middleboxes are built and deployed. Today's middlebox infrastructure has developed in a largely uncoordinated manner – a new form of middlebox typically emerges as a one-off solution to a specific need and is "patched" into the infrastructure through ad-hoc and often manual techniques.

This bottom-up approach leads to two serious forms of inefficiency. The first is inefficiency in the use of infrastructure hardware resources. Middlebox applications are typically resource intensive and each middlebox is independently provisioned for peak load. Today, because each middlebox is deployed as a separate device, these resources cannot be amortized across applications even though their workloads offer natural opportunities to do so (we elaborate on this in Section 3). Second, a bottom-up approach leads to inefficiencies in management; today, each type of middlebox application has its own custom configuration interface, with no hooks or tools that offer network administrators a unified view by which to manage middleboxes across the network.

As middlebox deployments continue to grow in both scale and variety, these inefficiencies are increasingly problematic – middlebox infrastructure is on a trajectory of growing device sprawl with corresponding escalation in capital and management costs. In Section 2, we present measured and anecdotal evidence that highlights these concerns in a real-world enterprise environment.

This paper presents CoMb,¹ a top-down design for middlebox infrastructure that aims to tackle the above inefficiencies. The key observation in CoMb is that these inefficiencies arise because middleboxes today are built and managed as standalone devices. To address this, we turn to the age-old design principle of consolidation and systematically re-architect middlebox infrastructure to exploit opportunities for consolidation. Corresponding to the inefficiencies, CoMb targets consolidation at two levels:

1. Individual middleboxes: In contrast to standalone, specialized middleboxes, CoMb decouples hardware and software, thus enabling software-based implementations of middlebox applications to run on a consolidated hardware platform.²

2. Managing an ensemble of middleboxes: CoMb consolidates the management of different middlebox applications/devices into a single (logically) centralized controller that takes a unified, network-wide view – generating configurations and accounting for policy requirements across all traffic, all applications, and all network locations. This architecture stands in contrast to today's approach where each middlebox application and/or device is managed independently.

Consolidation is, of course, a well-known system design principle. Likewise, in a general context, the above strategies are not new – e.g., there is a growing literature on centralized network management (e.g., [14, 33, 24, 23]), and software consolidation is commonly used in data centers. To our knowledge, however, there has been no work on quantifying the benefits of consolidation for middlebox infrastructure, nor any in-depth attempt to re-architect middleboxes (at both the device- and network-level) to exploit consolidation.

¹The name CoMb captures our goal of Consolidating Middleboxes.
²As we discuss in Section 4, this hardware platform can comprise both general-purpose and specialized components.

Consolidation effectively "de-specializes" middlebox infrastructure since it forces greater modularity and extensibility. Typically, moving from a specialized architecture to one that is more general results in less, not more, efficient resource utilization. We show however (in Section 3) that consolidation creates new opportunities for efficient use of hardware resources. For example, within an individual box, we can reduce resource requirements by leveraging (previously unexploitable) opportunities to multiplex hardware resources and reuse processing modules across different applications. Similarly, consolidating middlebox management into a network-wide view exposes the option of spatially distributing middlebox processing to use resources at different locations.

However, the benefits of consolidation come with challenges. The primary challenge is that of resource management, since middlebox hardware resources are now shared across multiple heterogeneous applications and across the network. We thus need a resource management solution that matches demands (i.e., what subset of traffic is to be processed by each application, what resources are required by different applications) to resource availability (e.g., CPU cycles and memory at various network locations). In Section 4 and Section 5, we develop a hierarchical strategy that operates at two levels – network-wide and within an individual box – to ensure the network's traffic processing demands are met while minimizing resource consumption.

We prototype a CoMb network controller leveraging off-the-shelf optimization solvers. We build a prototype CoMb middlebox platform using Click [31] running on general-purpose server hardware. As test applications we use: (i) existing software implementations of middlebox applications (that we use with little/no modification) and (ii) applications that we implement using a modular datapath (the latter developed to capture the benefits of processing reuse). Using our prototype and trace-driven evaluations, we show that:

• At a network-wide level, CoMb reduces aggregate resource consumption by a factor of 1.8–2.5× or reduces the maximum per-box load by a factor of 2–25×, for a range of real-world scenarios.

• Within an individual box, CoMb imposes minimal overhead for existing middlebox applications – in the worst case, we record a 0.7% performance drop relative to running the same applications independently on dedicated hardware.

Appliance type                 Number
Firewalls                      166
NIDS                           127
Conferencing/Media gateways    110
Load balancers                 67
Proxy caches                   66
VPN devices                    45
WAN optimizers                 44
Voice gateways                 11
Middleboxes total              636
Routers                        ≈ 900

Table 1: Devices in the enterprise network

Roadmap: In the rest of the paper, we begin with a motivating scenario in Section 2. Section 3 highlights the new efficiency opportunities with CoMb, before Section 4 describes the design of the network controller. We describe the design of each CoMb box in Section 5 and our prototype implementation in Section 6. We evaluate the benefits and potential overheads of CoMb in Section 7. We discuss outstanding issues in Section 8 and related work in Section 9, before concluding in Section 10.

2 Motivation

We begin with anecdotal evidence in support of our claim that middlebox deployments constitute a vital component in modern networks, and of the challenges that arise therein. Our observations are based on a study of middlebox deployment in a large enterprise network and discussions with the enterprise's administrators. The enterprise spans tens of sites and serves more than 80K users [39].

Table 1 summarizes the types and numbers of different middleboxes in the enterprise. We see that the total number of middleboxes is comparable to the number of routers! Middleboxes are thus a vital portion of the enterprise's network infrastructure. We further see a large diversity in the types of middleboxes; studies suggest similar diversity in ISPs and datacenters as well [37, 22].

The administrators indicated that middleboxes represent a significant fraction of their (network) capital expenses and expressed the belief that processing complexity contributes to high capital costs. They further expressed concern over the anticipated mounting costs. Two nuggets emerged from their concerns. First, they revealed that each class of middleboxes is currently managed by a dedicated team of administrators. This is in part because the enterprise uses different vendors for each application in Table 1; the understanding required to manage and configure each class of middlebox leads to inefficient use of administrator expertise and significant operational expense. The lack of high-level configuration interfaces further exacerbates the problem. For example, significant effort was required to manually tune what subset of traffic should be directed to the WAN optimizers to balance the tradeoff between the bandwidth savings and appliance load. The second nugget of interest was their concern that market trends in the "consumerization" of devices (e.g., smartphones, tablets) increase the need for in-network capabilities [10]. The lack of extensibility in middleboxes today inevitably leads to further appliance sprawl, with an associated increase in capital and operating expenses.

Despite these concerns, administrators reiterated the value they find in such appliances, particularly in supporting new applications (e.g., teleconferencing), increasing security (e.g., IDS), and improving performance (e.g., WAN optimizers).

3 CoMb: Overview and Opportunities

The previous discussion shows that even though middleboxes form a critical part of the network infrastructure, they remain expensive, closed platforms that are difficult to extend and difficult to manage. This motivates us to rethink how middleboxes are designed and managed. We envision an alternative architecture, called CoMb, wherein software-centric implementations of middlebox applications are consolidated to run on a shared hardware platform and managed in a logically centralized manner (see Figure 4).

The qualitative benefits of this proposed architecture are easy to see. Software-based solutions reduce the cost and development cycles to build and deploy new middlebox applications (as independently argued in parallel work [18]). Consolidating multiple applications on the same physical platform reduces device sprawl, and we already see early commercial offerings leveraging this [9, 4]. Finally, the use of centralization to simplify network management is also well known [24, 23, 14].

While the qualitative appeal is evident, there are practical concerns with respect to efficiency. Typically, moving from a monolithic, specialized architecture to one that is more general and extensible results in less efficient resource utilization. However, as we show next, CoMb introduces new efficiency opportunities that do not arise with today's middlebox deployments.

3.1 Application multiplexing

Consider a WAN optimizer and an IDS running at an enterprise site. The former optimizes file transfers between two enterprise sites and may see peak load at night when system backups are run. In contrast, the IDS may see peak load during the day because it monitors users' web traffic. Suppose the volumes of traffic processed by the WAN optimizer and the IDS at two time instants t1, t2 are 10, 50 packets and 50, 10 packets, respectively. Today each hardware device must be provisioned to handle a peak load of max{10,50} = 50. A CoMb box, running both the WAN optimizer and the IDS on the same hardware platform, can flexibly allocate resources as the load varies. Thus, it only needs to be provisioned to handle the peak total load of 60 packets, i.e., 40% fewer resources.

Figure 1: Middlebox utilization peaks at different times (normalized utilization over time for a WAN optimizer, proxy, load balancer, and firewall).

Figure 1 shows a time series of the utilization of four middleboxes at an enterprise site, each normalized by its maximum observed value. If NormUtil_app^t is the normalized utilization of device app at time t, then to quantify the benefits of multiplexing we compare the sum of the per-device peaks, ∑_app max_t {NormUtil_app^t} = 4, with the peak of the total, max_t {∑_app NormUtil_app^t} = 2.86. Thus, in Figure 1, multiplexing requires (4 − 2.86)/4 = 28% fewer resources.
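To make the arithmetic concrete, here is a minimal sketch (in Python, with made-up utilization traces; not part of the paper's implementation) that computes the two provisioning baselines compared above: the sum of per-application peaks versus the peak of the aggregate.

```python
# Sketch: quantify multiplexing savings from per-application utilization
# time series, as in Section 3.1. The traces below are made-up examples.

def multiplexing_savings(util):
    """util: dict app -> list of normalized utilization samples over time."""
    apps = list(util)
    timesteps = range(len(next(iter(util.values()))))
    # Today: each appliance is provisioned for its own peak.
    sum_of_peaks = sum(max(util[a]) for a in apps)
    # CoMb: one shared platform provisioned for the peak aggregate load.
    peak_of_sum = max(sum(util[a][t] for a in apps) for t in timesteps)
    return (sum_of_peaks - peak_of_sum) / sum_of_peaks

example = {
    "wan_opt": [0.2, 0.3, 1.0, 0.9],   # peaks at night (backups)
    "ids":     [1.0, 0.8, 0.3, 0.2],   # peaks during the day (web traffic)
}
print(f"savings: {multiplexing_savings(example):.0%}")
```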

3.2 Reusing software elements

Each middlebox typically needs low-level modules for packet capture, parsing headers, reconstructing flow/session state, parsing application-layer protocols, and so on. If the same traffic is processed by many applications – e.g., HTTP traffic is processed by an IDS, a proxy, and an application firewall – each appliance has to repeat these common actions for every packet. When these applications run on a consolidated platform, we can reuse these basic modules (Figure 2).

Consider an IDS and a proxy. Both need to reconstruct session- and application-layer state before running higher-level actions. Suppose each device needs 1 unit of processing per packet, with these common tasks contributing 50% of the processing. Both appliances process HTTP traffic, but may also process traffic unique to their own context; e.g., the IDS processes UDP traffic which the proxy ignores. Suppose there are 10 UDP packets and 45 HTTP packets. The total resource requirement is (IDS = 10 + 45) + (Proxy = 45) = 100 units. The setup in Figure 2 avoids duplicating the common tasks for HTTP traffic and saves 45 ∗ 0.5 = 22.5 units, i.e., 22.5% fewer resources.
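As a sanity check on this arithmetic, a minimal sketch (the per-packet cost of 1 unit and the 50% common fraction are the illustrative assumptions of the example above, not measurements):

```python
# Sketch: the reuse arithmetic of Section 3.2 with the text's numbers
# (1 unit/packet per application, half of which is common/reusable work).

COMMON = 0.5                                   # reusable fraction of 1 unit
pkts = {"udp": 10, "http": 45}                 # packets per traffic class
subscribers = {"udp": ["ids"], "http": ["ids", "proxy"]}

standalone = sum(n * len(subscribers[c]) for c, n in pkts.items())    # 100
# Consolidated: common work runs once per packet, app-specific work per app.
consolidated = sum(n * (COMMON + (1 - COMMON) * len(subscribers[c]))
                   for c, n in pkts.items())                           # 77.5

print(standalone, consolidated,
      f"savings={(standalone - consolidated) / standalone:.1%}")       # 22.5%
```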

To measure the traffic overlap, we obtain the (public) configurations for Bro [34] and Snort [1] and the (private) configuration for a WAN optimizer. Then, using flow-level traces (from Internet2), we find that the traffic overlap between applications is typically 64–99% [39]. We are not aware of middlebox vendors with reusable modules, and data on their software design is hard to obtain. Our benchmarks from Section 7.1 show that the common modules can contribute 26–88% of the processing across applications.

Figure 2: Reusing modules across middlebox applications (VPN, WAN optimizer, IDS, proxy, and firewall applications sharing session management and protocol parsers).

Figure 3: Spatial distribution as traffic changes. Paths P1 (N1→N2), P2 (N2→N3), and P3 (N3→N1) each carry 15 units of traffic except for a peak of 30 units at times T2, T3, and T1 respectively; the per-node assignments shown split each path's traffic so that every node processes exactly 20 units at every time.

3.3 Spatial distribution

Consider the topology in Figure 3 with three nodes N1–N3 and three end-to-end paths P1–P3. The traffic on these paths peaks to 30 packets at different times, as shown. Suppose we want all traffic to be monitored by IDSes. The default deployment is an IDS at each ingress N1, N2, and N3 for monitoring traffic on P1, P2, and P3, respectively. Each such IDS needs to be provisioned to handle the peak volume of 30 units, with a total network-wide cost of 90 units.

With a centralized network-wide view, however, we can spatially distribute the IDS responsibilities. That is, each IDS at N1–N3 processes a fraction of the traffic on the paths traversing the node (e.g., [38]).³ For example, at time T1, N1 uses 15 units for P1 and 5 for P3; N2 uses 15 units for P2 and 5 for P3; and N3 devotes all 20 units to P3. We can generate similar configurations for the other times, as shown in Figure 3. Thus, distribution reduces the total provisioning cost by (90 − 60)/90 = 33% compared to an ingress-only deployment. Note that this is orthogonal to application multiplexing and software reuse.

³Here, we assume that IDSes are "on-path" or that their upstream routers redirect packets to them [16].

Figure 4: The network controller assigns processing responsibilities to each CoMb box. Each box runs extensible and standalone applications over shared session and protocol modules, driven by the AppSpec, BoxSpec, and NetworkSpec inputs.

Using time-varying traffic matrices from Internet2 and the Enterprise network, we find that spatial distribution can provide 33–55% savings in practice.
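A small sketch of the ingress-only vs. distributed provisioning comparison for the Figure 3 example (traffic values are the illustrative ones above; the balanced per-node split of 20 units shown in Figure 3 is what makes the distributed bound achievable here):

```python
# Sketch: provisioning cost of an ingress-only IDS deployment vs. the
# spatially distributed assignment of Figure 3 (Section 3.3).

traffic = {                       # path -> volume at times T1, T2, T3
    "P1": [15, 30, 15],
    "P2": [15, 15, 30],
    "P3": [30, 15, 15],
}

# Today: one IDS per ingress, each provisioned for its own path's peak.
ingress_only = sum(max(v) for v in traffic.values())              # 90 units

# CoMb: traffic on each path may be split across the nodes it traverses.
# In this example, the split in Figure 3 keeps every node at 20 units in
# every epoch, so the network only needs the peak aggregate load.
distributed = max(sum(v[t] for v in traffic.values()) for t in range(3))

print(f"ingress-only={ingress_only}, distributed={distributed}, "
      f"savings={1 - distributed / ingress_only:.0%}")             # 33%
```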

3.4 CoMb Overview

Building on these opportunities, we envision the architecture in Figure 4. Each middlebox in CoMb runs multiple software-based applications (e.g., IDS, proxy, firewall). These applications can be obtained from independent vendors and could differ in their software architectures (e.g., standalone vs. modular). CoMb's network controller assigns processing responsibilities across the network. Each CoMb middlebox receives this configuration and allocates hardware resources to the different applications.

4 CoMb Network Controller Design

In this section, we describe the design of CoMb's network controller and the management problem it solves to assign network-wide middlebox responsibilities.

4.1 Input Parameters

We begin by describing the three high-level inputs that the network controller needs.

• AppSpec: For each application m (e.g., IDS, proxy, firewall), this specifies: (1) T_m, the traffic that m needs to run on (e.g., what ports and prefixes), and (2) policy constraints that the administrator wants to enforce across different applications. These constraints specify relationships of the form m ≺ m′ on the order in which different applications need to run [21]. For example, all web traffic should first go through a firewall, then an IDS, and finally a web proxy. Most applications today (e.g., firewalls, load balancers, IDSes, proxies, WAN optimizers) operate at a session or connection-level granularity, and we assume each m operates at this granularity.⁴

• NetworkSpec: This has two components: (1) a description of the end-to-end routing paths and the location of the middlebox nodes on each path, and (2) a specification of the different types of traffic T partitioned into classes T = ⋃_c T_c. Each class c can be a high-level description of the form "port-80 sessions initiated by hosts at ingress A to servers in egress B" or be described by more precise traffic filters defined on the IP 5-tuple (e.g., srcIP=10.1.∗.∗, dstIP=10.2.∗.∗, dstport=80, srcport=*). For brevity, we assume each class T_c has a single end-to-end path, with the forward and reverse flows within a session following the same path (in opposite directions).⁵ Each application m subscribes to one or more of these traffic classes; i.e., T_m ∈ 2^T.

• BoxSpec: This captures the hardware capabilities of the middlebox platform: Prov_{n,r} is the amount of resource r (e.g., CPU, memory) that node n is provisioned with, in units suitable for that resource. Each platform may also (optionally) support specialized accelerators (e.g., GPU units or crypto co-processors). Given the hardware configurations, we also need the (expected) per-session resource footprint, on resource r, of running an application m. Each m may have some affinity for hardware accelerators; e.g., some IDSes use hardware-based DPI. These requirements may be strict (i.e., the application only works with hardware support) or opportunistic (i.e., offloading for better performance). Now, the middlebox hardware at each node n may or may not have such accelerators. Thus, we use generalized resource footprints F_{m,r,n} that depend on the specific middlebox node to account for the presence/absence of hardware accelerators. For example, the footprint will be higher on a node without an optional hardware accelerator, where the application needs to emulate this feature in software.

In practice, these inputs are already available or easy to obtain. The NetworkSpec for routing and traffic information is already collected for other network management applications such as traffic engineering or anomaly detection [13]. The traffic classes and policy constraints in AppSpec and the hardware capacities Prov_{n,r} are known to administrators; we simply require that these be made available to the network controller. The only component that imposes new effort is the set of F_{m,r,n} values in BoxSpec. These can be obtained by running offline benchmarks similar to Section 7; even this effort is required infrequently (e.g., only after hardware upgrades).

⁴It is easy to extend to applications that operate at a per-packet or per-flow granularity; we do not discuss this for brevity.
⁵We discuss how to handle multiple/asymmetric paths in Section 8.
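To make the shape of these inputs concrete, here is a minimal sketch of the three specifications as plain records (field names are illustrative; the paper does not prescribe a schema):

```python
# Sketch: the controller's three inputs (Section 4.1) as plain records.
# Field names are illustrative; the paper does not prescribe a schema.
from dataclasses import dataclass, field

@dataclass
class AppSpec:
    name: str                       # e.g., "ids", "proxy"
    traffic: set                    # T_m: traffic classes m must process
    must_run_after: set = field(default_factory=set)   # m' with m' ≺ m

@dataclass
class NetworkSpec:
    paths: dict                     # class c -> ordered list of nodes on its path
    volume: dict                    # class c -> |T_c|

@dataclass
class BoxSpec:
    prov: dict                      # (node n, resource r) -> Prov_{n,r}
    footprint: dict                 # (app m, resource r, node n) -> F_{m,r,n}
```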

4.2 A strawman formulation

Given these inputs, the controller's goal is to assign processing responsibilities to middleboxes such that all policy requirements are satisfied. That is, each class of traffic is processed by the required sequence of applications. At the same time, we want to ensure that each node operates within its provisioned capacity and that the processing load is balanced across the network.

We begin with a strawman formulation of the management problem involved here. Even though this strawman will not be practical, it is a useful exercise because it highlights the key constraints and parameters involved, and it establishes a theoretically optimal baseline against which to evaluate practical approximations.

At a high level, we need to decide whether middlebox n runs application m on a session i. We can capture this using a {0,1} decision variable for each n, i, m combination. Doing so, however, ignores the potential for reusing common actions (e.g., session reassembly) across applications. To capture reuse, we decompose each application m into its constituent actions, some of which are application-specific (and hence non-reusable) and others of which are common/reusable (as in Figure 2). We introduce {0,1} decision variables a_{i,n} that specify whether node n runs action a on session i.

As Figure 2 shows, each action a may run on top of other lower-layer actions. Thus, if a depends on a lower-layer action a′, denoted a ⊐ a′ (e.g., IDS analysis depends on session reconstruction), then a can occur on a node only if this node has already run a′ for this session. Formally,

∀i, n, ∀a ⊐ a′ : a_{i,n} ≤ a′_{i,n}    (1)

Next, we model the processing requirements for each application m. Let n ∈ path_c denote that node n is on the routing path for the traffic in T_c. For simplicity, we assume that each m can be run anywhere along its path. (Section 7.3 presents an extension for when some applications have topological placement constraints.) Hence, we need to ensure that each session of interest to m has been processed by an instance of m somewhere along the path. For convenience, we combine the non-reusable actions for each m and express these constraints only for this aggregate action; let a^m denote the aggregate non-reusable action for m. Thus,

∀m, ∀T_c ∈ T_m, ∀i ∈ T_c : ∑_{n ∈ path_c} a^m_{i,n} = 1    (2)

Now, we also need to model policy dependencies across applications. Suppose we have a policy constraint m ≺ m′ (e.g., firewall before proxy). Let <_c capture the on-path ordering between nodes on the route for class c.⁶ Then, we need to ensure that some upstream node on the path has already run m before we can run m′:

∀m ≺ m′, ∀T_c ∈ T_m ∩ T_{m′}, ∀i ∈ T_c, ∀n ∈ path_c : a^{m′}_{i,n} ≤ ∑_{n′ ∈ path_c, n′ <_c n} a^m_{i,n′}    (3)

Last, we need to model the resource consumption on each middlebox. Because we decomposed each m into its constituent actions, we need to correspondingly split the resource footprints F_{m,r,n}. Let F_{a,r,n} be the per-session resource footprint of action a on resource r (e.g., CPU, memory), defined in units suitable for the resource. As discussed earlier, we allow F_{a,r,n} to differ across the nodes to account for differences in their (specialized) hardware capabilities. To capture strict requirements, where some a cannot run without a specific hardware accelerator, we use a simple preprocessing step that sets the F values for nodes without this accelerator to ∞ (some large constant). This ensures that the action never gets assigned to nodes without the accelerator.

With this in place, we can account for the total load on resource r at node n and ensure that it never exceeds the provisioned capacity Prov_{n,r}:

∀n, r : load_{n,r} = (∑_{a,i} a_{i,n} × F_{a,r,n}) / Prov_{n,r} ≤ 1    (4)

Given these constraints, we can consider different management objectives: (1) minimizing the cost to provision the network, min ∑_{n,r} Prov_{n,r}, to handle a given set of traffic patterns, or (2) having chosen a provisioning regime, load balancing to minimize the maximum load across the network, min max_{n,r} {load_{n,r}}, under the current workload.

Now, the above model is functionally complete. It faithfully captures (a) the reuse of common actions across applications, (b) policy dependencies across applications, and (c) the use of specialized hardware capabilities. However, this model is woefully impractical. Constructing a session-level formulation is tedious; worse still, solving it involves a large discrete optimization problem which is theoretically intractable.

4.3 A Practical Reformulation

Next, we reformulate the above management problem under a slightly constrained operational model. While this is not theoretically optimal, it is tractable and has near-optimal performance in practice (Section 7.4).

The main idea in this alternative model is that all applications pertaining to a given session run on the same node. That is, if some session i needs to be processed by applications m1 and m2 (and nothing else), then we force both m1 and m2 to analyze i on the same node. For example, m1 (say an IDS) processes HTTP and UDP traffic; m2 (say a WAN optimizer) processes HTTP and NFS traffic. Now, consider an HTTP session i. In the strawman, we could run m1 on node n1 and m2 on node n2 for i. The new model, however, will run both m1 and m2 for i on n1. Note that we can still assign different sessions to other nodes; for a different HTTP session i′, m1 and m2 could run on n2. The key difference is that the strawman model has an extra degree of freedom: it can choose to replicate common tasks if doing so is optimal. Appendix A shows a corner case in which the strawman solution could be better than this hyperapp model.

⁶For a bi-directional session, path ordering is based on the forward or initiating direction.

Figure 5: Each hyperapp is a single logical task whose footprint is equivalent to taking the logical union of its constituent actions. In this example, hyperapp1 = m1 & m2 covers HTTP traffic, hyperapp2 = m1 covers UDP traffic, and hyperapp3 = m2 covers NFS traffic; each hyperapp's footprint on each resource folds in the cost of the shared common action.

Under this operational model, for each class c we identify the exact sequence of applications that run on sessions in c. We call each such sequence a hyperapp. Formally, if h_c is the hyperapp for the traffic class c, then ∀m : T_c ∈ T_m ⇔ m ∈ h_c. (Different classes could have the same hyperapp.) Each hyperapp also statically defines the policy order across its constituent applications. Figure 5 shows the three hyperapps for the previous example: one for HTTP traffic (processed by both m1 and m2), and one each for UDP and NFS traffic (processed by either m1 or m2 but not both).
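As a concrete illustration of this mapping, a minimal sketch (application and class names are those of the running example; the hyperapp's internal order here is simply the listing order, standing in for the policy order):

```python
# Sketch: derive the hyperapp for each traffic class from per-application
# subscriptions (Section 4.3). Names and classes are the illustrative
# ones from the text, not from the paper's implementation.

subscriptions = {            # application -> traffic classes it must see
    "m1_ids":    {"HTTP", "UDP"},
    "m2_wanopt": {"HTTP", "NFS"},
}

def hyperapps(subs):
    """Return class c -> tuple of applications (the class's hyperapp h_c)."""
    classes = set().union(*subs.values())
    return {c: tuple(m for m in subs if c in subs[m]) for c in classes}

for cls, h in sorted(hyperapps(subscriptions).items()):
    print(cls, "->", " + ".join(h))
# HTTP -> m1_ids + m2_wanopt ; NFS -> m2_wanopt ; UDP -> m1_ids
```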

This new model serves three practical purposes. First, it provides an alternative way to capture the savings from reusing common actions. Specifically, it eliminates the need to model discrete actions and their dependencies in Eq(1). Similar to the per-session resource footprint F_{a,r,n} of a discrete action a on resource r, we can define the per-session hyperapp footprint of the hyperapp h on resource r as F_{h,r,n}. This implicitly accounts for the common actions across applications within h. Note that the RHS of Figure 5 does not show the common action; instead, we include the costs of the common action when computing the F values for each hyperapp. As in the strawman, we allow the F values to capture the availability of hardware accelerators. This requires us to provide the hyperapps and their F values as part of the inputs in AppSpec and BoxSpec. We do so by explicitly enumerating all possible hyperapp sequences, which requires time exponential in the number of applications. Fortunately, this is a one-time task and there are only a handful of applications (< 10), as Table 1 shows.

Second, it obviates the need to explicitly model the ordering constraints across applications in Eq(3). Because all the applications relevant to a session run on the same physical node, enforcing policy ordering becomes a simpler local scheduling decision. This can be delegated to each CoMb box (Section 5).

Third, it simplifies our traffic model. Instead of specifying each discrete session in Eqs(2) and (4), we can consider the total volume of traffic in each class. This means we can aggregate the discrete per-session variables for each action a into continuous variables d_{c,n} specifying the fraction of the traffic belonging to class c that each node n has to process (i.e., run the hyperapp h_c on). Let |T_c| denote the volume of traffic in class c.

Minimize max_{r,n} {load_{n,r}}, subject to    (5)

∀n, r : load_{n,r} = ∑_{c : n ∈ path_c} d_{c,n} |T_c| F_{h_c,r,n} / Prov_{n,r}    (6)

∀c : ∑_{n ∈ path_c} d_{c,n} = 1    (7)

∀c, n : 0 ≤ d_{c,n} ≤ 1    (8)

With the above reformulation, the optimization problem can be expressed as the linear program in Eq(5)–Eq(8). (For brevity, we only show the load balancing objective.) The controller solves the optimization to find the optimal values of the d_{c,n}s. Then it maps these d values into suitable device-level configurations for each middlebox n. From a design viewpoint, we do not require a specific implementation; we discuss two alternatives in Section 6.
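For illustration, the following sketch expresses Eq(5)–Eq(8) using the open-source PuLP modeling library as a stand-in for the CPLEX-based controller; the topology, volumes, footprints, and capacities are toy values, and the load ≤ 1 capacity bound follows Eq(4).

```python
# Sketch: the load-balancing LP of Eq(5)-(8) on a toy 3-node topology.
# PuLP stands in for CPLEX here; all input data is illustrative.
import pulp

nodes = ["N1", "N2", "N3"]
paths = {"P1": ["N1", "N2"], "P2": ["N2", "N3"], "P3": ["N3", "N1"]}
volume = {"P1": 30, "P2": 15, "P3": 15}                  # |T_c| per class
footprint = {(c, n): 1.0 for c in paths for n in nodes}  # F_{h_c,r,n}
prov = {n: 60.0 for n in nodes}                          # Prov_{n,r}, one resource

lp = pulp.LpProblem("comb_load_balance", pulp.LpMinimize)
d = {(c, n): pulp.LpVariable(f"d_{c}_{n}", 0, 1)         # Eq(8)
     for c in paths for n in paths[c]}
max_load = pulp.LpVariable("max_load", lowBound=0)
lp += max_load                                            # objective, Eq(5)

for n in nodes:                                           # Eq(6) + capacity
    load = pulp.lpSum(d[c, n] * volume[c] * footprint[c, n] / prov[n]
                      for c in paths if n in paths[c])
    lp += load <= 1
    lp += load <= max_load
for c in paths:                                           # Eq(7): fully assigned
    lp += pulp.lpSum(d[c, n] for n in paths[c]) == 1

lp.solve(pulp.PULP_CBC_CMD(msg=False))
for (c, n), var in sorted(d.items()):
    print(c, n, round(var.value(), 2))
```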

5 CoMb Single-box Design

We now turn to the design of a single CoMb box. As described in the earlier sections, the output of the network controller is an assignment of processing responsibilities to each CoMb box. This assignment specifies:

• a set of (traffic class, fraction) pairs {(T_c, d_{c,n})} that describes what traffic (type and volume) is to be processed by the CoMb box n in question, and

• the hyperapp h_c associated with each traffic class T_c, where each hyperapp is an ordered set of one or more middlebox applications.

We start with our overall system architecture and then describe how we parallelize this architecture over a CoMb box's hardware resources.

Figure 6: Logical view of a CoMb box. Incoming traffic is classified at the NIC into virtual queues, each queue is handled by a policy shim (pshim) that enforces the application order, and the applications run on the box's cores.

Figure 7: An example with two hyperapps, m1 ≺ m2 and m2 ≺ m3; hyperapp-per-core clones m2.

5.1 System Architecture

At a high level, packet processing within a CoMb box comprises three logical stages, as shown in Figure 6. An incoming packet must first be classified to identify what traffic class T_c it belongs to. Next, the packet is handed to a policy enforcement layer responsible for steering the packet between the different applications corresponding to the packet's traffic class, in the appropriate order. Finally, the packet is processed by the appropriate middlebox application(s). Of these, classification and policy enforcement are a consequence of our consolidated design, and hence we aim to make these as lightweight as possible. We elaborate on the role and design options for each stage next.


Classification: The CoMb box receives a serial stream of undifferentiated packets. Since different packets may be processed by different applications, we must first identify what traffic class a packet belongs to. There are two broad design options here. The first is to do the classification in hardware. Many commercial appliances rely on custom NICs with ASICs for sophisticated high-speed classification, and even commodity server NICs today support such capabilities [5]. A common feature across these NICs is that they support a large number of hardware queues (on the NIC itself) and can be configured to triage incoming packets into these queues using certain functions (typically exact-, prefix-, and range-matches) defined on the packet headers. The second option is software-based classification – incoming packets are classified entirely in software and placed into one of multiple software queues.

The tradeoff between the two options is one of efficiency vs. flexibility. Software classification is fully general and programmable but consumes significant processing resources; e.g., Ma et al. report general software-based classification at 15 Gbps (comparable to a commodity NIC) on an 8-core Intel Xeon X5550 server [40].

Our current implementation assumes hardware classification. From an architectural standpoint, however, one can view the two options as equivalent in the abstraction they expose to the higher layers: multiple (hardware or software) queues with packets from a traffic class T_c mapped to a dedicated queue.

We assume that the classifier has at least as many queues as there are hyperapps. This is reasonable since existing commodity NICs already have 128/256 queues per interface, specialized NICs have even more, and software-based classification can define as many as needed; with 6 applications, the worst-case number of hyperapps is 2^6 = 64.

A final question is whether the middlebox receives packets that it has not been assigned to process. We defer this to Section 6.

Policy Enforcer: As mentioned, the job of the policy enforcement layer is to 'steer' a packet p in the correct order between the different applications associated with the packet's hyperapp. Why is this needed? The applications on a CoMb box could come from independent vendors, and we want to run applications such that they are oblivious to our consolidation. Hence, for a hyperapp comprised of (say) an IDS followed by a proxy, the IDS application would not know to send the packet to the proxy for further processing. Since we do not want to modify applications, we introduce a lightweight policy shim (pshim) layer.

We leverage the above classification architecture to design a very lightweight policy enforcement layer. We simply associate a separate instance of a pshim with each output queue of the classifier. Since each queue only receives packets for a single hyperapp, the associated pshim knows that all the packets it receives are to be routed through the identical sequence of applications.

Thus, beyond retaining the sequence of applications for its associated hyperapp/traffic class, the pshim does not require any complex annotation of packets or state-keeping. In fact, if the hyperapp consists of a single application, the pshim is essentially a NOP.
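The per-queue steering logic is therefore little more than a loop over a fixed application list. A minimal sketch follows (in Python for brevity; the actual pshim is a kernel-mode Click element, and the queue and application hooks here are hypothetical):

```python
# Sketch: a pshim instance bound to one NIC queue / one hyperapp
# (Section 5.1). The queue and the applications' process() hook are
# illustrative stand-ins; the real pshim is a kernel-mode Click element.

class PShim:
    def __init__(self, rx_queue, apps, tx):
        self.rx_queue = rx_queue    # NIC queue carrying one traffic class
        self.apps = apps            # applications in policy order
        self.tx = tx                # transmit function for surviving packets

    def run_once(self):
        pkt = self.rx_queue.pop() if self.rx_queue else None
        if pkt is None:
            return
        # Steer the packet through the hyperapp's applications in order;
        # an application that drops the packet (returns None) ends the chain.
        for app in self.apps:
            pkt = app.process(pkt)
            if pkt is None:
                return
        self.tx(pkt)
```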

Applications: Our design supports two application software architectures: (1) standalone software processes (that run with little/no modification) and (2) applications built atop an 'enhanced' network stack with reusable software modules for common tasks such as session reconstruction and protocol parsing, as described in Section 6. We currently assume that applications using custom accelerators access them using their own libraries.

5.2 Parallelizing a CoMb box

We assume a CoMb box offers a number of parallel computation cores – such parallelism exists in general-purpose servers (e.g., our server-based prototype uses 8 x86 'Westmere' cores) and is even more prevalent in specialized networking hardware (e.g., Cisco's QuantumFlow packet processor offers 40 Tensilica cores). We now describe how we parallelize the functional layers described earlier on this underlying hardware.

Parallelizing the classifier: Since we assumed hardware classification, our classifier runs on the NIC and does not require parallelization across cores. We refer the reader to [40] for a discussion of how a software-based classifier might run on a multi-core system.

Parallelizing a single hyperapp: Recall that a hyperapp is really a logical entity – a sequence of middlebox applications that all need to process a packet. The two options we have in parallelizing a hyperapp are (Figure 7):

1. App-per-core: each application belonging to the hyperapp is run on a separate core and the packet is steered between cores.

2. Hyperapp-per-core: all applications belonging to the hyperapp are run on the same core; hence a given application is cloned with as many instances as the number of hyperapps in which it appears.

The advantage of the second approach over the first is that a packet is processed in its entirety on a single core, avoiding the overhead of inter-core communication and cache invalidations that may arise as shared state is accessed by multiple cores. (This overhead occurs more frequently for applications built to reuse processing modules in a common stack.) The disadvantage of hyperapp-per-core relative to app-per-core is that it could incur overhead due to context switches and potential contention over shared resources (e.g., data and instruction caches) on a single core. Which way the scale tips depends on the overheads associated with inter-core communication, context switches, etc., which vary across hardware platforms. We ran a number of tests (different applications and hyperapp scenarios) on our prototype server (Section 7) and found that the hyperapp-per-core approach consistently offered superior or comparable performance.

Table 2 shows a sample test result for two synthetic hyperapps: (1) multi-IDS uses k instances of a Snort process in sequence and (2) multi-loopback uses k instances of a simple loopback process. We pick these as representing two extremes of resource intensiveness – Snort is both CPU and memory intensive, while the loopback consumes negligible CPU/memory. We choose to chain instances of a single application for simplicity (experiments using a diverse set of functions yielded similar results).

k (#fns per   Throughput (Mpps)
hyperapp)     Multi-Snort              Multi-Loopback
              fn/core   hypapp/core    fn/core   hypapp/core
2             1.32      1.38           –         –
4             0.66      0.67           1.62      1.48
6             0.43      0.42           1.02      0.93

Table 2: Total throughput as a function of the length of the hyperapp chain on a 12-core XXX GHz foo platform. The throughput is lower with longer chains as the per-packet processing increases.

Table 2 shows throughput vs. k. We see that for more realistic applications (multi-Snort), the hyperapp-per-core strategy performs better than or nearly identical to the application-per-core strategy. The worst case for the hyperapp-per-core approach is actually the trivial loopback application, and even in that case throughput drops by at most 8%. We note that these results are consistent with independent results for parallelizing application modules in software routers [30]. In light of our experiments and these independent results, we choose the hyperapp-per-core model because it simplifies the parallelization of the pshim (see below) and ensures core-local access to reusable data structures.

Parallelizing the pshim layer: This leaves us with the question of how we parallelize the policy enforcement layer. Recall that we had decided to have a separate instance of a pshim for each hyperapp. Given the hyperapp-per-core approach, parallelizing it is trivial. We simply assign each pshim instance to run co-located on the same core as its associated hyperapp.

Parallelizing multiple hyperapps: We are left with one outstanding question: given multiple hyperapps, how many cores, or what fraction of a core, do we assign to each? The reason we might create multiple instances of a single hyperapp is if the total workload for some hyperapp exceeds the processing capacity of a single core. That is, given the total traffic that node n needs to process for hyperapp h, ∑_{c : h_c = h} d_{c,n} |T_c|, and the per-packet CPU footprint F_{h,CPU,n}, we need to instantiate hyperapp h on multiple cores if needed. At the same time, we also want to avoid a skewed allocation across cores.

Figure 8: CoMb box: putting the pieces together. Packets are classified at the NIC into queues (Q1–Q5), each queue feeds a pshim pinned to a core, and each core runs one or more hyperapp instances; a hyperapp whose workload exceeds one core (e.g., HyperApp3) is instantiated on multiple cores.

This hyperapp-to-core mapping problem can be expressed as a simple linear program that assigns a fraction of the traffic relevant to h to each core. There are two constraints: one to ensure that no core exceeds its capacity, and another to ensure that each hyperapp's processing work is completely assigned. Let g_{h,j} denote the fraction (∈ [0,1]) of the traffic relevant to h that should be assigned to core j. Eq(10) ensures that each core j does not exceed its processing capacity Proc_j, and Eq(11) ensures that each hyperapp's processing work is assigned. In practice, this calculation need not occur at the CoMb box; the controller can run this optimization and push the resulting configuration.

min max_j {cpuload_j}    (9)

∀j : cpuload_j = ∑_h g_{h,j} |T_{h,n}| F_{h,CPU,n} / Proc_j ≤ 1    (10)

∀h : ∑_j g_{h,j} = 1    (11)

∀h, j : 0 ≤ g_{h,j} ≤ 1    (12)

5.3 Recap and Discussion

Combining the previous design decisions brings us to the design in Figure 8. We see that:

• Each core is assigned one or more hyperapps; all applications within a hyperapp run on the same core, and hyperapps whose total workload exceeds a single core's capacity are instantiated on multiple cores (e.g., HyperApp3 in Figure 8).

• Incoming packets are classified at the NIC and placed into one of multiple NIC queues; each traffic class is assigned to one or more queues, and different traffic classes are mapped to different queues.

• Each hyperapp instance has its corresponding pshim instance; the pshim is pinned to the same core as its associated hyperapp and reads packets from a dedicated NIC queue; e.g., HyperApp3 in Figure 8 runs on Core2 and Core3 and has two separate pshims.⁷

The resultant design has several desirable properties conducive to achieving high performance:

• a packet is processed in its entirety on a single core (avoiding inter-core synchronization overheads);
• we introduce no shared data structures across cores (avoiding needless cache invalidations);
• there is no contention for access to NIC queues (avoiding the overhead of locking); and
• policy enforcement is lightweight (stateless, and requiring no marking or modification of packets).

6 Implementation

In this section, we describe prototype implementations of the different components in the CoMb architecture.

6.1 CoMb Controller

We implement the controller's algorithms using an off-the-shelf solver (CPLEX). The controller periodically runs an optimization that takes as inputs: the current per-application-port traffic matrix per ingress-egress pair, the traffic of interest to each application, the policy ordering and hardware-accelerator constraints for each application, and the resource footprints (per-application for standalone applications and per-action for modular applications). The controller runs a pre-processing step to generate the hyperapps and their effective resource footprints, taking into account the affinity of actions/applications for specific accelerators.

After running the optimization, it maps the d_{c,n} values to a device-level configuration in one of two ways. If the CoMb box has TCAM-like classification [5], the controller maps each d_{c,n} into a set of (non-overlapping) traffic filters. As a simple example, suppose c denotes all traffic from sources in 10.1.0.0/16 to destinations in 10.2.0.0/16, and d_{c,n1} = d_{c,n2} = 0.5. Then the filter for n1 is 〈10.1.0.0/17, 10.2.0.0/16〉 and the filter for n2 is 〈10.1.128.0/17, 10.2.0.0/16〉. (One subtle issue is that the controller also installs filters corresponding to traffic in the reverse direction.) Note that if each CoMb box is off-path [16], these filters can be pushed to the upstream router/switch. If the NIC doesn't support such filters, or has a limited number of filter entries, the controller uses a hash-based configuration [38]. For the above example, it sends n1 = 〈10.1.0.0/16, 10.2.0.0/16, hash ∈ [0, 0.5]〉 and n2 = 〈10.1.0.0/16, 10.2.0.0/16, hash ∈ [0.5, 1]〉. Each device's pshim does a fixed-length (/16) prefix lookup, computes a direction-invariant hash of the IP 5-tuple [29], and checks whether it falls in the assigned range.

⁷The traffic split between the two instances of HyperApp3 also occurs in the NIC using filters, as in Section 6.

Figure 9: Our modular middlebox implementation. Signature IDS, flow monitoring, cache, and load balancer applications are built over reusable packet capture, session reconstruction, and HTTP/TFTP/NFS protocol-parser modules.
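As an illustration of the hash-based variant, here is a minimal sketch (the /16 prefixes and hash range come from the controller; the hash below is an illustrative direction-invariant stand-in, not the function of [29], and the real check runs inside the kernel-mode Click pshim):

```python
# Sketch: the pshim's hash-based "should I process this packet?" check
# (Section 6.1). All concrete values mirror the example in the text.
import hashlib
import ipaddress

def direction_invariant_hash(src_ip, dst_ip, src_port, dst_port, proto):
    """Map a 5-tuple to [0,1); both directions of a session hash alike."""
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    digest = hashlib.sha1(f"{lo}|{hi}|{proto}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def assigned(pkt, cfg):
    """cfg = (src /16, dst /16, (low, high)) pushed by the controller."""
    src_net, dst_net, (low, high) = cfg
    if ipaddress.ip_address(pkt["src_ip"]) not in src_net:
        return False
    if ipaddress.ip_address(pkt["dst_ip"]) not in dst_net:
        return False
    h = direction_invariant_hash(pkt["src_ip"], pkt["dst_ip"],
                                 pkt["src_port"], pkt["dst_port"],
                                 pkt["proto"])
    return low <= h < high

cfg_n1 = (ipaddress.ip_network("10.1.0.0/16"),
          ipaddress.ip_network("10.2.0.0/16"), (0.0, 0.5))
pkt = {"src_ip": "10.1.3.4", "dst_ip": "10.2.8.9",
       "src_port": 51512, "dst_port": 80, "proto": 6}
print(assigned(pkt, cfg_n1))
```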

6.2 CoMb box prototype

We prototype a CoMb box on a general-purpose server (without accelerators) with two Intel Westmere CPUs, each with four cores at 3.47GHz (X5677), and 48GB memory, configured with four Intel 82599 10 GigE NIC ports [5], each capable of supporting up to 128 queues, running Linux (kernel v2.6.24.7).

Classification: We leverage the classification capabilities on the NIC. The NIC classifies the packets and demultiplexes them into separate hardware queues based on the filters (from the controller) for each hyperapp. The 82599 NIC supports 32K classification entries over: src/dst IP addresses, src/dst TCP/UDP ports, IP protocol, VLAN header, and a flexible 2-byte tuple anywhere in the first 64 bytes of the packet. We currently use only the address and port fields to create filter entries.

Policy Enforcer: We implement the pshim in kernel-mode SMP-Click [31] following the design in Section 5. In addition to policy enforcement, the pshim implements two additional functions: (1) creating interfaces for the application processes to receive/send packets from/to (see below), and (2) the (optional) hash-based check described above to decide whether to process or ignore a specific packet.

6.3 CoMb applications

Our prototype supports two application architectures: modular middlebox applications in Click and standalone middlebox processes (e.g., Snort, Squid).

Modular middlebox applications: As a proof-of-concept prototype, we implement a signature-based intrusion detection system, flow-level monitoring, a caching proxy, and a load balancer as user-level modules in Click (Figure 9). As such, our focus is to demonstrate the feasibility of building modular middlebox applications and to establish the potential for reuse. (We leave it to future work to explore the choice of an ideal software architecture and an optimal set of reusable modules.)

To implement these applications, we port the session reconstruction (fragment/TCP reassembly) logic and protocol parsers (for HTTP and NFS) from Bro [34]. We implement a custom flow monitoring system. We realize a signature-based IDS by porting Bro's signature matching module. We also built a simple custom Click module for TFTP traffic. The load balancer is a layer-7 application that assigns HTTP requests to different backend servers by rewriting packets. The cache mimics the actions of a caching proxy (i.e., storing and looking up requests in a cache), but does not rewrite packets.

While Bro's modular design made it a very useful starting point, its intended use is as a standalone IDS, while CoMb envisions reusing modules across multiple applications from different vendors. This led to one key difference. Modules in Bro are tightly integrated; lower layers are aware of the higher layers using them and "push" data to them. We avoid this tight coupling between the modules and instead implement a "pull" model where lower layers expose well-defined interfaces through which higher-layer functions obtain the relevant data structures.
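To illustrate the push-vs-pull distinction, a minimal sketch follows (the interface names are hypothetical; the real modules are Click elements ported from Bro):

```python
# Sketch: "pull"-style reuse of a lower-layer module (Section 6.3).
# Instead of the session layer pushing data up to every registered
# consumer (Bro's model), each application asks the shared layer for
# the state it needs, so the lower layer never has to know its users.

class SessionLayer:
    """Shared session-reconstruction module (one instance per core)."""
    def __init__(self):
        self._sessions = {}          # 5-tuple key -> reassembled byte stream

    def update(self, key, payload):
        self._sessions.setdefault(key, bytearray()).extend(payload)

    def get_stream(self, key):
        # Well-defined accessor that higher layers call on demand.
        return bytes(self._sessions.get(key, b""))

class SignatureIDS:
    def __init__(self, session_layer, signatures):
        self.sessions = session_layer
        self.signatures = signatures

    def inspect(self, key):
        stream = self.sessions.get_stream(key)   # pull, not push
        return [s for s in self.signatures if s in stream]

sess = SessionLayer()
ids = SignatureIDS(sess, [b"evil-payload"])
sess.update(("10.1.3.4", 51512, "10.2.8.9", 80, 6), b"GET /evil-payload")
print(ids.inspect(("10.1.3.4", 51512, "10.2.8.9", 80, 6)))
```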

Supporting standalone applications: Last, we focus on how a CoMb box supports standalone middlebox applications (e.g., Snort, Squid). We run the standalone applications as separate processes over our pshim (which runs in kernel-mode Click). The pshim copies packets into a shared memory region, readable by these application processes. Application processes can access these packets in one of two modes. If we have access to the application source, we use minor source modifications; e.g., in Snort we replace libpcap calls with a memory read from this shared region. Otherwise, we emulate virtual network interfaces to run binary-only applications where we do not have access to the source.

7 Evaluation

Our evaluation addresses the following high-level questions w.r.t. the benefits and overhead of CoMb:

• Single-box benefits: What reuse benefits does consolidating applications on the same box provide? (Section 7.1)
• Single-box overhead: Does consolidating applications affect performance and extensibility? (Section 7.2)
• Network-wide benefits: What are the benefits that network administrators can realize using CoMb? (Section 7.3)
• Network-wide overhead: How practical and efficient is CoMb's controller? (Section 7.4)

Application      Dependency chain    Contribution (%)
Flowmon          Session             73
Signature        Session             26
Load Balancer    HTTP, Session       88
Cache            HTTP, Session       54
Cache            NFS, Session        50
Cache            TFTP, Session       36

Table 3: Contribution of reusable modules

7.1 Potential for reuse

First, we measure the potential for processing reuse by refactoring middlebox applications. As Section 3.2 showed, the savings from reuse depend both on the processing footprints of the reusable modules and on the specific traffic patterns/overlap. Here, we focus only on the former and defer the combined effect to the network-wide evaluation (Section 7.3). We use real packet traces (with full payloads) for these benchmarks.⁸ Because we are only interested in the relative contribution, we run these benchmarks with a single user-level thread in Click. We use PAPI⁹ to measure the number of CPU cycles per packet each module uses. Note that an application like Cache uses different processing chains (e.g., Cache-HTTP-session vs. Cache-NFS-session); the relative contribution depends on the sequence. Table 3 shows that the reusable modules contribute a significant fraction, 26–88%, of the overall processing across the different applications.

7.2 CoMb single-box performance

We tackle three concerns in this section: (1) What overhead does CoMb add for running individual applications? (2) Does CoMb scale well as traffic rates increase? (3) Does application performance suffer when administrators want to add new functionality?

For the following experiments, we report throughput measurements using the same full-payload packet traces from Section 7.1 on our prototype CoMb server with two Intel Westmere CPUs, each with four cores at 3.47GHz (X5677), and 48GB memory. (The results are consistent with other synthetic traces as well.)

⁸From https://domex.nps.edu/corp/scenarios/2009-m57/net/; we are not aware of other traces with full payloads.
⁹http://icl.cs.utk.edu/papi/


Application architecture        Overhead (%)
(instance)                      Shim-simple    Shim-hash
Standalone (Snort)              -61            -58
Modular (IPSec)                 0              0.73
Modular (RE [12])               0              0.62

Table 4: Performance overhead of the shim layer for different middlebox applications

7.2.1 Shim Overhead

Recall from Section 6 that CoMb supports two types of middlebox software: (1) standalone applications (e.g., Snort), and (2) modular applications in Click. Table 4 shows the overhead of running a representative middlebox application from each class in CoMb on a single core of our platform. We show two scenarios: one where all classification occurs in hardware (labeled shim-simple) and one where the pshim runs an additional hash-based check as discussed in Section 6 (labeled shim-hash). For middlebox modules in Click, shim-simple imposes zero overhead. Interestingly, the throughput for Snort is better than its native performance. The reason is that Click's packet capture routines are more efficient than native Snort's (libpcap or daq). We also see that shim-hash adds only a small overhead over shim-simple. This result shows that running applications in CoMb imposes minimal overhead.

7.2.2 Performance under consolidation

Next, we study the effect of adding more cores and adding more applications. For brevity, we only show results for shim-simple. For these experiments, we use a standalone application process running the Snort IDS. To emulate adding new functionality, we create duplicate instances of Snort. We found similar results with heterogeneous applications too. At a high level, we find that consolidation in CoMb does not introduce contention bottlenecks across applications. This may surprise some. A detailed understanding of contention effects is an independent topic of interest, and a parallel submission takes an in-depth look at the issue, explaining why contention effects have minimal impact for networking applications on x86 hardware [25].

Scaling: Figure 10 shows the effect of adding more cores to the platform with a fixed hyperapp of length two (i.e., two Snort processes in sequence).

As a point of comparison, we also evaluate a virtual middlebox appliance architecture [15], where each Snort instance runs in a separate VM on top of the Xen VMM hypervisor. To provide high I/O throughput to the VM setup, we utilize the SR-IOV capability in the hardware [8]. We confirmed that I/O was not a bottleneck; we were able to achieve a throughput of around 7.8 Gbps on a single VM with a single CPU core, which is consistent with state-of-the-art VM-based I/O numbers [41]. Further, we use the vSwitching capability of the NIC to transfer packets between multiple VM-based application instances [5]. (Unlike CoMb, where we interpose a Click-based shim between applications.) As in Section 5.2, we need to decide between the app-per-core vs. hyperapp-per-core design for the VM setup. We saw that app-per-core is significantly better (2×) for the VM case because context switches between VMs are expensive, and because switching between VMs is in hardware in our setup (i.e., vSwitching), the overhead of moving packets across cores is negligible (not shown). Thus, we conservatively use the app-per-core design for the VM setup.

Figure 10: Throughput (Mbps) vs. number of cores available, for CoMb and the VM-based setup.

Figure 11: Throughput reduction (normalized) as more applications need to run on each packet, for CoMb, the VM-based setup, and the ideal curve.

We make three main observations. First, CoMb's throughput with this real IDS/IPS (typically considered very resource intensive) is >10 Gbps on our 8-core platform; this is comparable to vendor datasheets [3]. Second, CoMb exhibits a reasonable scaling property, similar to prior results on multi-core platforms [11]. This suggests that adapting CoMb to higher traffic rates simply requires a hardware platform with more processing cores, and does not need any significant re-engineering. Finally, CoMb's throughput is 5× better than the VM case. While the performance of virtual network appliances is under active research, these results are consistent with state-of-the-art numbers [2].

Adding more functionality: Figure 11 evaluates the impact of running more applications per packet, e.g., in response to policy changes. Here, we normalize throughput w.r.t. a single application. The ideal throughput degradation as we add more applications is the 1/k curve: given the same resources, running k applications requires k times as much work. CoMb's normalized throughput is marginally better than this ideal curve because consolidation amortizes fixed costs w.r.t. packet capture and copying packets to the applications. Even the VM case is only marginally worse than ideal. (This further suggests that our VM-based setup is close to ideal, without any serious bottlenecks.) This confirms that CoMb allows administrators to easily add new middlebox functionality.
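To state the ideal curve precisely (a worked restatement of the argument above; the notation T(k) for the throughput when k applications process each packet is ours, not the paper's):

\[
\frac{T_{\mathrm{ideal}}(k)}{T(1)} = \frac{1}{k},
\]

since the same fixed set of cores must now perform k times the per-packet processing work, per-packet throughput scales down by exactly a factor of k in the ideal case.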

7.3 CoMb’s Network-wide benefits

Setup: Next, we evaluate the network-wide benefits that CoMb offers via reuse, multiplexing, and spatial distribution. For this evaluation, we use real topologies from educational backbones and the Enterprise network, and PoP-level AS topologies from Rocketfuel. To obtain realistic time-varying traffic patterns, we use the following approach. We use traffic matrices for Internet2 (http://www.cs.utexas.edu/~yzhang/research/AbileneTM) to compute empirical variability distributions for each element in a traffic matrix; e.g., the probability that the volume is between 0.6× and 0.8× the mean. Then, using these empirical distributions, we generate time-varying traffic matrices for the remaining AS-level topologies, using a gravity model to capture the mean volume [36]. For the Enterprise network, we replay real traffic matrices.
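The sketch below illustrates one way to implement this generation step (Python with NumPy; the variable names, the 0.6–1.4 range used to fabricate the "observed" matrices, and the per-element independent sampling are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def gravity_mean_tm(node_volumes):
    """Gravity model: mean demand from i to j proportional to vol_i * vol_j."""
    v = np.asarray(node_volumes, dtype=float)
    return np.outer(v, v) / v.sum()

def variability_samples(observed_tms, mean_tm):
    """Per-element multipliers (observed / mean) from a measured TM series,
    e.g., the Internet2 matrices; these form the empirical distributions."""
    return np.stack([tm / mean_tm for tm in observed_tms])   # shape: (samples, n, n)

def sample_epoch_tm(mean_tm, multipliers, rng):
    """Draw one epoch's TM: scale each element of the gravity-model mean by a
    multiplier drawn independently from that element's empirical distribution."""
    idx = rng.integers(multipliers.shape[0], size=mean_tm.shape)
    elem_mult = np.take_along_axis(multipliers, idx[None, ...], axis=0)[0]
    return mean_tm * elem_mult

rng = np.random.default_rng(0)
mean = gravity_mean_tm([10.0, 20.0, 5.0])                      # hypothetical PoP volumes
observed = [mean * rng.uniform(0.6, 1.4, mean.shape) for _ in range(50)]  # stand-in for measured TMs
mults = variability_samples(observed, mean)
epoch_tms = [sample_epoch_tm(mean, mults, rng) for _ in range(200)]       # 200 epochs, as in the evaluation
```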

In the following results, we report the benefits that CoMb provides relative to today's standalone middlebox deployments with the four applications from Table 3: flow monitoring, load balancer, IDS, and cache. To emulate current deployments, we use the same applications but without reusing modules. For each application, we use public configurations to identify the application ports of the traffic they process. To capture changes in per-port volume over time, we replay the empirical variability based on flow-level traces from Internet2. We begin with a scenario where all four applications can be spatially distributed, and then consider the case where two of them are topologically constrained.

Provisioning: With the above setup, we consider a provisioning exercise from Section 4 to minimize the resources needed to handle the time-varying traffic patterns (across 200 epochs). The metric of interest here is the relative savings that CoMb provides vs. today's deployments where all applications run as independent devices only at the ingress: Cost_standalone,ingress / Cost_CoMb. (Cost here represents ∑_{n,r} Prov_{n,r} from Section 4.) We try two CoMb configurations: with and without reusable modules. In the latter case, the middlebox applications share the same hardware but not software. Figure 12 shows that across the different topologies CoMb with reuse provides 1.8–2.5× savings relative to today's deployment strategies. For the Enterprise setting, even CoMb without reuse provides close to 1.8× savings.

Figure 12: Reduction in provisioning cost with CoMb (relative savings per topology, with and without reuse)

Topology     Unconstrained   Two-step   Ingress-only
Internet2    1.81            1.62       1.41
Geant        2.20            1.71       1.42
Enterprise   2.58            1.76       1.45
AS1221       2.17            1.69       1.41
AS3257       1.85            1.63       1.42
AS1239       2.11            1.69       1.43

Table 5: Relative savings in provisioning when Cache and Load balancer are spatially constrained

Figure 13: Impact of spatial distribution on CoMb's reduction in provisioning cost (relative savings per topology for the Path, Edge, and Ingress strategies)

Figure 13 studies the impact of spatial distribution by comparing three strategies for distributing middlebox responsibilities: the full path (labeled Path), either ingress or egress (labeled Edge), or only the ingress (labeled Ingress). Interestingly, Edge is very close to Path. To explore this further, we also tried a strategy of picking a random second node for each path. We found that this is again very close to Path (not shown). In other words, for Edge the egress is not special; the key is having one more node to distribute the load. We conjecture that this is akin to the "power of two random choices" observation [28] and plan to explore this in future work.

Load balancing: Equally of interest is the benefit that CoMb provides in adapting to changing traffic workloads under a fixed provisioning strategy. Here, our metric of interest is the maximum load across the network, and we measure the relative benefit as MaxLoad_standalone,ingress / MaxLoad_CoMb. We consider two network-wide provisioning strategies: each location is provisioned with the same resources (labeled Uniform), or with resources proportional to the average volume it sees (labeled Volume). For the standalone case, we assume resources are split between applications proportional to their workload. Volume- and workload-proportional provisioning likely reflects current practice. We consider the Uniform case because it is unclear if the Volume strategy is always better; e.g., it could be better on average, but have worse "tail" performance (see Figure 14).

Figure 14: Relative reduction in the maximum load (distribution of the relative savings per topology, for Uniform and Volume provisioning)

As before, we generate time-varying traffic patterns over 200 epochs. For each epoch, we measure the above relative load metric. For each topology, Figure 14 summarizes the distribution of this metric (across epochs) with a box-and-whiskers plot showing the 25%ile, median, and 75%ile (box), and the min/max values (whiskers). We see that CoMb reduces the maximum load by >2×, and the reduction can be as high as 25×, suggesting that CoMb can better handle traffic variability compared to current middlebox deployments.
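For concreteness, the per-epoch metric and the box-plot summary statistics can be computed along the following lines (Python; the per-node load vectors for the standalone and CoMb assignments are assumed inputs produced by the respective provisioning models, and are not defined here):

```python
import numpy as np

def relative_max_load(standalone_node_loads, comb_node_loads):
    """One epoch's metric: max node load under standalone-at-ingress
    divided by max node load under CoMb's distributed assignment."""
    return np.max(standalone_node_loads) / np.max(comb_node_loads)

def boxplot_summary(per_epoch_ratios):
    """Summary across epochs: min/max (whiskers) and 25%ile, median, 75%ile (box)."""
    r = np.asarray(per_epoch_ratios, dtype=float)
    return {
        "min": float(r.min()),
        "p25": float(np.percentile(r, 25)),
        "median": float(np.median(r)),
        "p75": float(np.percentile(r, 75)),
        "max": float(r.max()),
    }

# usage (hypothetical inputs):
#   ratios = [relative_max_load(s, c) for s, c in zip(standalone_loads, comb_loads)]
#   print(boxplot_summary(ratios))
```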

Topology     Path   Edge   Ingress
Internet2    0.87   0.87   0.54
Geant        1.49   1.25   0.55
Enterprise   1.02   1.02   0.54
AS1221       1.33   1.33   0.54
AS3257       0.68   0.68   0.55
AS1239       1.26   1.26   0.55

Table 6: Relative size of the largest CoMb box. A higher value here means that the standalone case needs a larger box compared to CoMb.

Topology     #PoPs   Strawman-LP time (s)   hyperapp time (s)
Internet2    11      687.68                 0.05
Geant        22      3455.28                0.24
Enterprise   23      2371.87                0.25
AS3257       41      1873.32                0.78
AS1221       44      3145.77                1.08
AS1239       52      9207.78                1.58

Table 7: Time to compute the optimal solution

Topological constraints: Next, we consider a scenario in which some applications cannot be spatially distributed. Specifically, we constrain Cache and the Load balancer to run only at the ingress of each path. One option in this case is to pin all middlebox applications to the ingress, exploiting reuse but ignoring spatial distribution. While CoMb provides non-trivial savings (1.4×) even in this case, we explore opportunities for further benefits. To this end, we extend the formulation in Section 4.3 to perform a two-step optimization. In the first step, we assign the topologically constrained applications to their required locations. In the second, we assign the remaining applications, which can be distributed, as in Section 4.3 with a slight twist: we reduce the hyperapp footprints on locations where they can reuse modules with the constrained applications. For example, if we have the hyperapp Cache-IDS, with Cache pinned to the ingress, we reduce the IDS footprint on the ingress. Table 5 shows that this two-step procedure improves the savings by 20–30% compared to an ingress-only solution.
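A rough, self-contained sketch of this two-step procedure is shown below (Python; the application names, footprint numbers, the reuse-discount table, and the greedy placement in step 2 are illustrative stand-ins for the Section 4.3 LP, not the actual formulation or configuration):

```python
# Illustrative per-unit-traffic footprints and module-reuse discounts (made up).
FOOTPRINT = {"cache": 1.0, "lb": 0.5, "ids": 2.0, "flowmon": 0.3}
SHARED = {("ids", "cache"): 0.4}      # e.g., IDS reuses a session module installed by Cache
CONSTRAINED = {"cache", "lb"}         # apps that must run at the ingress

def two_step(classes):
    """Step 1: pin topologically constrained apps to each class's ingress.
    Step 2: place the remaining apps along the path, discounting their
    footprint at the ingress where they can reuse the constrained apps' modules."""
    placement = {}                    # (class_id, app) -> {node: fraction of traffic}
    for cid, (path, volume, apps) in classes.items():
        ingress = path[0]
        free_apps = []
        for app in apps:
            if app in CONSTRAINED:
                placement[(cid, app)] = {ingress: 1.0}
            else:
                free_apps.append(app)
        # Step 2: greedy stand-in for the Section 4.3 optimization.
        load = {node: 0.0 for node in path}
        for app in free_apps:
            cost = {}
            for node in path:
                f = FOOTPRINT[app]
                if node == ingress:
                    f -= sum(SHARED.get((app, c), 0.0) for c in apps if c in CONSTRAINED)
                cost[node] = load[node] + max(f, 0.0) * volume
            best = min(cost, key=cost.get)
            placement[(cid, app)] = {best: 1.0}
            load[best] = cost[best]
    return placement

# Example: one class entering at N1 and traversing N1-N2-N3.
classes = {"c1": (["N1", "N2", "N3"], 10.0, ["cache", "lb", "ids", "flowmon"])}
print(two_step(classes))
```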

Does CoMb need bigger boxes? A final concern is that consolidation may require "beefier" boxes (e.g., in the network core). To alleviate this concern, Table 6 compares the processing capacity of the largest standalone box needed across the network to that of the largest CoMb box: Largest_standalone / Largest_CoMb. We see that the largest standalone box is actually larger than the largest CoMb box for many topologies. Even without distribution (the Ingress column), the largest CoMb box is only 1/0.55 ≈ 1.8× the size of the largest standalone box, which is quite manageable.

7.4 CoMb controller performance

Last, we focus on the performance of the network controller and address two concerns: (1) Is the optimization fast enough to respond to traffic dynamics (on the order of minutes)? and (2) How close to the theoretical optimal is the reformulation from Section 4.3?

Table 7 shows the time to run the optimization from Section 4 using the CPLEX LP solver on a single core of an Intel Xeon 3.2 GHz CPU. To put our reformulation in context, we also show the time to solve an LP-relaxation of the strawman. The reformulation is four orders of magnitude faster than even this relaxed strawman, and takes 1.58 s to recompute network-wide configurations for a 52-node topology. Given that we expect a controller to recompute configurations on the order of a few minutes [13], this is quite reasonable.

We also measured the optimality gap between the LP-relaxation and the reformulation over a range of scenarios. Because the LP-optimal is a lower bound on the true optimal solution, this gap is actually an upper bound on the true optimality gap. Across all topologies, this upper bound is ≤ 0.19% for load balancing and ≤ 0.1% for provisioning (not shown). Thus, our reformulation provides a tractable, yet near-optimal, alternative.
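To spell out why the measured gap is an upper bound (generic symbols, not the paper's notation): for a minimization objective, let LP be the LP-relaxation optimum, OPT the true optimum of the original formulation, and REF the objective value achieved by the reformulation. Then LP ≤ OPT ≤ REF, and therefore

\[
\frac{\mathrm{REF} - \mathrm{OPT}}{\mathrm{OPT}} \;\le\; \frac{\mathrm{REF} - \mathrm{LP}}{\mathrm{LP}},
\]

so the gap computed against the LP-relaxation can only overestimate the true optimality gap.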

7.5 Summary of key results

To summarize, our evaluations show that CoMb:
• has significant opportunities for reuse across applications (Table 3);
• imposes minimal overhead for running middlebox applications (Table 4);
• has 5× better throughput vs. virtualized middleboxes (Figure 10);
• reduces the provisioning cost 1.8–2.5× for a range of real network settings (Figure 12);
• reduces the maximum load 2–25× (Figure 14);
• does not need much larger hardware (Table 6); and
• has a practical and efficient network-wide controller (Section 7.4).

8 Discussion

Asymmetric paths: For session-level processing, each middlebox must see both directions of the session. Thus, for traffic classes with asymmetric (or multiple) paths, we constrain the distribution only to the nodes common to both directions; e.g., just the edge or the ingress case.

Placement constraints: From Table 1, we speculate that roughly half the applications can be distributed (except WAN optimizers, VPN gateways, and possibly load balancers). We sketched and evaluated a scenario in Section 7.3 in which half the applications are topologically constrained, and showed that CoMb still provides substantial savings. As future work, we plan to explore a detailed configuration in collaboration with the enterprise operators.

Business concerns: At first glance, CoMb appears to change business models for vendors. The reality, however, is that other factors (e.g., cloud computing) are already causing them to release "virtual appliances" [7]. Also note that CoMb's general design allows vendors to innovate both at the platform and application level.

9 Related Work

Integrating middleboxes: Previous work discusses how to better expose middleboxes to administrators (e.g., [6, 22]). CoMb focuses on the orthogonal problem of consolidating middlebox deployments.

Middlebox measurements: Studies have measured the end-to-end impact of middleboxes [17] and their interactions with transport protocols [27]. There are few studies on how middleboxes are deployed and managed. Our measurements in Section 2 and the high-level opportunities in Section 3 appear in an upcoming workshop paper [39]. This work goes beyond that motivation to demonstrate a practical design and implementation, and quantifies the single-box and network-wide benefits of a consolidated middlebox architecture.

General-purpose network elements: There are many efforts to build commodity routers and switches using x86 CPUs [26, 32, 15], GPUs [20], and merchant switch silicon [19]. CoMb can exploit these advances in hardware design as well. It is worth noting that the resource management challenges we address in CoMb also apply to these efforts, if the extensibility they enable leads to diversity in traffic processing.

Rethinking middlebox design: CoMb shares the motivation of rethinking middlebox design with FlowStream [15] and xOMB [18]; these efforts further confirm the significance of this problem space. FlowStream presents a high-level architecture that uses OpenFlow for policy routing and runs each middlebox as a separate VM [15]. Section 7.2 shows that VM-based middleboxes have much lower throughput; further, a VM approach precludes opportunities for reuse. xOMB presents a software model for extensible middleboxes [18]. The key difference is that CoMb addresses the network-wide and platform-level resource management challenges that arise with consolidation, which neither FlowStream nor xOMB seeks to address. CoMb also provides a more general management framework to support both modular and standalone middlebox functions.

Network management: CoMb's controller follows in the spirit of efforts showing the benefits of centralization in routing, access control, and monitoring (e.g., [14, 33, 24, 23]). The use of optimization arises in other management applications such as traffic engineering and monitoring (e.g., [35]). However, the reuse and policy dependencies that arise when consolidating middlebox management create optimization and management challenges unique to our context.

10 Conclusions

We presented a new middlebox architecture called CoMb, which systematically applies the design principle of consolidation, both in building individual appliances and in managing an ensemble of these across a network. In addition to the qualitative benefits w.r.t. extensibility, ease of management, and reduction in device sprawl, consolidation provides new opportunities for resource savings via application multiplexing, software reuse, and spatial distribution. We addressed the key resource management and implementation challenges in order to leverage these benefits in practice. Using a prototype implementation in Click, we show that CoMb reduces the network provisioning cost by up to 2.5×, decreases the load skew by up to 25×, and imposes minimal overhead for running middlebox applications.

11 References

[1] Snort. http://www.snort.org.
[2] Astaro security gateway. http://bit.ly/o6FX0m.
[3] Big-IP hardware datasheet. http://bit.ly/Yv3Pc.
[4] Crossbeam network consolidation. http://bit.ly/qlotDK.
[5] Intel 82599 10 Gigabit Ethernet. http://bit.ly/qSOOcJ.
[6] Middlebox Communications (MIDCOM) Protocol Semantics. RFC 3989.
[7] Silver Peak software WAN optimization. http://bit.ly/nCBRst.
[8] SR-IOV. http://bit.ly/fnpry4.
[9] Untangle. www.untangle.com.
[10] World enterprise network security markets. http://bit.ly/gYW4Us.
[11] S. Boyd-Wickizer et al. An Analysis of Linux Scalability to Many Cores. In Proc. OSDI, 2010.
[12] A. Anand et al. Packet Caches on Routers: The Implications of Universal Redundant Traffic Elimination. In Proc. SIGCOMM, 2008.
[13] A. Feldmann et al. Deriving Traffic Demands for Operational IP Networks: Methodology and Experience. In Proc. SIGCOMM, 2000.
[14] A. Greenberg et al. A Clean Slate 4D Approach to Network Control and Management. ACM SIGCOMM CCR, 35(5), Oct. 2005.
[15] A. Greenhalgh et al. Flow Processing and the Rise of Commodity Network Hardware. ACM SIGCOMM CCR, Apr. 2009.
[16] A. Shieh et al. SideCar: Building Programmable Datacenter Networks without Programmable Switches. In Proc. HotNets, 2010.
[17] M. Allman. On the Performance of Middleboxes. In Proc. IMC, 2003.
[18] J. Anderson and A. Vahdat. xOMB: eXtensible Open MiddleBoxes. Unpublished manuscript.
[19] G. Lu et al. ServerSwitch: A Programmable and High Performance Platform for Data Center Networks. In Proc. NSDI, 2011.
[20] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: A GPU-Accelerated Software Router. In Proc. SIGCOMM, 2010.
[21] D. Joseph and I. Stoica. Modeling Middleboxes. IEEE Network, 2008.
[22] D. A. Joseph, A. Tavakoli, and I. Stoica. A Policy-aware Switching Layer for Data Centers. In Proc. SIGCOMM, 2008.
[23] M. Caesar et al. Design and Implementation of a Routing Control Platform. In Proc. NSDI, 2005.
[24] M. Casado et al. SANE: A Protection Architecture for Enterprise Networks. In Proc. USENIX Security, 2006.
[25] M. Dobrescu et al. Managing Resource Contention in Software-based Networking (Mostly by Ignoring It). Under submission.
[26] M. Dobrescu et al. RouteBricks: Exploiting Parallelism to Scale Software Routers. In Proc. SOSP, 2009.
[27] M. Honda et al. Is It Still Possible to Extend TCP? In Proc. IMC, 2011.
[28] M. Mitzenmacher et al. The Power of Two Random Choices: A Survey of Techniques and Results. Handbook of Randomized Computing, 2000.
[29] M. Vallentin et al. The NIDS Cluster: Scalable, Stateful Network Intrusion Detection on Commodity Hardware. In Proc. RAID, 2007.
[30] M. Dobrescu et al. Controlling Parallelism in Multi-core Software Routers. In Proc. PRESTO, 2010.
[31] R. Morris, E. Kohler, J. Jannotti, and M. F. Kaashoek. The Click Modular Router. SIGOPS Operating Systems Review, 33(5):217–231, 1999.
[32] N. Egi et al. Towards High Performance Virtual Routers on Commodity Hardware. In Proc. CoNEXT, 2008.
[33] N. Gude et al. NOX: Towards an Operating System for Networks. ACM SIGCOMM CCR, July 2008.
[34] V. Paxson. Bro: A System for Detecting Network Intruders in Real-Time. In Proc. USENIX Security Symposium, 1998.
[35] M. Roughan. Robust Network Planning. Chapter 5, Guide to Reliable Internet Services and Applications.
[36] M. Roughan. Simplifying the Synthesis of Internet Traffic Matrices. ACM SIGCOMM CCR, 35(5), 2005.
[37] T. Benson et al. Demystifying Configuration Challenges and Trade-offs in Network-based ISP Services. In Proc. SIGCOMM, 2011.
[38] V. Sekar et al. cSamp: A System for Network-Wide Flow Monitoring. In Proc. NSDI, 2008.
[39] V. Sekar et al. The Middlebox Manifesto: Enabling Innovation in Middlebox Deployments. In Proc. HotNets, 2011.
[40] Y. Ma et al. Leveraging Parallelism for Multi-dimensional Packet Classification on Software Routers. In Proc. SIGMETRICS, 2010.
[41] Z. Huang. I/O Virtualization Performance. http://bit.ly/qPUPeq.

A Suboptimality of hyperapp model

Figure 15: Example to illustrate why the hyperapp model may be suboptimal. Each action has a unit footprint F.

Figure 15 shows an example where the hyperapp model might be suboptimal. Here, N1 and N2 each have to process some traffic that cannot be offloaded, but the traffic on the path N1-N2 can be distributed. In the general model from Section 4, N1 and N2 require 6 resource units each. With this provisioning, they can handle the traffic across the three epochs. At time T=1, N1 uses all of its 6 resource units for the local traffic; similarly, at T=2, N2 uses its resources for local traffic. The traffic from N1 to N2 is assigned to N2 at T=1 and to N1 at T=2. At T=3, the slight increase in the local traffic means that it is better to run A1 and A2 on different nodes; in other words, in this example, duplicating the common action might be better. The maximal model also requires 6 units each at N1 and N2 to handle the traffic at T=1 and T=2. At T=3, however, there is a problem: because the maximal model constrains A1 and A2 to the same node, it is forced to process the traffic on N1-N2 entirely at N1 or at N2. Thus, it requires 3 additional resource units.
