
Ally: OS-Transparent Packet Inspection using Sequestered Cores

Jen-Cheng Huang, Matteo Monchiero, Yoshio Turner, Hsien-Hsin S. Lee

HP Laboratories
HPL-2011-124

Keyword(s): multicore; packet inspection; isolation; computer architecture; multicore partitioning

Abstract: This paper presents Ally, a server platform architecture that supports compute-intensive management services on multicore processors. Ally introduces simple hardware mechanisms to sequester cores to run a separate software environment dedicated to management tasks, including packet processing software appliances (e.g. for Deep Packet Inspection, DPI), with efficient mechanisms to safely and transparently intercept network packets. Ally enables distributed deployment of compute-intensive management services throughout a datacenter. Importantly, it uniquely allows these services to be deployed independent of the arbitrary OSs and/or hypervisor that users may choose to run on the remaining cores, with hardware isolation preventing the host environment from tampering with the management environment. Experiments using full system emulation and a Linux-based prototype validate Ally functionality and demonstrate low-overhead packet interception; e.g., using Ally to host the well-known Snort packet inspection software incurs less overhead than deploying Snort as a Xen virtual machine appliance, resulting in up to a 2x improvement in throughput for some workloads.

External Posting Date: August 21, 2011 [Fulltext]  Approved for External Publication
Internal Posting Date: August 21, 2011 [Fulltext]

Copyright 2011 Hewlett-Packard Development Company, L.P.


Ally: OS-Transparent Packet Inspection Using Sequestered Cores

Jen-Cheng Huang* Matteo Monchiero† Yoshio Turner‡ Hsien-Hsin S. Lee*

*Georgia Institute of Technology    †Intel Labs    ‡HP Labs

ABSTRACT

This paper presents Ally, a server platform architecture that supports compute-intensive management services on multicore processors. Ally introduces simple hardware mechanisms to sequester cores to run a separate software environment dedicated to management tasks, including packet processing software appliances (e.g. for Deep Packet Inspection, DPI), with efficient mechanisms to safely and transparently intercept network packets. Ally enables distributed deployment of compute-intensive management services throughout a datacenter. Importantly, it uniquely allows these services to be deployed independent of the arbitrary OSs and/or hypervisor that users may choose to run on the remaining cores, with hardware isolation preventing the host environment from tampering with the management environment. Experiments using full system emulation and a Linux-based prototype validate Ally functionality and demonstrate low-overhead packet interception; e.g., using Ally to host the well-known Snort packet inspection software incurs less overhead than deploying Snort as a Xen virtual machine appliance, resulting in up to a 2x improvement in throughput for some workloads.

1. INTRODUCTION

Packet processing services like Deep Packet Inspection (DPI) are used in datacenters for functions including intrusion detection, content insertion, performance monitoring, traffic classification, and flow management [1]. These services are provided by specialized appliances deployed at the boundary (e.g. wide-area network gateway) between the datacenter and the external network to process traffic entering or exiting the datacenter.

This approach is poorly suited to processing traffic that remains local to the datacenter, even though local traffic is growing in importance. To exploit economies of scale, modern datacenters consolidate many interconnected applications and services. This trend is clear for both private Enterprise datacenters and public cloud computing datacenters, which support a fast-changing multitude of mutually untrusted tenants. These models call for packet processing services to be flexibly placed on-demand throughout a datacenter rather than just at the external boundary. An attractive approach would co-locate packet processing with user applications throughout a datacenter [2], enabling packet processing services to leverage the abundant compute and memory resources on commodity servers.

This paper presents Ally, a server architecture that provides the basic building block for distributed packet processing services. Ally adds hardware partitioning to a server, enabling it to run two parallel and independent software stacks – one stack for user applications and OS or hypervisor, and a second software stack for packet processing. Ally also adds processor support for OS-transparent network packet interception by the packet processing stack. Finally, Ally reuses the existing management network interface on the server for datacenter administrators to deploy and control packet processing services.

As we enter the era of processors with a large number of inexpensive cores, like the Intel Many Integrated Core (MIC) architecture [3] which integrates more than 50 lightweight cores, the approach of Ally is to partition a processor into a "privileged" set of cores that execute packet processing services, and an "unprivileged" set of cores that execute user applications and operating systems. Completely separate software stacks can run in parallel in the two partitions. Ally's hardware extensions are intentionally minimal. They are invoked only on low-frequency I/O actions (interrupts and memory-mapped I/O accesses), with no impact on other computation and negligible impact on processor clock speed and power consumption.

Using multicore partitioning for packet processing has important advantages over the alternative approaches of enhancing NICs or deploying packet processing in hypervisor-hosted virtual machines (VMs). Compared to enhancing the NIC, using CPU cores allows new functionality to be provided in software without hardware changes, and can leverage the large capacity of server memory and the high performance of modern processors. General-purpose cores may also be used in future processors to control on-chip hardware accelerators like GPUs or regular expression engines. Moreover, a CPU-based approach is not limited to packet processing but could also support management services related to storage, power, etc. Compared to using VMs for isolation, Ally's hardware partitioning may be more reliable by avoiding resource and fate sharing, and by presenting a narrower attack surface than the full set of hypercall APIs. Hardware partitioning also preserves existing software generality instead of forcing datacenters to use a common hypervisor on every server. Ally preserves and extends existing server management interfaces and capabilities and supports both virtualized and non-virtualized OSs.

We carried out functional validation of the Ally hardware extensions using the QEMU [4] full system emulator. To evaluate Ally's performance, we built a software prototype that modifies the Linux kernel to emulate Ally hardware functions. Using the well-known Snort packet inspection software, our experimental results show that Ally+Snort has acceptably low overhead for packet interception. For the workloads we studied, Ally achieves from 56% to 100% higher throughput than a system that uses a Xen hypervisor driver domain to transparently intercept and inspect packets.

The rest of the paper is organized as follows. Section 2 describes our assumptions about the platform architecture and the NIC-OS model. Section 3 describes the trust and threat model motivating the design of Ally. Section 4 describes the Ally architecture, including core sequestering, memory protection, and packet interception mechanisms. Section 5 presents the experimental evaluation. Section 6 discusses related work, and Section 7 concludes.

2. ASSUMPTIONS

This section describes our baseline architecture and reviews the interaction between an OS and a NIC.

2.1 Platform Architecture

Figure 1 is a simplified block diagram of a generic platform architecture, built around a multicore processor with many cores sharing a Last Level Cache (LLC). Cores are logically grouped into a privileged partition and an unprivileged partition. Privileged cores run management applications, while unprivileged cores run the OS/hypervisor and user applications. As we focus on DPI as a primary use case, we often refer to the cores in the privileged partition as DPI cores and the cores in the unprivileged partition as OS cores.

Each core has a Memory Management Unit (MMU) for virtual-to-physical address translation, and a local interrupt controller (Local APIC). An integrated Northbridge has a Memory (DRAM) Controller, a PCIe I/O controller, an Interrupt Unit to route interrupt signals to the proper Local APIC from the IOAPIC or through Message Signaled Interrupts (MSI), an I/O Memory Management Unit (IOMMU), and a point-to-point high-speed link (e.g., Intel DMI) connecting to a Platform Controller Hub (PCH).

Figure 1: Target platform architecture

A service processor (e.g. Intel iAMT [5] or HP iLO [6]) is used to manage the platform. The service processor interfaces to the CPU via the PCH. The service processor has a dedicated network interface, typically attached to a dedicated management network. With Ally, datacenter administrators use this management network to deploy packet processing services in the privileged partition.

2.2 OS-NIC Interaction

We describe OS-NIC interaction using the popular Intel Pro/1000 NIC. The same description applies to most modern NICs with only minor differences.

The NIC has transmit/receive buffers and a set of device registers mapped into the system memory address space. One register points to a transmit descriptor queue and another to a receive descriptor queue. Both queues reside in host main memory. Each descriptor contains a pointer to packet data to be transmitted or to a receive buffer. For each queue, memory-mapped Head Pointer and Tail Pointer registers record the head and tail position in the queue as buffers are posted and as packets are received and transmitted.

The OS advances the Tail Pointer to notify the NIC that some descriptors have been posted and await being consumed (either transmitted or received). The NIC uses Direct Memory Access (DMA) to transfer descriptors and packets between the NIC buffers and main memory. Once a transfer has completed, the NIC updates the Head Pointer and raises a completion interrupt. Most NICs use interrupt coalescing timers to reduce interrupt frequency and the associated software processing overheads by waiting for a batch of packet I/Os to complete before signaling an interrupt. The OS may determine which I/Os have completed by reading the Head Pointer or by inspecting the status of the queued descriptors.
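To make the handshake concrete, the following C sketch gives a driver's-eye view of a transmit descriptor ring. It is a minimal illustration of the head/tail protocol just described, under our own names (tx_desc, post_packet, and so on); it is not the actual Intel Pro/1000 register layout or driver code.

    /* Minimal sketch of the descriptor-queue handshake; illustrative
     * names, not the real Intel Pro/1000 layout. */
    #include <stdint.h>

    #define RING_SIZE 256

    struct tx_desc {
        uint64_t buf_addr;            /* DMA address of the packet data */
        uint16_t length;              /* bytes to transmit */
        uint8_t  status;              /* written by the NIC on completion */
    };

    struct tx_ring {
        struct tx_desc desc[RING_SIZE];
        volatile uint32_t *head_reg;  /* MMIO Head Pointer: advanced by NIC */
        volatile uint32_t *tail_reg;  /* MMIO Tail Pointer: advanced by OS */
        uint32_t next_free;           /* driver-side cursor */
    };

    /* OS side: post one packet, then notify the NIC by bumping the tail. */
    static void post_packet(struct tx_ring *r, uint64_t dma_addr, uint16_t len)
    {
        struct tx_desc *d = &r->desc[r->next_free];
        d->buf_addr = dma_addr;
        d->length = len;
        d->status = 0;
        r->next_free = (r->next_free + 1) % RING_SIZE;
        *r->tail_reg = r->next_free;  /* MMIO write: descriptors await the NIC */
    }

    /* OS side: read the Head Pointer; descriptors behind it have completed
     * and their buffers can be recycled. */
    static uint32_t completed_up_to(struct tx_ring *r)
    {
        return *r->head_reg;
    }

A receive ring works symmetrically: the OS posts empty buffers by advancing the tail, and the NIC advances the head as packets arrive.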

3. TRUST AND THREAT MODEL

The trusted entities of an Ally-based system are the hardware, the firmware (BIOS), and the software running on the privileged (DPI) cores. We assume that the BIOS FLASH chips include hardware protection against re-writing by OSs. For maximum security, the firmware and the privileged cores' software are not directly exposed to the external network. Instead, a dedicated management channel is provided to configure the BIOS and to deploy services on the DPI cores, including platform-specific device drivers. This channel uses a distinct network interface connected to a management network via a service processor, as shown in Figure 1. Additionally, the DPI cores can use this channel to raise management alerts when suspicious traffic is detected. Each hardware extension in Ally is controlled through memory-mapped I/O (MMIO) configuration registers accessible only to the privileged cores and exposed to the datacenter via the management channel.

Ally can be used to deploy a DPI engine to protect against transmission and reception of malicious network traffic for the workloads running on unprivileged cores. As described in Section 4.3, Ally offers configurable mechanisms for packet interception. This allows administrators to select the degree of protection provided depending on the perceived threat level of the workloads running in the unprivileged domain, allowing a configurable trade-off between protection and cost/performance.

Denial-of-service attacks, in which software on unprivileged cores guesses the Local APIC ID of a DPI core and generates corresponding interrupts, can be avoided by having the DPI cores configure their Interrupt Descriptor Tables to ignore all interrupts from non-authorized devices.

4. ARCHITECTURE

Ally provides an ensemble of hardware, firmware, and software mechanisms. Ally creates two partitions: a privileged partition for DPI cores, and an unprivileged partition for OS cores. Physical memory is split in two separate regions, one for each partition, and independent software stacks boot in each partition. Memory protection is extended to prevent OS cores from reading or writing the memory region of the DPI cores, while allowing DPI cores in the privileged partition to access OS memory. Simple mechanisms enable DPI cores to intercept packets exchanged between OS cores and NIC.

Table 1: Hardware/Firmware Modifications

  Unit            Modification
  OS cores' MMU   Prevent memory accesses to DPI memory from OS cores
  IOMMU           Prevent non-authorized DMA to DPI memory
  IDT             Prevent non-authorized interrupts from OS cores to DPI cores
  BIOS            Boot DPI appliance and hide DPI cores/memory
  Interrupt Unit  Redirect NIC interrupts to DPI cores
  OS cores' MMU   Redirect MMIO register accesses to DPI memory
  All units       Extra MMIO registers to configure Ally functionalities

Table 1 summarizes the hardware modifications. Ally's approach of using hardware partitioning leverages the current trend toward very high core count processors. Some cores can be assigned to packet processing on behalf of other cores, enabling better utilization of processor resources. Using Ally, a datacenter manager can adjust the number of sequestered cores at boot time depending on system load and the computational intensity of the DPI engine.

4.1 Core Sequestration

Core sequestration enables distinct software environments to be booted in each partition. Conventional Intel multiprocessor (MP-compliant) systems use a standard boot procedure at startup. Ally works with this boot procedure to launch the DPI environment on the DPI cores and to conceal the DPI cores from the OS cores.

In the conventional procedure, one core (the bootstrap processor, or BSP) wakes up the other cores (the application processors, or AP cores). Each AP core runs code loaded from the BIOS which executes a self-test and enters the core's unique identifier (Local APIC ID) into the ACPI Multiple APIC Description Table (MADT) and the MP configuration table. The OS uses the MADT and the MP configuration table to find all information needed to discover and communicate with a core. After initialization, each AP core halts and waits for an Inter-Processor Interrupt (IPI) to resume execution.

Ally uses a modified BIOS that selects some AP cores as DPI cores and loads a custom initialization procedure onto each DPI core. The custom procedure deletes the DPI core's entry from the MADT and the MP configuration table, thus hiding the DPI core from the OS cores. The DPI cores then load the custom Interrupt Descriptor Table and begin to load the DPI software environment.
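To illustrate the hiding step, the sketch below removes a core's Local APIC entry from the MADT. The offsets follow the ACPI specification (a 44-byte MADT header followed by variable-length subtables, with type 0 denoting a Processor Local APIC entry), but the helper itself is our hypothetical reconstruction, not the paper's BIOS code.

    /* Hedged sketch of the MADT edit: splice out the Local APIC entry
     * of a sequestered core so the OS never enumerates it. */
    #include <stdint.h>
    #include <string.h>

    struct acpi_table_header {
        char     signature[4];   /* "APIC" for the MADT */
        uint32_t length;         /* total table length in bytes */
        /* ...checksum and other header fields omitted... */
    };

    struct madt_subtable {
        uint8_t type;            /* 0 = Processor Local APIC */
        uint8_t length;
        uint8_t acpi_cpu_id;
        uint8_t apic_id;         /* Local APIC ID of the core */
        /* flags follow */
    } __attribute__((packed));

    static void madt_hide_core(struct acpi_table_header *madt, uint8_t dpi_apic_id)
    {
        uint8_t *p   = (uint8_t *)madt + 44;   /* MADT header is 44 bytes */
        uint8_t *end = (uint8_t *)madt + madt->length;

        while (p < end) {
            struct madt_subtable *s = (struct madt_subtable *)p;
            uint8_t sublen = s->length;
            if (sublen == 0)
                break;                          /* malformed table; bail out */
            if (s->type == 0 && s->apic_id == dpi_apic_id) {
                memmove(p, p + sublen, (size_t)(end - (p + sublen)));
                madt->length -= sublen;
                /* the table checksum must be recomputed after this edit */
                return;
            }
            p += sublen;
        }
    }

The MP configuration table would be edited analogously; both edits happen before the OS ever parses the tables.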

The DPI cores interact with the built-in service processor to request the most up-to-date DPI application from a remote management server. The service processor downloads an executable image via the management network into DPI memory for the DPI cores to execute.

Figure 2: Modified Memory Management Unit. Ally uses a modified TLB Miss Handler to protect against unauthorized memory accesses

4.2 Memory Protection

Ally splits the physical address space into two regions identified by a simple range check, i.e., checking whether physical addresses are above or below a dividing boundary line. Ally adds a memory-mapped I/O (MMIO) configuration register to the MMU of each core that stores the boundary address between DPI and OS memory. This range check is much simpler than adding a new level of address translation, and is fully compatible with the existing address translation hierarchy, including nested or extended page tables.

To prevent OS cores from accessing DPI memory, the MMU needs a simple address check on each hardware page table walk, to verify that the translated physical address falls in an accessible region for the core that caused the access. The modified TLB Miss Handler (TMH), shown in Figure 2, verifies that any physical address being loaded into the TLB on an OS core is not in the range of DPI memory. This address check has very low overhead on the performance of the system. It has no overhead on TLB hits, since it happens only for TLB misses. In that case, the address comparison is likely to take a fraction of a cycle, a negligible overhead for a TLB miss, which may have orders of magnitude higher latency.
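The check itself is only a comparison. A behavioral sketch in C, using our own names for the boundary register and the handler hook (the paper specifies only the per-core MMIO register and the walk-time comparison):

    /* Behavioral sketch of the modified TLB Miss Handler's range check;
     * names are illustrative, not RTL. */
    #include <stdint.h>
    #include <stdbool.h>

    /* Physical addresses at or above the boundary belong to DPI memory;
     * in hardware this value comes from the per-core MMIO register. */
    static uint64_t dpi_boundary;

    /* Called after the hardware page-table walk resolves 'paddr'. */
    static bool tmh_check_fill(uint64_t paddr, bool core_is_privileged)
    {
        if (!core_is_privileged && paddr >= dpi_boundary)
            return false;   /* suppress the TLB fill and raise a fault */
        return true;        /* install the translation as usual */
    }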

Ally uses a similar check in the I/O Memory Management Unit (IOMMU) to prevent non-authorized devices from performing DMA writes to DPI memory. Both checks are also used to protect DPI memory in x86 real address mode (usually only needed early in the boot process).

4.3 Packet Interception

Ally enables DPI cores to transparently intercept, examine, and potentially modify packets in both transmit and receive directions between OS and NIC. Packet interception is accomplished by virtualizing the NIC descriptor queues. Ally maintains a copy of each descriptor queue in DPI memory and configures the NIC to use these queues (DPI queues) instead of the descriptor queues in OS memory (OS queues) by configuring the base pointers of the descriptor queues. The DPI cores are thus responsible for synchronizing the DPI queues with the OS queues.

DPI cores inspect packets referenced by the enqueued descriptors. Once packet processing is done, DPI cores transfer descriptors between the virtual descriptor queues seen by the device driver on OS cores and the real NIC descriptor queues.

The DPI cores interpose between the OS and the NIC. This interposition is logically transparent to both the OS (or hypervisor) running on the OS cores and the NIC, because the usage of the descriptor queue is device dependent, not OS dependent. Thus, no modification is needed to the OS (including the device driver) or the NIC.

The DPI cores download modified NIC drivers via the service processor. The modifications are mostly on the transmit and receive paths, which previous work has shown comprise only a small portion of the driver code, and which can even be separated out using an automated process [7] [8]. Modified drivers would likely be provided by the server vendor to increase the value of their server management components. Development costs would be reasonable, since each vendor of datacenter-class servers ships only a small set of NIC models to enjoy large volume cost savings and simplify certification.

Since an OS interacts with a NIC through interrupts and MMIO operations, Ally provides configurable redirection of these operations to DPI cores for I/O interception. For interrupt redirection, a DPI core specifies which NIC interrupts (IOAPIC and MSI) are to be redirected, and the modified Interrupt Unit in the Northbridge steers the selected interrupts to DPI cores. The software on the DPI cores uses Inter-Processor Interrupts (IPIs) to mimic NIC interrupts back to the OS.

For MMIO redirection, any MMIO access performed by an OS core and directed to the NIC descriptor queue registers must be redirected to a reserved area of DPI memory. This leaves the DPI cores in control of the real NIC registers. The accesses that need to be redirected include accesses to the base pointers of the descriptor queues and the Tail/Head registers of each queue. In addition, accesses to interrupt-related registers must also be redirected. This allows the DPI cores to control the interrupt rate based on the packet processing speed and allows the OS cores to transparently use interrupt mitigation techniques like Linux NAPI. Overall, fewer than ten MMIO registers should need to be redirected.

Ally enhances the MMU and TLB to provide MMIO access redirection. Figure 3 shows an efficient implementation. All MMIO remappings are specified in a Redirection Table in DPI memory. This table is small, requiring less than ten entries per NIC. Each entry specifies an MMIO address and the translated address of the redirected location. On an OS core TLB miss, the TLB Miss Handler (TMH) performs the usual page table walk. Since MMIO pages are by definition uncacheable, the TMH checks if the resulting physical address refers to an uncacheable page (and hence potentially an MMIO page). If the page is uncacheable, then the TMH checks the Redirection Table to see if any address in the table belongs to the faulting page. If so, the TMH sets a new "redirection" bit in the TLB entry.

Figure 3: Address redirection flow, involving TLB redirection and access to the Redirection Table cached in the Last Level Cache

A Redirection Table lookup also occurs when an OS core performs a memory access that results in a TLB hit to an entry that has the redirection bit set. The lookup determines if the specific address that was accessed has a remapping. If a remapping is found, the MMU uses the remapped address as the translated address, and raises an IPI to notify a DPI core of the access attempt by the OS core.

For normal memory accesses, the redirection bit is not set, incurring no extra overhead. For the minority of accesses that need a Redirection Table lookup, Ally uses the Last Level Cache (LLC) to cache the small Redirection Table, speeding up table lookup. Overall, Redirection Table lookup has negligible impact on silicon area, has no impact on normal memory accesses, and slightly increases the latency only of MMIO accesses, which comprise a small fraction of total memory accesses.
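The lookup can be sketched as a scan of a small table; the entry layout and function below are illustrative, since the paper fixes only the table's size bound and its placement in DPI memory.

    /* Hedged sketch of the Redirection Table consulted on OS-core
     * accesses whose TLB entry has the redirection bit set (Figure 3). */
    #include <stdint.h>
    #include <stddef.h>

    struct redir_entry {
        uint64_t mmio_addr;     /* NIC register address seen by the OS */
        uint64_t shadow_addr;   /* redirected location in DPI memory */
    };

    #define REDIR_ENTRIES 8     /* "less than ten entries per NIC" */
    static struct redir_entry redir_table[REDIR_ENTRIES];

    /* Returns the translated address, remapped if the access hits a
     * virtualized NIC register; 'notify' tells the MMU to IPI a DPI core. */
    static uint64_t redirect_access(uint64_t paddr, int *notify)
    {
        for (size_t i = 0; i < REDIR_ENTRIES; i++) {
            if (redir_table[i].mmio_addr == paddr) {
                *notify = 1;                    /* raise IPI to a DPI core */
                return redir_table[i].shadow_addr;
            }
        }
        *notify = 0;
        return paddr;                           /* untouched MMIO access */
    }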

We next describe the steps taken by DPI cores to intercept packets and virtualize the device queues. Since the real Head/Tail Pointer registers in the NIC refer to DPI queues, we call them DPI Head/Tail Pointers, while the virtual Head/Tail Pointers stored in the reserved DPI memory region refer to OS queues, so we call them OS Head/Tail Pointers.

Figure 4 illustrates the operation of the descriptor queues on packet reception. The OS preallocates descriptors in the OS queue. A DPI core copies the descriptors to the DPI queue and updates the DPI Tail Pointer, which is visible to the NIC, thus notifying the NIC that descriptors are available for the incoming packets. The NIC copies the received packets, completes the descriptors in the DPI queue, and updates the DPI Head Pointer. At this point the DPI processes the received packets. To allow the OS to consume the received descriptors, the DPI marks the descriptors as complete in the OS queue and updates the OS Head Pointer. Finally, the DPI sends an IPI to an OS core to notify it that reception is complete. The OS can thus proceed to consume the received packets.

In the case of packet transmission, the OS posts descriptors in its transmit queue and advances the Tail Pointer. Ally intercepts the write to the Tail Pointer, which is propagated not to the NIC but to the copy kept in DPI memory, i.e., the OS Tail Pointer. The OS core's MMU raises an IPI to a DPI core, enabling the DPI core to detect that the Tail Pointer has been updated. The DPI copies the newly posted descriptors from the OS queue to the DPI queue. It processes the packets referenced by the descriptors, and updates the DPI Tail Pointer. This Tail Pointer is visible to the NIC, which then fetches the descriptors from the DPI queue and also the corresponding packets. After transmission is complete, the DPI core marks the descriptors complete in the OS queue and raises an IPI to notify the OS core.
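The transmit-side synchronization can be summarized in a short sketch. All names are ours: os_ring is the virtual queue the driver sees, dpi_ring is the real queue programmed into the NIC, and inspect_packet() stands in for the DPI engine; handling of dropped packets (recycling their ring slots) is elided.

    /* Behavioral sketch of the DPI-core transmit path. */
    #include <stdint.h>

    #define RING_SIZE 256

    struct desc { uint64_t buf_addr; uint16_t len; uint8_t status; };

    struct queue_pair {
        struct desc *os_ring;          /* shadow queue in OS memory */
        struct desc *dpi_ring;         /* real queue in DPI memory */
        uint32_t os_tail;              /* virtualized Tail value from the OS */
        volatile uint32_t *dpi_tail;   /* real NIC Tail register (MMIO) */
        uint32_t synced;               /* next OS descriptor to copy */
    };

    int inspect_packet(uint64_t buf_addr, uint16_t len);  /* DPI verdict */

    /* Invoked from the IPI handler after the MMU redirects the OS core's
     * Tail Pointer write into DPI memory. */
    static void on_os_tail_update(struct queue_pair *q)
    {
        while (q->synced != q->os_tail) {
            struct desc *d = &q->os_ring[q->synced];
            inspect_packet(d->buf_addr, d->len);  /* examine before NIC sees it */
            q->dpi_ring[q->synced] = *d;          /* copy descriptor to real queue */
            q->synced = (q->synced + 1) % RING_SIZE;
        }
        *q->dpi_tail = q->synced;  /* one MMIO write makes the batch visible */
    }

Note that a single Tail update from the OS can cover many descriptors, so the MMIO write and the IPI are amortized over the whole batch.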

The packet interception mechanism is sufficient to protect low-threat workloads on OS cores from inadvertently reading uninspected packet data on the receive path. In order to maliciously transmit uninspected data, the OS core software would need to modify the packet data after inspection completes but before the data DMA to the NIC completes – a time window that cannot be predicted in general.

To provide higher levels of protection, Ally could operate in conjunction with slightly enhanced NIC hardware. Current NICs compute packet data checksums, and could be extended to write checksum values to the receive descriptor queue, and to verify DPI-computed checksum values read from the transmit descriptor queue, thereby enabling full detection of packet content tampering. Alternatively, a naive software-only solution to enhance protection for high-threat workloads is to copy packet data between OS and DPI memory, incurring higher overheads.

(a) Descriptors are made available by the OS in the OS descriptor queue. OS increases OS Tail Pointer
(b) DPI gets interrupted by the OS core, copies descriptors, and updates DPI Tail Pointer
(c) NIC copies received packets and descriptors, marking descriptors as completed in the DPI queue. NIC updates DPI Head Pointer. DPI processes packets
(d) DPI marks descriptors in the OS queue as completed. DPI updates OS Head Pointer. The OS can thus consume the packets and clean the OS queue

Figure 4: Receive: Evolution of the descriptor queues in Ally

4.4 User-Kernel Space Interaction

Ally supports the execution of a complete operating system on the DPI cores, with separate kernel-level and user-level execution modes. In this case, the DPI engine can be deployed as a user space application. This has several advantages in terms of programmability, allowing the use of standard libraries and debugging tools, but requires efficient communication between kernel space and user space.

Table 2: QEMU Results

                             TCP STREAM  TCP MAERTS  TCP RR
  Instructions/pkt                21.33       71.99  246.40
  Interrupts/pkt                   0.04        0.39    1.98
  Data movement (bytes/pkt)       19.34       31.88   63.05

The basic Ally services (packet interception) run in the kernel, while packet inspection is performed in user space. We developed an efficient software mechanism to deliver intercepted packets from kernel space to user space on the DPI cores, leveraging Ally's operating model. Since all code on DPI cores is trusted, the DPI cores can fetch packet data directly from the kernel-level packet buffers pointed to by the NIC TX and RX descriptors. This is accomplished by mapping the whole kernel space into the DPI address space (e.g. using /dev/mem in Linux). In addition, we use a socket-like mechanism (netlink in Linux) to communicate updates of the descriptor head and tail pointers to the DPI engine. For example, when the OS tries to transmit packets by updating the tail register of the virtual TX queue, the updated value is sent as a netlink message to the user space packet inspection software, which inspects all packets enqueued between the old and new value of the tail register.
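The user-space side of this channel can be sketched as follows. The netlink protocol number and message layout are assumptions made for illustration; the paper does not specify the prototype's wire format. Packet payloads themselves are read in place through the /dev/mem mapping rather than carried in the message.

    /* User-space sketch: receive tail-pointer updates over netlink.
     * NETLINK_ALLY and struct ally_msg are assumed, not the prototype's
     * actual definitions. */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    #define NETLINK_ALLY 25  /* assumed private netlink protocol number */

    struct ally_msg { uint32_t queue_id; uint32_t old_tail; uint32_t new_tail; };

    int main(void)
    {
        struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
        addr.nl_pid = getpid();
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ALLY);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("netlink");
            return 1;
        }
        char buf[4096];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n <= 0)
                break;
            struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
            struct ally_msg *m = NLMSG_DATA(nlh);
            /* inspect every descriptor in [old_tail, new_tail); packet
             * data is fetched zero-copy via the kernel-memory mapping */
            printf("queue %u: inspect descriptors [%u, %u)\n",
                   m->queue_id, m->old_tail, m->new_tail);
        }
        close(fd);
        return 0;
    }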

5. EVALUATION

We employ a dual strategy to evaluate Ally. First, we validate the functionality of the proposed hardware and firmware modifications by adding the Ally enhancements to an x86 full system emulator (QEMU [4]). Second, to assess the performance of Ally on real hardware, we built a Linux-based prototype. The prototype modifies Linux to emulate the key functionality of Ally needed for packet interception. The resulting system has performance that should approach the performance of real Ally hardware.

For both the QEMU-based and Linux-based prototypes, we use the Netperf micro-benchmarks (www.netperf.org) as the initial application workload running on an OS core. We use the two Netperf streaming tests, TCP STREAM for transmit and TCP MAERTS for receive, and the request-reply TCP RR test. In all cases, an additional machine running unmodified Linux was used as the network client. Finally, we evaluate Ally performance using SPECWeb2005, a more realistic workload scenario than Netperf in which application processing is significant relative to packet processing.

5.1 Full System Emulation

The QEMU-based system successfully boots unmodified Debian Linux onto a subset of the emulated CPUs, while retaining one core for exclusive use by the DPI engine. Through modifications of the BIOS of QEMU, Linux does not detect the existence of the reserved DPI core. In addition, the DPI core executes custom code loaded from the modified BIOS. This custom code intercepts all packets between Linux and the emulated Intel Pro/1000 Ethernet device in order to verify our queue virtualization technique. After intercepting each packet, the custom code simply forwards it to the NIC or the OS core. To provide insight into the basic cost of packet interception, the custom code on the DPI core was designed only to intercept packets and does not perform actual DPI processing in this initial test.

We extended QEMU to measure significant events for Ally. As shown in Table 2, we measured the baseline overhead of Ally in terms of the number of extra instructions, memory accesses, and interrupts per packet. By packet, we mean an Ethernet frame that is transmitted or received over the external network link. The size of a packet can range from the minimum Ethernet frame size to a full-size 1514-byte frame. All results shown in this table measure cost on the DPI core only. Using three Netperf benchmarks, the results indicate that Ally requires at most a few hundred instructions per packet, and tens of bytes of data movement for MMIO and receive descriptors. This overhead would typically be dwarfed by the processing required for an actual DPI application.

5.2 Linux-Based Prototype

To estimate the performance of Ally on real hardware instead of QEMU-emulated hardware, we extended Linux with kernel software modifications that emulate the Ally hardware extensions. We favor this evaluation approach over using an architectural simulator. This is because the additional latency of Ally's hardware operations such as MMIO redirection is small (Section 4.3), and so software emulation of these operations is likely to be conservative. In addition, architecture simulators are very slow, precluding accurate evaluation for realistic workloads having non-trivial runtimes. In contrast, our approach enables accurate evaluation of Ally performance for realistic workloads while fully incorporating the software and hardware complexity of a real system.

Our prototype extends Linux 2.6.28.7 running on an Intel Core 2 Duo (2GHz) and an Intel PRO/1000 gigabit Ethernet card. The prototype emulates core sequestration by pinning all the OS activity on one core with Linux CPUSETS, and the DPI environment on the other core. It is configured to direct NIC interrupts to the DPI core, and we manually inserted instructions into the NIC driver source code to generate IPIs when virtualized registers are accessed. Another change is for the NIC driver to notify the DPI application when the DPI core receives an IPI. This requires sending a netlink message from kernel to user space. This somewhat violates the isolation property, since the DPI core uses the netlink service provided by the Linux kernel. However, further investigation shows that this adds almost no overhead to the Linux kernel. In this way, the prototype emulates in software the redirection of the accesses to the NIC registers that would be provided in the Ally hardware by the modified MMU. The prototype provides timing-faithful results not possible to obtain with QEMU.

To evaluate Ally in the context of real packet inspection software, we deployed Snort (www.snort.org) on the DPI core. Snort is a popular open source Intrusion Detection/Prevention System. All packets intercepted on the DPI core are processed by Snort before being forwarded to the NIC or the OS core. We modified Snort for Ally to use the kernel-user communication mechanism described in Section 4.4.

While Ally can use an arbitrary number of cores, we limited the experiments to one core for DPI, due to the single-threaded nature of Snort. Parallelization techniques could be used to enable multicore scalability of Snort, but this is beyond the scope of this paper.

We compare our prototype, Ally, with two alternative systems. The first, Linux, deploys Snort on unmodified Linux. Linux is configured such that NIC interrupts and Snort are pinned on one core ("DPI core"), while all other activities are pinned on a second core ("OS core"). We stress that this Linux system is not a viable alternative to Ally since it provides no transparency whatsoever. We use it only to verify that Ally does not increase processing costs significantly compared to native Linux, and to evaluate the benefits of Ally's user-kernel interaction mechanism. For the second system, Xen, we deployed two virtual machines (VMs), Dom0 and DomU. Each domain has one virtual CPU (VCPU). Snort runs in Dom0, which intercepts all packets to/from DomU. The Dom0 VCPU acts as the DPI core while the DomU VCPU acts as the OS core. Since the number of VCPUs is equal to the number of physical cores, we pin virtual to physical CPUs to avoid performance loss due to VCPU scheduling, context switching, and migration. In addition, Dom0 and DomU are paravirtualized (PV) domains, which continue to provide higher performance than fully virtualized (HVM) domains even with modern hardware support for virtualization. The resulting configuration with VCPU pinning and PV domains is arguably the most efficient setting we can use in the current Xen implementation.

Figure 5: Throughput for Ally, Linux, and Xen running with Snort. Throughput is in Mb/s for TCP MAERTS and TCP STREAM (left y-axis), while TCP RR uses transactions/s (right y-axis)

Figure 5 shows the throughput achieved by Ally, Linux, and Xen for the Netperf benchmarks. For the streaming workloads, the DPI core is saturated for all three systems and limits the achieved throughput. For Ally and Linux this is mainly a result of choosing a highly CPU-intensive ruleset for Snort as our example use case. Less CPU-intensive packet processing use cases – e.g., passive monitoring, load balancing – would likely achieve full line rate bandwidths. Ally achieves 25-32% higher throughput than Linux for the streaming benchmarks (TCP MAERTS and TCP STREAM), and both systems far exceed the performance of Xen. The low overhead of Ally's packet interception mechanism leaves more CPU cycles available for Snort to process packets when the DPI core is saturated. In contrast, the transaction rate in TCP RR is limited by the round-trip latency instead of the CPU cycles Snort can consume. Ally's queue virtualization adds a small delay to the transmit and receive paths compared to Linux. As a result, Ally achieves similar but slightly lower TCP RR transaction rates than Linux, and both systems significantly outperform Xen.

To better understand the reasons behind these high-level performance results, we used OProfile to analyze the breakdown of CPU processing cost on each core, as described next.

5.3 Per-Core Processing Costs

Figure 6: CPU cycles (OS core and DPI core) for Ally, Linux, and Xen running with Snort (y-axis: cycles/packet × 10^3)

Figure 6 shows the CPU cycles consumed on the DPI core and the OS core to process each packet on each system for the three Netperf benchmarks. Since Snort is the only user-level process running on the DPI core, we further divide DPI core cycles into DPI (user) and DPI (kernel) to explicitly show Snort overhead. The results show that Snort (i.e. DPI-user) consumes a similar number of cycles per packet across the three systems. Thus, Snort performs about the same amount of work per packet in all three systems. Further studies (not shown) revealed that most cycles are consumed by Snort's pattern matching algorithm, which is used to compare the packet payload against all rules. Snort consumes the majority of CPU cycles in TCP MAERTS and TCP STREAM since many packets with large payloads are inspected. In contrast, for TCP RR, only one small packet is inspected at a time.

The three systems have significantly different processing costs for DPI (kernel) and the OS core. Ally and Linux have roughly similar OS core costs, which are much lower than for Xen. Ally has lower DPI (kernel) costs than Linux and Xen for all three benchmarks.

Table 3: Classes grouping Linux functions

  Class                Description
  Packet Interception  Functions used to intercept packets
  Memory copy          Functions used to copy data between kernel and user space
  Network              Transmit/receive path related functions
  Hypervisor           Functions in Xen
  Other                e.g. time, scheduling

To analyze these differences, we need a further breakdown of processing costs for DPI (kernel) and the OS core. We thus group source code functions into the cost categories shown in Table 3. Functions are placed into these categories based on a static analysis of the Linux source tree and on an analysis of the dynamic call graph of kernel execution. The following subsections break down the cost of DPI (kernel) and the OS core according to these function categories.

5.3.1 DPI Core Costs

Figure 7: CPU cycles for DPI core (kernel) for Ally, Linux, and Xen (y-axis: cycles/packet × 10^3)

Figure 7 shows the breakdown of DPI (kernel) cost per packet using the categories of Table 3. We are interested in comparing the packet interception mechanisms of Ally and Linux and their corresponding memory copying costs. In Linux, Snort normally uses the libipq mechanism to fetch packets from kernel space. This mechanism leverages the Linux iptables mechanism to intercept and enqueue network packets at kernel level into a special queue. From there, the packets are delivered to the user-level Snort application using the Linux netlink mechanism, which provides a standard sockets interface for user-kernel communication. Each netlink message contains one packet and is copied from kernel space to user space. After inspection, Snort sends a verdict back into the kernel specifying how to deal with those packets (ACCEPT or DROP). Packets accepted by Snort continue to traverse the network stack. This libipq mechanism is used in the Linux and Xen systems.

As shown in Figure 7, Linux has higher packet interception overhead than Ally for the streaming workloads. For libipq as used in Linux, the number of netlink messages is equal to the number of packets analyzed by Snort. Therefore, Linux has a similar netlink overhead per packet in each benchmark. For Ally, the queue virtualization mechanism generates a netlink message each time the tail/head pointers are updated. For the streaming benchmarks, each update corresponds to multiple packets instead of just one packet, leading to lower overhead for these workloads than in Linux.

In addition, Ally has the lowest overhead in the memory copy category. The reason is that in Linux, libipq uses the netlink mechanism to transport entire packets to user space, whereas in Ally the netlink connection is only used to transport the descriptor head and tail pointers to Snort, and all packet data is directly accessed via memory mapping. For all benchmarks, Ally has no overhead in the network category. This indicates that the overhead incurred on the DPI core is purely for packet interception.

For the TCP RR benchmark, Ally has higher packet interception overhead than Linux, for two reasons. First, the number of updates of the tail/head pointers is the same as the number of packets, since only one packet is sent or received at a time. Therefore, the number of netlink messages is similar for Ally and Linux. Second, queue virtualization requires updating the real head and tail pointers by performing costly MMIO accesses whose latency cannot be hidden in TCP RR.

In all cases, Xen has much higher DPI core processing cost than Ally and Linux. Most of the hypervisor overhead is due to the expensive grant table mechanism, which is used to remap packets from one domain to the other. Even the network processing within Dom0 greatly exceeds that of Linux and Ally because of the complex backend driver and bridge mechanism added in Dom0. Overall this result reflects a relatively high cost of packet interception and network interface virtualization in the current Xen implementation. However, virtualization costs will likely be reduced by future enhancements to Xen [9, 10]. In addition, optimizations similar to those we added to Ally could potentially be applied to Dom0 to reduce the cost of intercepting and sending packets to the user-level Snort application in Dom0, but this is beyond the scope of this paper.

In summary, Ally has lower overhead for the streaming workloads due to the queue virtualization mechanism and the reduction in memory copies provided by the kernel-to-user communication mechanism. Of these two mechanisms, queue virtualization contributes the larger portion of the overhead reduction. For the latency-sensitive workload (TCP RR), Ally's mechanisms provide packet processing latency comparable to native Linux. In all cases, Ally outperforms Xen, due to Xen's hypervisor overheads and less efficient packet interception.

5.3.2 OS Core Costs

Recall from Figure 6 that OS core costs with Ally are similar to the costs with unmodified Linux. In fact, the costs with Ally are slightly lower; however, the difference is less than 15%. We found several causes for this difference. First, in Ally, the OS core accesses virtualized NIC registers rather than performing the MMIO accesses, which incur higher latency. (Ally performs these MMIO accesses on the DPI core, and their cost is reflected in the results of Figure 7.) Second, Ally incurs a little more overhead in netlink functions due to the service shared with the DPI core. Lastly, for the receive-heavy TCP MAERTS workload in Ally, the DPI core processing can fetch packet data into the processor cache hierarchy in advance of OS core processing of the same packets, potentially reducing the cost of OS core processing.

5.4 SPECWeb Results

While Netperf is useful for understanding system behavior in network-intensive scenarios that place high requirements on packet processing services, most real workloads present a more balanced mix of computing and networking. We ran experiments using SPECWeb2005 to evaluate Ally with a realistic workload presenting a mix of requirements for each resource type. Each benchmark – Bank, Ecommerce and Support – is run with 100 simultaneous sessions. Apache is run on the OS core (user) while Snort is run on the DPI core (user).

Figure 8: Processing costs for the SPECWeb2005 workloads (y-axis: cycles/request × 10^6)

Figure 9: Cache misses for the SPECWeb2005 workloads (y-axis: cache misses/request × 10^3)

The results in Figure 8 show that Support has the highest number of packets per request, which in turn increases the per-request processing cost of Snort. Traffic encryption in Bank makes Apache consume the most cycles of all the cases. The kernel cost in Ally shows that its packet interception mechanism is more scalable than Xen's, whose hypervisor cost increases sharply under the higher number of packets in Support.

To study potential cache sharing and interference effects, we run Apache with and without Snort using the SPECweb workloads. The results in Figure 9 show that Apache has a similar number of cache misses running with or without Snort. The cache misses of the OS core (kernel) in Ally are slightly higher than in Linux, possibly because of resource contention due to the netlink service shared with the DPI core. The majority of the remaining misses are due to kernel page allocation. The cache misses of Snort ("DPI core (user)") are mainly due to packet decoding and payload inspection, since received packet data is first touched in our solution by Snort rather than by the Linux kernel (zero-copy for inspection). The DPI core (kernel) cache misses in Ally are due to descriptor queue synchronization, as descriptors are invalidated in the cache when the NIC DMAs the modified descriptors into main memory. We also found that there is almost no difference between the average request latencies of the two cases. The main contribution to the latency comes from Apache, not from the packet processing in Snort.

6. RELATED WORK

Conventional DPI appliances, and other network packet processing appliances such as firewalls, are typically implemented as special-purpose devices at a datacenter gateway [11]. Ally adds this functionality to standard server platforms, enabling distributed deployment of local packet processing in software. Previous work advocated implementing these functionalities in virtual machines (VMs) running on a hypervisor along with application VMs, and using Trusted Computing (TC) hardware in modern processors to ensure that the I/O appliances are running authorized code [2]. Ally differs from these approaches by removing the dependence on a hypervisor. By maintaining independence from host software, Ally provides a general solution serving a wide variety of customers with diverse OSes and hypervisor deployments, even in a shared datacenter. Moreover, the core sequestration model avoids the complexities of constructing chains of attestation using TC.

In the context of business laptops, Intel Vpro technology [12] provides limited packet filtering capabilities in the NIC. Ally provides a much more flexible and powerful infrastructure applicable to packet analysis or other compute-intensive system management functions.

Similarly to Ally, a hypervisor could be used to provide system partitioning (OS partition and DPI partition). Hypervisors can additionally host virtual machines called driver domains to transparently intercept data exchanged between running guest VMs and I/O devices [13, 14]. Additional packet processing such as DPI could be performed in software in these driver domains [15]. In comparison to using software virtualization for partitioning and I/O interception, the hardware/software approach provided by Ally has the conceptual advantage that the "OS partition" can run arbitrary user code. In particular, the OS partition can itself run a hypervisor. Without hardware partitioning, this would require full support for nested layers of virtualization. Recent work indicates that such nesting may be feasible with reasonable performance degradation [16]. However, this solution achieved good performance for I/O only by bypassing the virtualization layers, precluding packet interception. In addition, the hardware-based isolation mechanisms used in Ally are arguably more resistant to attacks compared to potentially buggy hypervisor software. This is the case even if Trusted Computing [17] techniques are used to verify the authenticity of binary executables. Some researchers have argued recently that hardware partitioning offers several significant advantages over software hypervisors [18]. Ally strikes a balance by fully supporting OS core use of hypervisors (and hardware support for virtualization like Intel VT-x and Extended Page Tables) while providing a simple form of hardware partitioning that sets aside DPI cores as needed for transparent datacenter-wide distributed packet processing.

Some techniques that Ally uses for lightweight packet interception could also be adapted for driver domains in software-virtualized environments. For example, a running guest VM that has direct access to a NIC (or a virtual context of a NIC) could be transparently switched on-the-fly by the hypervisor to use a driver domain that intercepts and inspects packets by virtualizing the NIC queues using techniques similar to Ally's.

The BitVisor hypervisor aims to provide transparent I/O services for a single guest VM using device queue virtualization similar to our approach [19]. BitVisor differs from Ally in not using hardware core sequestering, and therefore it cannot support a full operating system and application stack on a dedicated DPI core, or run a hypervisor with multiple guest OSes on the OS cores. TwinDrivers [7] also uses queue virtualization, but places this functionality in a hypervisor to accelerate NIC virtualization. Unlike Ally, TwinDrivers does not try to provide packet processing services, and thus lacks core sequestering and support for full software stacks for general packet processing services.

The use of a highly privileged core for security and monitoring has been proposed for a system called INDRA [20] and by Chen et al. in the context of log-based architectures [21]. In these systems, the privileged core is called the "resurrector" or "lifeguard". Ally extends this functionality by providing efficient I/O analysis capabilities to the privileged core.

7. CONCLUSIONS

Ally enables software-independent and transparent deployment of packet processing services like DPI on a multicore processor. Our evaluations on a system emulator and a prototype Ally system demonstrate the feasibility of the Ally approach and analyze in detail the contributions of the design components to overall performance. Our results demonstrate that Ally is competitive in efficiency with deploying DPI non-transparently on native Linux. Ally is more efficient than a system that uses Xen software virtualization to achieve transparent packet inspection. Moreover, Ally is transparent to and supports the use of arbitrary operating systems and/or hypervisors running in the OS core partition, and is compatible with the use of hardware support for virtualization like Intel VT-x.

We plan to extend this work to support and evaluate packet processing services using emerging multi-context NICs (PCIe SR-IOV NICs) [22]. Given the low cost of queue virtualization observed in our current prototype, it is highly likely that the cost of Ally's interception mechanisms will be dominated by the link bandwidth or packet rate, and not by the number of contexts. In addition, we plan to evaluate Ally for larger-scale multicore systems. Ally should easily leverage future processors equipped with several tens of cores by distributing the packet inspection engine.

Ally and complementary technologies are bringing greater packet processing capabilities to general-purpose servers with powerful programmable CPUs and large amounts of memory, enabling new rich network services. This motivates many potential future research investigations in the context of large datacenters, to study problems including secure deployment, configuration, and coordination of an ensemble of Ally instances and services.

8. REFERENCES

[1] Deep packet inspection: 2009 market forecast. Lightreading Insider, 8(11), December 2008.
[2] C. Dixon, H. Uppal, V. Brajkovic, D. Brandon, T. Anderson, and A. Krishnamurthy. ETTM: a scalable fault tolerant network manager. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011.
[3] Intel. Introducing Intel Many Integrated Core architecture. Online: http://www.intel.com/technology/architecture-silicon/mic/index.htm.
[4] F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the USENIX Annual Technical Conference, pages 41-41. USENIX Association, 2005.
[5] Intel Active Management Technology. Online: http://www.intel.com/technology/platform-technology/intel-amt.
[6] HP Integrated Lights-Out (iLO) Standard. Online: http://www.hp.com/go/ilo.
[7] A. Menon, S. Schubert, and W. Zwaenepoel. TwinDrivers: semi-automatic derivation of fast and safe hypervisor network drivers from guest OS drivers. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '09, pages 301-312, New York, NY, USA, 2009. ACM.
[8] V. Ganapathy, M. Renzelmann, A. Balakrishnan, M. Swift, and S. Jha. The design and implementation of microdrivers. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pages 168-178, New York, NY, USA, 2008. ACM.
[9] J. Santos, Y. Turner, G. Janakiraman, and I. Pratt. Bridging the gap between software and hardware techniques for I/O virtualization. In USENIX 2008 Annual Technical Conference, pages 29-42, Berkeley, CA, USA, 2008. USENIX Association.
[10] K. Ram, J. Santos, Y. Turner, A. Cox, and S. Rixner. Achieving 10 Gb/s using safe and transparent network interface virtualization. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '09, pages 61-70, New York, NY, USA, 2009. ACM.
[11] CloudShield Technologies. Online: http://www.cloudshield.com.
[12] Intel Vpro Technology. Online: http://www.intel.com/vpro.
[13] J. LeVasseur, V. Uhlig, J. Stoess, and S. Gotz. Unmodified device driver reuse and improved system dependability via virtual machines. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, San Francisco, CA, December 2004.
[14] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williams. Safe hardware access with the Xen virtual machine monitor. In OASIS '04: Proceedings of the 1st Workshop on Operating System and Architectural Support for the On Demand IT Infrastructure, October 2004.
[15] D. McAuley and R. Neugebauer. A case for virtual channel processors. In NICELI '03: Proceedings of the ACM SIGCOMM Workshop on Network-I/O Convergence, pages 237-242, New York, NY, 2003. ACM.
[16] M. Ben-Yehuda, M. Day, Z. Dubitzky, M. Factor, N. Har'El, A. Gordon, A. Liguori, O. Wasserman, and B. Yassour. The Turtles project: Design and implementation of nested virtualization. In OSDI 2010.
[17] J. McCune, B. Parno, A. Perrig, M. Reiter, and H. Isozaki. Flicker: an execution infrastructure for TCB minimization. In EuroSys '08: Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2008.
[18] E. Keller, J. Szefer, J. Rexford, and R. Lee. NoHype: virtualized cloud infrastructure without the virtualization. In ISCA '10: Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010.
[19] T. Shinagawa, H. Eiraku, K. Tanimoto, K. Omote, S. Hasegawa, T. Horie, M. Hirano, K. Kourai, Y. Oyama, E. Kawai, K. Kono, S. Chiba, Y. Shinjo, and K. Kato. BitVisor: a thin hypervisor for enforcing I/O device security. In VEE '09: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 121-130, 2009.
[20] W. Shi, H.-H. S. Lee, L. Falk, and M. Ghosh. An integrated framework for dependable and revivable architectures using multicore processors. In ISCA '06: Proceedings of the 33rd Annual International Symposium on Computer Architecture, pages 102-113, 2006.
[21] S. Chen, M. Kozuch, T. Strigkos, B. Falsafi, P. Gibbons, T. Mowry, V. Ramachandran, O. Ruwase, M. Ryan, and E. Vlachos. Flexible hardware acceleration for instruction-grain program monitoring. In ISCA '08: Proceedings of the 35th International Symposium on Computer Architecture, 2008.
[22] Y. Dong, Z. Yu, and G. Rose. SR-IOV networking in Xen: Architecture, design and implementation. In WIOV '08: Proceedings of the 1st Workshop on I/O Virtualization, 2008.