
Hyper-Switch: A Scalable Software Virtual Switching Architecture

Kaushik Kumar Ram, Alan L. Cox, Mehul Chadha and Scott Rixner

Rice University

Abstract

In virtualized datacenters, the last hop switching happens inside a server. As the number of virtual machines hosted on the server goes up, the last hop switch can be a performance bottleneck. This paper presents the Hyper-Switch, a highly efficient and scalable software-based network switch for virtualization platforms that support driver domains. It combines the best of the existing I/O virtualization architectures by hosting device drivers in a driver domain to isolate faults and placing the packet switch in the hypervisor for efficiency. In addition, this paper presents several optimizations that enhance performance. They include virtual machine (VM) state-aware batching of packets to mitigate the costs of hypervisor entries and guest notifications, preemptive copying and immediate notification of blocked VMs to reduce packet arrival latency, and, whenever possible, dynamic offloading of packet processing to idle processor cores. These optimizations enable efficient packet processing, better utilization of the available CPU resources, and higher concurrency.

We implemented a Hyper-Switch prototype in the Xen virtualization platform. This prototype's performance was then compared to Xen's default network I/O architecture and KVM's vhost-net architecture. The Hyper-Switch prototype performed better than both, especially for inter-VM network communication. For instance, in one scalability experiment measuring aggregate inter-VM network throughput, the Hyper-Switch achieved a peak of ∼81 Gbps as compared to only ∼31 Gbps under Xen and ∼47 Gbps under KVM.

1 Introduction

Machine virtualization is now used extensively in datacenters. This has led to considerable change to both the datacenter network and the I/O subsystem within virtualized servers. In particular, the communication endpoints within the datacenter are now virtual machines (VMs), not physical servers. Consequently, the datacenter network now extends into the server and last hop switching occurs within the physical server.

At the same time, thanks to increasing core counts on processors, server VM density is on the rise. This trend is placing enormous pressure on the network I/O subsystem and the last-hop virtual switch to support efficient communication—especially between VMs—in virtualized servers.

There are many I/O architectures for network communication in virtualized systems. Of these, software device virtualization is the most widely used. This preference for software over specialized hardware devices is due in part to the rich set of features—including security, isolation, and mobility—that the software solutions offer.

The software solutions can be further divided into driver domain and hypervisor-based architectures. Driver domains are dedicated VMs that host the drivers that are used to access the physical devices. They provide a safe execution environment for the device drivers. Arguably, the hypervisors that support driver domains are more robust and fault tolerant, as compared to the alternate solutions that locate the device drivers within the hypervisor. However, on the flip side, they incur significant software overheads that not only reduce the achievable I/O performance but also severely limit I/O scalability [29, 31].

In existing I/O architectures, the virtual switch is implemented inside the same software domain where the virtual devices are implemented and the device drivers are hosted. For instance, all of these components are implemented inside a driver domain in Xen [13] and the hypervisor in KVM [26]. This collocation is purely a matter of convenience since packets must be switched when they are moved between the virtual devices and the device drivers.

In this paper, we introduce the Hyper-Switch, which challenges the existing convention by separating the virtual switch from the domain that hosts the device drivers. The Hyper-Switch is a highly efficient and scalable software-based switch for virtualization platforms that support driver domains. In particular, the hypervisor includes the data plane of a flow-based software switch, while the driver domain continues to safely host the device drivers. Since the data plane is small relative to the size of the switch control plane, it does not significantly increase the size of the hypervisor or the platform's overall trusted computing base (TCB). The Hyper-Switch explores a new point in the virtual switching design space.


Another contribution of this paper is a set of optimizations that increase performance. They enable the Hyper-Switch to efficiently support both bulk and latency-sensitive network traffic. They include VM state-aware batching of packets to mitigate the costs of hypervisor entries on the transmit side and guest notifications on the receive side. Preemptive copying is employed at the receiving VM, when it is being notified of packet arrival, to reduce latency. Further, whenever possible, packet processing is dynamically offloaded to idle processor cores. The offloading is performed using a low-overhead mechanism that takes into account CPU cache locality, especially in NUMA systems.

These optimizations enable efficient packet processing, better utilization of the available CPU resources, and higher concurrency. They take advantage of the Hyper-Switch data plane's integration within the hypervisor and its proximity to the scheduler. As a result, the Hyper-Switch enables much improved and scalable network performance, while maintaining the robustness and fault tolerance that derive from the use of driver domains. Further, we believe that these optimizations can and should be a part of any virtual switching solution that aims to deliver high performance.

We evaluated the Hyper-Switch using a prototype that was implemented in the Xen virtualization platform [4]. The prototype was built by modifying Open vSwitch [24], a multi-layer software switch for commodity servers. In this evaluation, the Hyper-Switch's performance was compared to that of KVM's vhost-net and Xen's default network I/O architectures. The results show that the Hyper-Switch enables much better scalability and peak throughput than both of these existing architectures.

The rest of this paper is organized as follows. Section 2 further motivates the Hyper-Switch by discussing some of the issues with existing architectures. Section 3 explains the Hyper-Switch's design. Section 4 describes the implementation of the Hyper-Switch prototype. Section 5 presents a detailed evaluation of the Hyper-Switch. Section 6 discusses related work. Finally, Section 7 summarizes the conclusions.

2 Motivation

The need for efficient and scalable network communication within virtualized servers is increasing. Intel already claims to have an architecture that can scale to 1000 cores on a single chip [15]. Furthermore, the number of cores on a chip is predicted to grow to 64 in a few years and to 256–512 by the end of the decade [12]. If this last prediction is borne out, then in 2020 a single 1U server will have as many cores as an entire rack of servers does today.

In addition, communication between servers within the same datacenter already accounts for a significant fraction of the datacenter's total network traffic [14]. Moreover, Benson et al.'s study of multiple datacenter networks reported that 80% of the traffic originating at servers in cloud datacenters never leaves the rack [5]. If the predictions for the growing number of cores come to pass, then a rack of servers may be replaced by VMs within a single physical server, and the network traffic that today never leaves the rack may instead never leave the server. Consequently, the Hyper-Switch has been heavily optimized to enable high performance inter-VM communication.

Modern multi-core systems enable concurrent processing of network packets. Under Xen's default network architecture, the driver domain can run in parallel to the transmitting and receiving VMs. Consequently, it is possible to perform packet switching in parallel with packet transmission and reception. However, there are several fundamental problems with traditional driver domain architectures that limit I/O performance scalability. Fundamentally, the driver domain must be scheduled to run whenever packets are waiting to be processed. This might involve scheduling multiple virtual processors depending on the number of threads used for packet processing in the driver domain. As a result, scheduling overheads are incurred while processing network packets. Further, the driver domain must be scheduled in a timely manner to avoid unpredictable delays in the processing of network packets.

Today, it is standard practice in real-world virtualization deployments to dedicate processor cores to the driver domain. This avoids scheduling delays, but often leaves cores idle. In fact, dedicating CPU resources for backend processing is not limited to just driver domain-based architectures. There have also been several proposals to offload some of the packet processing to dedicated cores—including Kumar et al.'s sidecore approach [17] and Landau et al.'s split execution (SplitX) model [18]. But this can lead to underutilization of these cores. Further, this goes against one of the fundamental tenets of virtualization, which is to enable the most efficient utilization of the server resources. Hence the Hyper-Switch has been designed to smartly and dynamically utilize the available resources.

At the same time, reliability cannot be ignored, especially as servers in datacenters move toward multi-tenancy. Hypervisors that support driver domains are potentially more robust and fault tolerant. However, driver domains incur significant overheads. These overheads are due to the costs of moving packets between the guest VMs and the driver domain [21, 29, 31], because the driver domain cannot trivially access a packet in the guest VM's memory. The driver domain is just another VM, and the hypervisor maintains memory isolation between VMs. So the driver domain must use an expensive memory sharing mechanism to access the packet. Hypervisor-based architectures do not incur these memory sharing overheads since the packets in the guest VMs' memory can be directly accessed from the hypervisor. The Hyper-Switch has been designed to take advantage of driver domains without incurring the associated memory sharing overheads.


Figure 1: The Hyper-Switch architecture. The last hop virtual switch is implemented partly in the hypervisor (data plane) and partly in the management layer (control plane). The device drivers are hosted in the driver domain.

Figure 2: Traditional driver domain architecture. The driver domain hosts the device drivers and the last hop virtual switch.


3 Hyper-Switch Design

Figure 1 illustrates the Hyper-Switch architecture. There are two fundamental aspects to this architecture. First, unlike existing systems that use driver domains, the Hyper-Switch architecture—as the name implies—implements the virtual switch inside the hypervisor. So internal network traffic between virtual machines (VMs) that are collocated on the same server is handled entirely within the hypervisor. Incoming external network traffic is initially handled by the driver domain, since it hosts the device drivers, and then is forwarded to the destination VM through the Hyper-Switch. For outgoing external traffic, these two steps are simply reversed. In essence, from the Hyper-Switch's perspective, two guest VMs form the endpoints for internal network traffic, and the driver domain and a guest VM form the endpoints for external network traffic. Contrast this with the traditional driver domain architecture as illustrated in Figure 2.

Second, the hypervisor implements just the data plane of the virtual switch that is used to forward network packets between VMs. The switch's control plane is implemented in the management layer. So the virtual switch implementation is distributed across virtualization software layers with only the bare essentials implemented inside the hypervisor. The separation of control and data planes is achieved using a flow-based switching approach. This approach has been previously used in other virtual switching solutions such as Open vSwitch [24]. However, Open vSwitch's control and data planes are both implemented inside the driver domain.

The rest of this section describes the Hyper-Switch's design in detail. First, the basics are explained by describing the path taken by a network packet. This is followed by several performance optimizations that improve upon the basic design.

3.1 Basic Design

Packet processing by the Hyper-Switch begins at the transmitting VM (or driver domain) where the packet originates and ends at the receiving VM (or driver domain) where the packet has to be delivered. It proceeds in four stages: (1) packet transmission, (2) packet switching, (3) packet copying, and (4) packet reception.

Packet Transmission. In the first stage, the transmitting VM pushes the packet to the Hyper-Switch for processing. Packet transmission begins when the guest VM's network stack forwards the packet to its para-virtualized network driver. Then the packet is queued for transmission by setting up descriptors in the transmit ring. A single packet can potentially span multiple descriptors depending on its size. Typically, packets are never segmented in the transmitting guest VM. So the packets belonging to internal network traffic can be forwarded as is. The external packets are segmented either in the driver domain or the network hardware. Today, segmentation offload is a standard feature in most network devices.

Packet Switching. In the second stage of packet processing, the packet is switched to determine its destination. Switching is triggered by a hypercall from the transmitting VM. It begins with reading the transmit ring to find new packets. Each packet is then pushed to the Hyper-Switch's data plane where it is processed using flow-based packet switching. The data plane must be able to read the packet's headers in order to switch it. Since the data plane is located in the hypervisor, which has direct access to every VM's memory, it can read the headers directly from the transmitting VM's memory.
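The descriptor-chaining behavior described above can be sketched as follows. This is an illustrative model only; the structure layout, sizes, and names (hs_tx_desc, hs_tx_ring, hs_tx_enqueue) are assumptions rather than the prototype's actual definitions.

```c
/* Sketch of a transmit ring: a packet larger than one buffer is queued
 * as a chain of descriptors, with F_MORE marking all but the last one. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 256u         /* fixed circular buffer of descriptors */
#define BUF_SIZE  2048u        /* bytes addressable by one descriptor  */
#define F_MORE    0x1u         /* more descriptors follow for this packet */

struct hs_tx_desc {
    uint64_t gpa;              /* guest-physical address of the buffer */
    uint32_t len;              /* number of valid bytes in the buffer  */
    uint32_t flags;            /* F_MORE if the packet continues       */
};

struct hs_tx_ring {
    struct hs_tx_desc desc[RING_SIZE];
    uint32_t prod;             /* advanced by the guest (producer)     */
    uint32_t cons;             /* advanced by the switch (consumer)    */
};

/* Queue one packet, splitting it across as many descriptors as needed.
 * Returns 0 on success, -1 if the ring does not have enough free slots. */
static int hs_tx_enqueue(struct hs_tx_ring *r, uint64_t pkt_gpa, uint32_t pkt_len)
{
    uint32_t needed = (pkt_len + BUF_SIZE - 1) / BUF_SIZE;
    uint32_t free_slots = RING_SIZE - (r->prod - r->cons);

    if (needed == 0 || needed > free_slots)
        return -1;

    for (uint32_t i = 0; i < needed; i++) {
        struct hs_tx_desc *d = &r->desc[(r->prod + i) % RING_SIZE];
        uint32_t chunk = pkt_len > BUF_SIZE ? BUF_SIZE : pkt_len;

        d->gpa   = pkt_gpa;
        d->len   = chunk;
        d->flags = (i + 1 < needed) ? F_MORE : 0;
        pkt_gpa += chunk;
        pkt_len -= chunk;
    }
    r->prod += needed;         /* publish the descriptors to the switch */
    return 0;
}

int main(void)
{
    static struct hs_tx_ring ring;   /* zero-initialized */
    int rc = hs_tx_enqueue(&ring, 0x10000, 5000);
    printf("enqueue rc=%d, descriptors used=%u\n", rc, (unsigned)ring.prod);
    return 0;
}
```

In this sketch a 5000-byte packet occupies three descriptors; since packets are not segmented in the guest, internal traffic rides through the switch in exactly this chained form.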

Packet switching by the data plane proceeds as follows: (1) The packet header fields are parsed to identify the corresponding packet flow. (2) The packet flow is used to look up a matching flow rule in a software flow table. When this lookup fails, the packet is forwarded to the control plane in the management layer. (3) A successful flow table lookup identifies a flow rule, which specifies one or more actions to be performed. Typically, the action is to forward the packet to one or more destination ports or to drop the packet. Each port has an internal receive queue where the switched packet is temporarily placed. This port corresponds to a virtual network interface (vNIC) within the destination VM.

When the flow table lookup fails, the packet is forwarded to the control plane through a separate control interface. The control plane decides how the packet must be forwarded based on packet filtering rules, forwarding entries from an Ethernet address learning service, and/or other protocol-specific tables. This is composed into a new flow rule that specifies the actions to be performed on packets belonging to this flow. Then the packet is re-injected into the hypervisor's data plane and the associated actions are executed. Finally, the control plane adds the new flow rule to the flow table. This allows the flow's subsequent packets to be handled entirely within the hypervisor's data plane.

Packet Copying. In the third stage of packet processing, the switched packet is copied into the receiving VM's memory. Empty buffers for receiving new packets are provided through the vNIC. Specifically, the descriptors in the receive ring provide the address of the empty buffers in the VM's memory.
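The fast-path/slow-path split can be illustrated with a minimal flow-table lookup. The flow-key fields, hash function, and names (flow_key, flow_rule, switch_packet) are simplified assumptions and not the Open vSwitch datapath's actual structures.

```c
/* Minimal sketch of flow-based switching: hash the parsed header fields,
 * look for a matching rule, and fall back to the control plane on a miss. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FLOW_TABLE_SIZE 1024

struct flow_key {               /* fields parsed from the packet headers */
    uint8_t  eth_dst[6];
    uint8_t  eth_src[6];
    uint16_t eth_type;
    uint32_t ip_src, ip_dst;
    uint16_t tp_src, tp_dst;
};

struct flow_rule {
    bool            valid;
    struct flow_key key;
    int             out_port;   /* action: forward to this vNIC's internal
                                   receive queue, or negative to drop */
};

static struct flow_rule flow_table[FLOW_TABLE_SIZE];

static uint32_t flow_hash(const struct flow_key *k)
{
    /* FNV-1a over the key bytes; any reasonable hash works here. */
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++)
        h = (h ^ p[i]) * 16777619u;
    return h % FLOW_TABLE_SIZE;
}

/* Returns the output port on a hit, or -1 when the packet must be punted
 * to the control plane (which would install a rule and re-inject it). */
static int switch_packet(const struct flow_key *k)
{
    struct flow_rule *r = &flow_table[flow_hash(k)];

    if (r->valid && memcmp(&r->key, k, sizeof(*k)) == 0)
        return r->out_port;     /* fast path: handled inside the hypervisor */
    return -1;                  /* slow path: stubbed out in this sketch */
}

int main(void)
{
    struct flow_key k;
    memset(&k, 0, sizeof(k));   /* zero padding so memcmp is well-behaved */
    k.eth_type = 0x0800;

    struct flow_rule *slot = &flow_table[flow_hash(&k)];
    slot->valid = true;
    memcpy(&slot->key, &k, sizeof(k));
    slot->out_port = 3;

    printf("fast-path out port: %d\n", switch_packet(&k));
    return 0;
}
```

Once the control plane installs a rule for a flow, every later packet of that flow resolves in the hash lookup above and never leaves the hypervisor's data plane.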

By default, the destination VM is responsible for performing packet copies. Once switching is completed, the destination VM is notified via a virtual interrupt. Subsequently, that VM issues a hypercall. While in the hypervisor, it dequeues the packet from the port's internal receive queue, and copies the packet into the memory given by the next descriptor in the receive ring. The packet is copied directly from the transmitting VM's memory to the receiving VM's memory. Afterward, the memory that was allocated for this packet at various places—inside the hypervisor and in the transmitting VM—is released.

Packet Reception. In the fourth and final stage, the para-virtualized network driver in the receiving VM reconstructs the packet from the descriptors in the receive ring. Typically, the receiving OS is notified, through interrupts, that there are new packets to be processed in the receive ring. However, under the Hyper-Switch, the receiving VM was already notified in the previous stage. So packet reception can happen as soon as the hypercall for copying the packet is complete. The new packet is then pushed into the receiving VM's network stack.

3.2 Preemptive Packet Copying

Packet copies are performed by default in a receiving VM's context. When a packet is placed in the internal receive queue, after it has been switched, the receiving VM is notified. Eventually, the receiving VM calls into the hypervisor to copy the packet. However, delivering a notification to a VM already requires entry into the hypervisor. Under Xen, when there is a pending notification to a VM, the VM is interrupted and pulled inside the hypervisor. Since hypercalls are expensive operations, the Hyper-Switch tries to avoid them. In this case, it takes advantage of the hypervisor entry upon event notification to avoid a separate hypercall to perform the packet copy. Instead, the packet copy is performed preemptively when the receiving VM is being notified. In essence, the packet copy operation is combined with the notification to the receiving VM. This optimization avoids one hypervisor entry for every packet that is delivered to a VM.
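In code, the idea is simply to drain the port's internal receive queue on the notification path itself, so the guest never issues a separate copy hypercall. The sketch below is a toy model of that control flow under assumed names (hs_notify_guest, the vnic fields); it is not Xen's actual event-delivery code.

```c
/* Schematic of preemptive packet copying: the copy work is folded into the
 * guest-notification path. Everything here is an illustrative stand-in. */
#include <stdio.h>
#include <string.h>

#define QUEUE_LEN 8
#define PKT_SIZE  64

struct vnic {
    char rxq[QUEUE_LEN][PKT_SIZE];  /* internal receive queue (switched packets) */
    int  rxq_count;
    char ring[QUEUE_LEN][PKT_SIZE]; /* guest receive-ring buffers */
    int  ring_used;
    int  virq_pending;              /* virtual interrupt flag */
};

/* Called while already inside the hypervisor to notify the receiving VM.
 * Before injecting the virtual interrupt, copy any queued packets. */
static void hs_notify_guest(struct vnic *v)
{
    int i;

    /* Preemptive copy: drain what we can using this hypervisor entry. */
    for (i = 0; i < v->rxq_count && v->ring_used < QUEUE_LEN; i++)
        memcpy(v->ring[v->ring_used++], v->rxq[i], PKT_SIZE);

    /* Remove the copied packets from the head of the internal queue. */
    memmove(v->rxq[0], v->rxq[i], (size_t)(v->rxq_count - i) * PKT_SIZE);
    v->rxq_count -= i;

    /* The guest now finds fully copied packets in its receive ring and only
     * has to reconstruct them; no separate copy hypercall is required. */
    v->virq_pending = 1;
}

int main(void)
{
    struct vnic v = { .rxq_count = 3 };   /* three switched packets waiting */
    hs_notify_guest(&v);
    printf("packets copied: %d, interrupt pending: %d\n",
           v.ring_used, v.virq_pending);
    return 0;
}
```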

3.3 Batching Hypervisor Entries

In the Hyper-Switch architecture, as described thus far, the transmitting VM enters the hypervisor every time there is a packet to send. Moreover, the receiving VM is notified every time there is a packet pending in the internal receive queue. As mentioned before, even this notification requires hypervisor intervention. (In Xen, notifying a running guest VM involves two entries into the hypervisor: first, the running VM is interrupted via an IPI and forced to enter the hypervisor; then the hypervisor runs a special exception context where the guest VM handles all pending notifications; finally, the guest VM again enters the hypervisor to return from the exception context.) Therefore, despite the preemptive packet copy optimization, the overhead of entering the hypervisor is incurred multiple times on every packet.

To mitigate this overhead, we use VM state-aware batching, which amortizes the cost of entering the hypervisor across several packets. This approach to batching shares some features with the interrupt coalescing mechanisms of modern network devices. Typically, in network devices, the interrupts are coalesced irrespective of whether the host processor is busy or not. But, unlike those devices, the Hyper-Switch is integrated within the hypervisor, where it can easily access the scheduler to determine when and where a VM is running. So a blocked VM can be notified immediately when there are packets pending to be received by that VM. This enables the VM to wake up and process the new packets without delay. On the other hand, the notification to a running VM may be delayed if it was recently interrupted.
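The state-aware notification decision can be summarized as a small predicate: notify a blocked receiver immediately, but coalesce notifications to a running receiver. The state model, field names, and timer value below are assumptions for illustration only.

```c
/* Sketch of VM state-aware notification batching. The Hyper-Switch can
 * consult the scheduler because it lives in the hypervisor next to it. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum vcpu_state { VCPU_BLOCKED, VCPU_RUNNING };

struct rx_port {
    enum vcpu_state state;      /* receiving VM's vCPU state (from scheduler) */
    uint64_t last_notify_ns;    /* when this VM was last interrupted */
    uint64_t coalesce_ns;       /* minimum gap between notifications */
    int      pending_pkts;      /* packets waiting in the internal queue */
};

/* Decide, on packet arrival, whether to notify the receiver immediately. */
static bool should_notify_now(const struct rx_port *p, uint64_t now_ns)
{
    if (p->pending_pkts == 0)
        return false;                        /* nothing to deliver */
    if (p->state == VCPU_BLOCKED)
        return true;                         /* wake it up without delay */
    /* Running VM: batch packets and delay the interrupt if it was notified
     * recently; a timer bounds how long packets can be held back. */
    return (now_ns - p->last_notify_ns) >= p->coalesce_ns;
}

int main(void)
{
    struct rx_port blocked = { VCPU_BLOCKED, 0,      100000, 1 };
    struct rx_port busy    = { VCPU_RUNNING, 950000, 100000, 4 };

    printf("blocked VM: notify=%d\n", should_notify_now(&blocked, 1000000));
    printf("running VM: notify=%d\n", should_notify_now(&busy,    1000000));
    return 0;
}
```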

3.4 Offloading Packet Processing

In Hyper-Switch, by default, packet switching is performed in the transmitting VM's context and packet copying is performed in the receiving VM's context. As a result, asynchronous packet switching does not occur with respect to the transmitting VM, and asynchronous packet copying does not occur with respect to the receiving VM. However, concurrent and asynchronous packet processing can significantly improve performance.

Concurrent packet processing can be achieved by polling: (1) all the internal receive queues, looking for packets waiting to be copied, and (2) all the transmit rings, looking for packets waiting to be switched. This can be performed by processor cores that are currently idle. Packet copying is prioritized over switching because packet copying is typically the more expensive operation and the receiving VM is more likely to be performance bottlenecked than a transmitting VM.

However, having idle cores poll continuously would waste processor cycles. Instead, the idle cores are woken up just when there is work to be done. On the receive side, this can be ascertained precisely when packets are placed in an internal receive queue of a vNIC. Then one of the idle cores is chosen and woken up to perform the packet copy. A low-overhead mechanism is used to offload work to the idle cores. It uses a simple interprocessor messaging facility to request a specific idle core to copy packets at a specific vNIC. Further, this mechanism attempts to spread the work across many idle cores. Otherwise, if all the work is offloaded to a single idle core, it might become a bottleneck.
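A minimal sketch of such an interprocessor messaging facility is shown below: one small fixed ring per core carries "copy packets at vNIC X" requests, and the choice of core is spread by a trivial hash. The sizes, the probing scheme, and all names are illustrative assumptions, not the prototype's mechanism.

```c
/* Sketch of the low-overhead offload mechanism: per-core request rings
 * plus a simple spreading policy across the currently idle cores. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS   8
#define MSG_SLOTS 16

struct offload_ring {              /* one per physical core */
    uint32_t vnic_id[MSG_SLOTS];
    uint32_t prod, cons;
};

static struct offload_ring rings[NR_CPUS];
static bool cpu_idle[NR_CPUS];

/* Post a copy request for a vNIC to one of the idle cores. Returns the
 * chosen core, or -1 if no idle core with ring space was found. */
static int offload_copy(uint32_t vnic_id)
{
    /* Start probing at a hash of the vNIC id so different vNICs tend to
     * land on different idle cores instead of piling onto one. */
    for (int i = 0; i < NR_CPUS; i++) {
        int cpu = (int)((vnic_id + (uint32_t)i) % NR_CPUS);
        struct offload_ring *r = &rings[cpu];

        if (!cpu_idle[cpu] || r->prod - r->cons == MSG_SLOTS)
            continue;
        r->vnic_id[r->prod % MSG_SLOTS] = vnic_id;
        r->prod++;
        /* A real implementation would now send an IPI to wake the core. */
        return cpu;
    }
    return -1;
}

int main(void)
{
    cpu_idle[2] = cpu_idle[5] = true;
    printf("vNIC 7 offloaded to core %d\n",  offload_copy(7));
    printf("vNIC 12 offloaded to core %d\n", offload_copy(12));
    return 0;
}
```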

The offloading to idle cores is delayed if the receiving VM is going to be notified immediately. As explained previously, this typically happens when the receiving VM is not running. Subsequently, the receiving VM copies a bounded number of packets sufficient to keep it busy, and then, if packets are still pending in the internal receive queue, the remaining copies are offloaded to an idle core. The rationale is to immediately copy some packets so that the receiver can start processing them, while the remaining packets are concurrently copied at an idle core.

Unfortunately, offloading packet switching to idle cores is not trivial. In the common case, packets are asynchronously queued by the transmitting VM without entering the hypervisor. So it is not possible to offload the switching tasks precisely when packets are queued. Therefore, packet switching is performed at the idle cores only as a side effect of offloading packet copies. In other words, when an idle core is woken up to perform packet copies, it also polls all the transmit rings looking for packets pending to be switched.

Further, when packets are being processed by an idle core, the Hyper-Switch checks for any other work that might need that core. If so, it aborts the packet processing. This ensures that the offloaded packet processing happens at the lowest possible priority and does not prevent other tasks from using that processor.

CPU Cache Awareness. CPU cache locality can have a significant impact on the cost of packet copying under Hyper-Switch. Essentially, the packet data is accessed in three places (packet switching is ignored here since it only accesses the packet headers): (1) the transmitting VM, (2) the packet copier, and (3) the receiving VM. So the packet data can be potentially brought into three different CPU caches depending on the system's cache hierarchy and where the two VMs and the packet copier are run.

If the receiving VM is also the packet copier, then the packet data is brought into the receiving VM's CPU cache while the copy is performed. Subsequently, when the packet is accessed in the receiving VM, it can be read with low latency from the cache. But if the packet copier runs on an idle core, the access latency will depend on whether the idle core shares any cache with the receiving VM's core. Therefore, under Hyper-Switch, the offload mechanism for packet processing is optimized to take advantage of CPU cache locality. At the same time, it ensures that the offloaded work does not unfairly affect the performance of other VMs running on cores that share their CPU cache with the idle cores.

Hysteresis. Waking up an idle core takes a non-trivial amount of time, particularly when the idle core is using deeper sleep states to save power. Further, the interprocessor interrupts (IPIs) that are used to wake up cores are not cheap. Therefore, a small hysteresis period is introduced to ensure that the idle cores stay awake longer than they normally would. The idea is to keep the cores running, after they are woken up, until there is a period—the hysteresis time period—during which no packets are processed. In other words, the idle cores are kept running as long as there is a steady stream of packets to process.
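The cache-aware offload criteria and the hysteresis rule can be sketched together as below: prefer a fully idle module on the receiver's NUMA node, and keep a woken core running until no packet has been processed for a quiet interval. The topology model, selection order, and hysteresis constant are assumptions for illustration.

```c
/* Sketch of cache-aware idle-core selection plus hysteresis. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 16

struct cpu_info {
    int  module;                 /* two cores per Bulldozer-style module */
    int  node;                   /* NUMA node */
    bool idle;
};

static struct cpu_info cpus[NR_CPUS];

static bool module_fully_idle(int cpu)
{
    for (int i = 0; i < NR_CPUS; i++)
        if (cpus[i].module == cpus[cpu].module && !cpus[i].idle)
            return false;        /* a busy sibling shares this module's caches */
    return true;
}

/* Pick an idle core for packet copies destined to a VM running on rx_cpu. */
static int pick_offload_cpu(int rx_cpu)
{
    int fallback = -1;
    for (int i = 0; i < NR_CPUS; i++) {
        if (!cpus[i].idle || i == rx_cpu || !module_fully_idle(i))
            continue;
        if (cpus[i].node == cpus[rx_cpu].node)
            return i;            /* best: idle module on the receiver's node */
        if (fallback < 0)
            fallback = i;        /* otherwise: idle module on another node */
    }
    return fallback;
}

/* Hysteresis: after waking up, an offload core keeps polling until no packet
 * has been processed for HYSTERESIS_NS, so a steady packet stream does not
 * pay the wake-up (IPI + sleep-state exit) cost over and over. */
#define HYSTERESIS_NS 50000u
static bool should_keep_running(uint64_t now_ns, uint64_t last_packet_ns)
{
    return (now_ns - last_packet_ns) < HYSTERESIS_NS;
}

int main(void)
{
    for (int i = 0; i < NR_CPUS; i++)
        cpus[i] = (struct cpu_info){ .module = i / 2, .node = i / 8, .idle = (i >= 4) };
    printf("offload core for rx on cpu 1: %d\n", pick_offload_cpu(1));
    printf("keep running: %d\n", should_keep_running(100000, 70000));
    return 0;
}
```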

3.5 More Packet Processing Opportunities

A packet that is queued in the transmit ring at a vNIC will eventually be switched by either the transmitting VM or an idle core. This might happen immediately, if an idle core polls this interface looking for packets waiting to be switched, or it might happen only when the transmit timer period that is implemented by VM state-aware batching elapses. (The maximum delay is bounded by the transmit timer period.)

Consider a VM that queues some packets for transmission at its vNIC and then blocks. Let's assume that there are no other idle cores. If another VM is scheduled to run on this core, then the queued packets are not going to be switched until the blocked VM is scheduled to run again. But this might happen only at the end of the transmit timer period. Even if the core becomes idle after the VM blocks, there is no guarantee that the blocked VM's packets will be switched at that idle core. In fact, the idle core can end up copying packets destined for other VMs. In essence, a VM can block despite its packets waiting to be switched.

When a VM's virtual processor blocks, it has to enter the hypervisor to give up its core. Since the VM is already inside the hypervisor, it might as well check if there are packets pending to be switched or copied. This allows any packet processing work to be completed before the VM stops running. Also, new packet copies result in a notification to the VM. Consequently, instead of blocking, the VM returns to process the packets that were just received.
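The "do pending work before blocking" path amounts to a simple check in the block hypercall, sketched below. The structures and helpers are illustrative stand-ins; they model outcomes only, not the actual switching and copying code.

```c
/* Schematic of the pre-block check: poll the vNIC while already in the
 * hypervisor, and resume the guest instead of blocking if packets were
 * just delivered to it. */
#include <stdbool.h>
#include <stdio.h>

struct vnic_model {
    int tx_pending;              /* packets queued but not yet switched */
    int rx_pending;              /* switched packets not yet copied     */
};

static int switch_packets(struct vnic_model *v)
{
    int n = v->tx_pending;       /* stand-in for flow-table switching   */
    v->tx_pending = 0;
    return n;
}

static int copy_packets(struct vnic_model *v)
{
    int n = v->rx_pending;       /* stand-in for copying into the rx ring */
    v->rx_pending = 0;
    return n;
}

/* Returns true if the vCPU should block, false if it should return to the
 * guest because packets were just copied for it. */
static bool hs_pre_block(struct vnic_model *v)
{
    switch_packets(v);           /* don't leave queued packets unswitched */
    if (copy_packets(v) > 0)
        return false;            /* new receives: resume the guest instead */
    return true;                 /* nothing to deliver: yield the core */
}

int main(void)
{
    struct vnic_model v = { .tx_pending = 2, .rx_pending = 1 };
    printf("block? %s\n", hs_pre_block(&v) ? "yes" : "no");
    return 0;
}
```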

4 Implementation Details

We implemented a prototype of the Hyper-Switch architecture, which is depicted in Figure 3. We implemented the switch's data plane by porting parts of Open vSwitch to the Xen hypervisor. Open vSwitch's control plane was used without modification. We also developed a new para-virtualized (PV) network interface for the guest VMs to communicate with the data plane. The same interface was also used by the driver domain to forward external network traffic. The rest of this section describes each part of the Hyper-Switch prototype in detail.

Open vSwitch Overview. Open vSwitch [24] is an OpenFlow-compatible, multi-layer software switch for commodity servers. The control and data planes are separated. While the data plane is implemented inside the OS kernel, the control plane is implemented in user space. It uses the flow-based approach for switching packets in its data plane. In a typical deployment of Open vSwitch as a last hop virtual switch, it is implemented entirely inside a driver domain (Xen) or the hypervisor (KVM). In the common case, the network traffic between the guest VMs is directly switched by Open vSwitch's data plane within the kernel. Open vSwitch provides a vport abstraction that can be bound to any network interface in the driver domain. In addition, there is one vport for every vNIC in the system.

Figure 3: Hyper-Switch prototype. It was built by porting essential parts of Open vSwitch's datapath to the Xen hypervisor.

Porting Open vSwitch's Datapath. We implemented the Hyper-Switch's data plane by porting Open vSwitch's datapath to the Xen hypervisor. The vports on the datapath were bound to a newly developed para-virtualized network interface that allowed guest VMs to communicate with the Hyper-Switch's data plane.

The driver domain kernel also included a datapath glue layer to enable communication between the control and data planes. This layer converted the commands from Open vSwitch's control plane into a new set of Xen hypercalls to manipulate the flow tables in the datapath. The glue layer also transferred the packets that are punted to the control plane.

Para-virtualized Network Interface. The guest VMs and the driver domain communicated with the Hyper-Switch through a para-virtualized network interface (vNIC). The interface included two transmit rings—one for queueing packets for transmission and another for receiving transmission completion notifications—and one receive ring to deliver incoming packets. The rings were fixed circular buffers where the producer and consumer(s) could access the ring descriptors concurrently. The interface also included an internal receive queue that contained packets that were yet to be copied into the receiving VM's memory.

Hypervisor Integration. As explained in Section 3.2, packet copying was preemptively performed by combining it with the notification to the receiving VM. We implemented this by checking for packets to copy when the associated virtual interrupt was delivered by the Xen hypervisor to a VM. Further, packet switching and copying were also performed when a VM voluntarily blocked.


Thus the VM's vNIC was polled for packets to be copied or switched, just before the scheduler was invoked to yield the processor and find another VM to run.

Offloading Packet Processing. We implemented the offloading of packet processing inside Xen's idle domain. The idle domain contains one idle vCPU for every physical CPU core in the system. The idle vCPUs have the lowest priority among all the vCPUs and therefore, they are scheduled to run on a physical CPU core only when none of the VMs' vCPUs are runnable on that core. The idle vCPUs execute an idle loop that checks for pending softirqs and tasklets, and executes the corresponding handlers. Finally, when there is no more work to be done, they enter one of the sleep states to save power.

In the Hyper-Switch architecture, we extended Xen's idle loop to copy and switch packets. A simple, low-overhead mechanism was used to offload packet processing to idle cores. The mechanism identified a suitable idle core based on offload criteria. The criteria were chosen to select an idle core that made the best use of the CPU caches. Further, this mechanism also ensured that the offloaded work was distributed across multiple idle cores using a simple hash function. The mechanism included a lightweight interprocessor messaging facility that was implemented using small fixed circular buffers. There was one buffer for every processor core in the system. It was used to communicate the vNICs that were being offloaded to a specific idle core. The Hyper-Switch-related packet processing was performed only at the lowest priority. The pending softirqs and tasklets were checked after each packet was processed. If there was ever higher-priority work to be done, then the offloaded packet processing was aborted.
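The priority rule in the extended idle loop can be sketched as follows: process offloaded packets one at a time and bail out as soon as any higher-priority work appears. The pending-work counters and function names are illustrative assumptions, not Xen's idle-loop code.

```c
/* Sketch of the extended idle loop body: drain offloaded copies at the
 * lowest priority, re-checking for softirq/tasklet work after every packet. */
#include <stdbool.h>
#include <stdio.h>

static int pending_copies   = 5;   /* packets this core was asked to copy */
static int pending_softirqs = 0;   /* stand-in for softirq/tasklet state  */

static bool higher_priority_work(void) { return pending_softirqs > 0; }
static void copy_one_packet(void)      { pending_copies--; }

/* One pass of the idle loop dedicated to offloaded packet processing. */
static void idle_offload_pass(void)
{
    while (pending_copies > 0) {
        copy_one_packet();
        if (higher_priority_work())
            return;              /* abort: never delay other tasks */
    }
    /* No packets left; the idle vCPU goes back to checking softirqs and,
     * subject to the hysteresis period, eventually enters a sleep state. */
}

int main(void)
{
    idle_offload_pass();
    printf("copies left: %d\n", pending_copies);
    return 0;
}
```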

5 Evaluation

This section presents a detailed evaluation of the Hyper-Switch architecture. The evaluation was performed using the Hyper-Switch prototype in Xen. The primary goal of this evaluation was to compare Hyper-Switch with existing architectures that implement the virtual switch either entirely within the driver domain or entirely within the hypervisor. Toward this end, the end-to-end performance under Hyper-Switch was compared to that under Xen's default driver domain-based architecture and KVM's hypervisor-based architecture.

5.1 Experimental Setup and Methodology

The experiments were run on a 32-core server with two 2.2 GHz AMD Opteron 6274 processors and 64 GB of memory. This processor is based on AMD's Bulldozer micro-architecture, where two cores (called a module) share the second level data cache (L2) and the instruction caches (L1i and L2i). Further, four modules (called a node) share the unified third level cache (L3). And each Opteron 6274 processor includes two such nodes. Under Xen (v4.2, mainline git repository xen-unstable.git, May 2012), the server was configured to run up to 32 para-virtualized (PV) Linux guest VMs (v2.6.38 pvops) and one PV Linux driver domain (v2.6.38 pvops), in addition to the privileged management domain 0 (Linux v3.4.4 pvops). The PV Linux guests use a specialized network driver which is optimized for the virtual network interface that the hypervisor provides to the VMs. The guest VMs were each configured with a single virtual CPU (vCPU) and 1 GB of memory. The driver domain was configured with up to 8 vCPUs and 2 GB of memory. But under Hyper-Switch the driver domain was given only a single vCPU since it only handled external network traffic. The server was directly connected to an external client using a 10 Gbps Ethernet link. The client consisted of a 2.67 GHz Intel Xeon W3520 quad-core CPU and 6 GB of memory. It ran an Ubuntu distribution of native Linux kernel v2.6.32. The CPUs at the external client were never a performance bottleneck in any of the experiments.

The netperf microbenchmark [2] was used in all the experiments to generate network traffic. In particular, netperf was used to create two types of network traffic: (1) TCP stream and (2) UDP request/response traffic. The TCP stream traffic was used to measure the achievable throughput. The UDP request/response traffic was used to measure the packet processing latency. Unless otherwise specified, the sendfile option was used on the transmit side in all experiments. The performance of Hyper-Switch was compared to the performance of Open vSwitch under both Xen [13] and KVM [26]. Para-virtualized network interfaces were used in all these systems. In the rest of this section, we use "KVM" to refer to the performance of Open vSwitch under KVM. Similarly, we use "Xen" to refer to the performance of Open vSwitch under Xen's default network I/O architecture. This should not be confused with the Hyper-Switch prototype that is also implemented in Xen.

Open vSwitch under Xen. In Xen, Open vSwitch is implemented entirely in the driver domain. Under Xen, all network packets are forwarded to the driver domain, where they are switched. Xen's backend driver called netback acts as an intermediary between the guest VMs and the virtual switching module in the driver domain. Netback is multi-threaded, and there is one netback (kernel) thread for every vCPU in the driver domain. Each guest VM's vNIC is bound to one of these threads. The packets associated with a specific vNIC are processed only by the thread to which it is bound. The recommended practice is to dedicate cores for running the driver domain's vCPUs. In this evaluation, the driver domain was configured with up to 8 vCPUs.


Figure 4: Throughput results from TCP stream traffic between a single pair of VMs under different payload sizes.

Figure 5: CPU load results from TCP stream traffic between a single pair of VMs under different payload sizes.

Open vSwitch under KVM. In KVM, Open vSwitch is implemented entirely in the hypervisor (also referred to as the KVM host). Under KVM's vhost-net architecture, all network packets are forwarded to the vhost-net driver in the host, which is similar to Xen's netback. But unlike netback, there is a separate vhost-net (kernel) thread for every vNIC in the system. The vhost-net threads can also be run on dedicated cores.

5.2 Experimental Results

5.2.1 Inter-VM Performance and Scalability

In these experiments, network performance was studied under different loads by setting up network traffic between VMs collocated on the same server.

Single VM Pair. In the first set of experiments, traffic was set up between just a single pair of VMs. Each guest VM's vCPU was pinned to a separate core within the same processor node to avoid any potential VM scheduling effects. Xen's driver domain was configured with 2 vCPUs. Recall that there is one netback kernel thread for every vCPU in Xen's driver domain. The driver domain's vCPUs were also pinned to separate processor cores, but on the same processor node where the corresponding guest VMs' vCPUs were pinned. Similarly, under KVM, the two vhost-net kernel threads (one per guest VM) were also pinned.

Figure 6: Latency results from UDP request/response traffic between a single pair of VMs under different payload sizes.


First, as shown in Figure 4, higher throughput was achieved under Hyper-Switch than under both of the existing architectures in the experiments where the TCP payload was between 4 KB and 64 KB, with stream-based traffic. On average, the throughput under Hyper-Switch, in these cases, was ∼56% higher than that under Xen and ∼61% higher than that under KVM. But there was not much performance difference at smaller packet sizes since in those experiments the transmitting VM was the performance bottleneck. Figure 5 shows the average CPU load (cycles/packet) in each of these experiments. Clearly, the Hyper-Switch is more efficient in processing packets than both the existing architectures in KVM and Xen.

Second, as shown in Figure 6, a higher number of transactions per second was achieved under Hyper-Switch, across all UDP payload sizes, with request/response traffic. A transaction comprises a single request followed by a single response in the opposite direction. So these results indicate that the round-trip packet latencies were the lowest under the Hyper-Switch among all the three architectures. On average, the transactions per second under Hyper-Switch were ∼117% higher than that under Xen and ∼222% higher than that under KVM. So the Hyper-Switch architecture is suited for both bulk as well as latency-sensitive network traffic. Further, these results show the benefit from optimizations such as preemptive copying and immediate notification of blocked VMs that enable timely delivery of packets.

Pairwise Scalability Experiments. In the next set of experiments, the performance scalability of the three architectures was studied by setting up TCP stream-based traffic flows between 1–16 pairs of VMs in one direction. A TCP payload size of 64 KB was used in all the subsequent experiments.


Figure 7: Pairwise performance scalability results. Multiple concurrent TCP streams were set up between pairs of VMs. The figure shows aggregate inter-VM throughput as the number of VM pairs is increased.

Figure 8: All-to-all performance scalability results. Multiple concurrent TCP streams were set up between all VMs. The figure shows aggregate inter-VM throughput as the number of VMs is increased.

Again, the guest VMs' vCPUs, the vhost-net kernel threads (KVM), and the driver domain's vCPUs (Xen) were pinned to specific processor cores. Also, the VMs that were communicating with each other were always pinned to the same processor node. The pinning was done such that, in each experiment, the load was uniformly distributed across all the processor modules and nodes in the system. For instance, one core from each module was used for pinning VMs across the system, before the other cores were used. Under KVM, the guest VMs' vCPUs and the vhost-net kernel threads were pinned. As a result, when the system was scaled beyond 8 pairs of VMs, each processor core had to run one of the guest VMs' vCPUs and one of the vhost-net threads. Under Xen, the driver domain was configured with 8 vCPUs. The driver domain's vCPUs were distributed by pinning two of them to each processor node in the system. Then the guest VMs' vCPUs were evenly distributed across the remaining processor cores.

The results in Figure 7 show that the Hyper-Switch architecture exhibited much better performance scalability than both the existing architectures. Specifically, under Hyper-Switch, the performance reached a peak throughput of ∼81 Gbps before it started to flatten out. But the peak throughput was only ∼47 Gbps and ∼31 Gbps under KVM and Xen respectively. Further, the performance under these existing architectures did not scale beyond 4 pairs of VMs. On average, the throughput under Hyper-Switch was ∼55% higher than that under KVM and ∼146% higher than that under Xen. Figure 7 also shows three distinct regions in Hyper-Switch's performance curve: (1) The performance scaled almost linearly, from ∼16.2 Gbps to ∼62.7 Gbps, between 1 and 4 pairs of VMs. (2) The performance continued to scale linearly but at a lower rate, from ∼62.7 Gbps to ∼81 Gbps, between 5 and 7 pairs of VMs. (3) The performance did not scale beyond 8 pairs of VMs.

Fundamentally, the network performance is determined by the number of packets that can be transferred between the source and destination VMs in a given time. A typical packet transfer involves switching and packet copying overheads. But there are limits to how many packets can be processed by a single processor. This is determined in part by the underlying hardware architecture. The hardware determines how efficiently the available processor time is used to process—switch and copy—packets. Today's processors are incredibly complex and therefore, there are several factors that impact this efficiency. In particular, the structure of the memory subsystem can have a significant impact on performance [25]. This includes the size and levels of the CPU caches, the maximum number of outstanding reads/writes/cache misses, the available memory bandwidth, the number of channels to the system memory, and so on. One can scale the performance beyond the limits imposed by a single processor core by increasing concurrency, i.e., by using multiple processor cores. But some of the system resources could be shared between processor cores—such as CPU caches, memory channels, etc.—which could potentially reduce the available concurrency. When additional VMs are added to the system, there is a natural increase in concurrency since many of the switching tasks can be concurrently performed under each VM's context. Further, under Hyper-Switch, the offloading of packet processing adds to this concurrency. When choosing an idle core for offloading packet processing, preference is given to idle processor cores that are on the same node as the receiving VM's vCPU, to take advantage of any CPU cache locality. Further, the packet processing is offloaded only to a processor core in an idle module, i.e., a module where both the processor cores are idle, to avoid potential cache interference effects [3].

In the first region of the curve (Figure 7), each pair of VMs was run on a separate processor node. So the packet processing could be offloaded to other cores within the same node. As a result, under these conditions, the best scalability was achieved. In the second region, some of the packet processing had to be offloaded to idle modules on other nodes in the system. This was not as efficient since packets had to be copied across processor nodes. Hence the performance scalability was reduced. In the third region, the performance stopped scaling in part due to the reduction in the offloading of packet processing since most of the processor modules were busy. Also, some of the VMs' vCPUs were running on two cores within the same processor module. So the cache interference effects also came into play. Finally, as more VMs were added to the system, there was increased contention for the system resources such as CPU caches. So, effectively, all these factors offset the increase in packet processing concurrency and hence the performance stopped scaling.

All-to-all Scalability Experiments. In the second set of scalability experiments, TCP stream-based traffic was set up between every pair of VMs in the system in both directions. These experiments were designed to generate significant load on the network by having tens of VMs concurrently communicating with each other. For instance, when there were 30 VMs in the system, there were as many as 870 concurrent TCP flows. The configuration and setup was similar to the previous set of experiments.

Figure 8 shows the results from these experiments. The performance again scaled much better under Hyper-Switch than under KVM or Xen. Specifically, under Hyper-Switch, the performance reached a peak throughput of ∼65 Gbps as compared to ∼55 Gbps and ∼31 Gbps under KVM and Xen respectively. Similar to the previous set of experiments, the performance curve under Hyper-Switch scaled up very well at the beginning before tapering off. The performance analysis presented with the previous results is applicable here as well. In fact, the contention for system resources is even higher in this case, due to the significant load placed on the system.

5.2.2 External Performance

In the external experiments, the network traffic was set up between guest VM(s) and the external client. The driver domain, under Hyper-Switch and Xen, was configured with only a single vCPU. In the TX and RX experiments, there were one or two guest VMs (concurrently) sending and receiving packets respectively. The guest VMs' vCPUs and the driver domain's vCPU were again pinned.

The results from these experiments showed that the Hyper-Switch's performance was comparable to, and in some cases even better than, the performance under KVM and Xen. In the TX experiments, with a single guest VM transmitting packets, line rate of ∼9.4 Gbps was achieved under both Hyper-Switch and Xen. But under KVM, the TX VM's vCPU was a performance bottleneck. Therefore, only ∼7.8 Gbps was possible in this case. In the RX experiments, with one guest VM receiving packets, the CPU at the guest VM was the bottleneck. So line rate was not achieved under any of the architectures. But the performance was better under Hyper-Switch (7.5 Gbps) and KVM (7.8 Gbps) than Xen (4.1 Gbps). But with two guest VMs receiving packets, line rate of ∼9.4 Gbps was achieved under both Hyper-Switch and KVM. Under Xen, the driver domain's vCPU was the performance bottleneck. Therefore, having a second guest VM receive packets had no positive impact on the aggregate throughput. These results show that the driver domain under Hyper-Switch can send and receive packets at 10 GbE line rate using a single CPU core. So the driver domain consumes minimal resources.

TCP request/response traffic was also set up between a single guest VM and the external client. In these experiments, the Hyper-Switch achieved 13,243 transactions per second as compared to 10,721 and 11,342 under KVM and Xen respectively. As explained before, a higher number of transactions per second indicates lower round-trip latency. Therefore, despite the "longer" route taken by packets under Hyper-Switch due to their forwarding through both the hypervisor and the driver domain, the packet latencies were still the lowest under Hyper-Switch.

5.2.3 Design Evaluation

Experiments were also run to determine the offload criteria under Hyper-Switch. In these experiments, the packet processing was offloaded to different CPU cores relative to where the transmitting and receiving VMs' vCPUs were running. The results from these experiments (not presented here due to lack of space) indicated that, for best performance, packet processing must be offloaded to a processor module where both the cores were idle. Further, while searching for idle modules, first the processor node on which the receiving VM's vCPU was running must be searched, before searching the other nodes in the system. However, the offload criteria could vary depending on a processor's cache hierarchy. So the exact criteria must be determined based on the particular hardware platform on which Hyper-Switch is run.

6 Related Work

The current state-of-the-art network subsystem architectures for virtualized servers can be broadly classified into three categories. The first category of systems includes a simple network card (NIC) that is virtualized by a software intermediary, either the hypervisor (e.g., KVM [26], VMware ESX server [10]) or a driver domain (e.g., Xen [13]). Today, this category of systems is most commonly used in virtualized servers since it offers a rich set of features, including security, isolation, and mobility. There are several software virtual switches—such as Linux bridge [1], VMware vswitch [32], Cisco Nexus 1000v [8], Open vSwitch [24], etc.—that are used in these systems. Recently, Rizzo et al. [30] also proposed a new virtual switching solution based on their netmap API. They use memory-mapped buffers to avoid data copies inside the host. It will be interesting to see if the netmap API can be exported all the way to the VMs. Unlike Hyper-Switch, all these existing systems implement the entire virtual switch within a single software domain—either the hypervisor or the driver domain. But we believe that the optimizations proposed in this paper are applicable to many of these solutions. Further, in this paper, the Hyper-Switch's performance was only compared to the performance under Xen and KVM. A recent report from VMware has shown an impressive performance of 27 Gbps between two VMs running on their vSphere architecture [33]. Unfortunately, it is hard to compare this to Hyper-Switch's performance since the hardware platforms used in the evaluations are vastly different.

The second category of systems employs more sophisticated NICs (direct-access NICs) with multiple contexts that present a vNIC interface directly to each VM [20, 27, 34]. Today, there exists an industry-wide standard called SR-IOV, which has been adopted by several network interface vendors to implement this solution [6, 16, 22]. These NICs also implement a virtual switch internally within the hardware. However, today most of them only implement a rudimentary form of switching. The sNICh [28] architecture explores the idea of switch/server integration. It implements a full-fledged switch while enabling a low-cost NIC solution by exploiting its tight integration with the server internals. This makes sNICh more valuable than simply a combination of a network interface and a datacenter switch. Luo et al. [19] propose offloading Open vSwitch’s in-kernel data path to programmable NICs. Similarly, one can also imagine offloading the Hyper-Switch’s data plane to the NIC hardware. These solutions can enable high performance since the VMs directly communicate with the NIC. But, in general, they lack the flexibility that pure software solutions offer.

The third category of switches attempts to leverage the functionality that already exists in today’s datacenter switches. This approach uses an external switch for switching all network packets [9, 23]. But, fundamentally, this approach results in a waste of network bandwidth since even packets from inter-VM traffic must travel all the way to the external switch and back again.

There have also been proposals to distribute virtual networking across all endpoints within a data center [7, 11]. Here the software-based components reside on all servers that collaborate with each other and implement network virtualization and access control for VMs, while network switches are completely unaware of the individual VMs on the endpoints. All these architectures are aimed at solving the network management problem, which is not the focus of this paper. But the Hyper-Switch can easily be a part of these solutions.

7 Conclusions

This paper presented the Hyper-Switch architecture that combines the best of the existing last hop virtual switching architectures. It hosts the device drivers in a driver domain to isolate any faults and the last hop virtual switch in the hypervisor to perform efficient packet switching. In particular, the hypervisor implements just the fast, efficient data plane of a flow-based software switch. The driver domain is needed only for handling external network traffic.

Further, this paper also presented several carefully designed optimizations that enable efficient packet processing, better utilization of the available CPU resources, and higher concurrency. The optimizations take advantage of the Hyper-Switch data plane’s integration within the hypervisor. As a result, the Hyper-Switch enables much improved and scalable network performance, while maintaining the robustness and fault tolerance that derive from the use of driver domains. Moreover, these optimizations should be a part of any virtual switching solution that aims to deliver high performance.

This paper also presented an evaluation of the Hyper-Switch architecture using a prototype implemented in the Xen platform. The evaluation showed that, for inter-VM network communication, the Hyper-Switch achieved higher performance and exhibited better scalability than both Xen’s default network I/O architecture and KVM’s vhost-net architecture. Further, the external network performance under the Hyper-Switch was comparable to, and in some cases even better than, the performance under Xen and KVM.

References

[1] Linux Ethernet bridge. http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge.

[2] Netperf: A network performance benchmark. http://www.netperf.org, 1995. Revision 2.5.

[3] AMD CORPORATION. Shared level-1 instruction-cache performance on AMD family 15h CPUs.

[4] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T. L., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles (October 2003), pp. 164–177.

[5] BENSON, T., AKELLA, A., AND MALTZ, D. A. Network traffic characteristics of data centers in the wild. In IMC (2010).

[6] BROADCOM CORPORATION. BCM57712 product brief. http://www.broadcom.com/collateral/pb/57712-PB00-R.pdf, January 2010.

[7] CABUK, S., DALTON, C. I., RAMASAMY, H., AND SCHUNTER, M. Towards automated provisioning of secure virtualized networks. In CCS ’08: Proceedings of the 15th ACM Conference on Computer and Communications Security (October 2008), pp. 235–245.

[8] CISCO SYSTEMS, INC. Cisco Nexus 1000V series switches. http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9902/data_sheet_c78-492971.pdf, August 2011.

[9] CONGDON, P. Virtual Ethernet port aggregator. http://www.ieee802.org/1/files/public/docs2008/new-congdon-vepa-1108-v01.pdf, November 2008.

[10] DEVINE, S., BUGNION, E., AND ROSENBLUM, M. Virtualization system including a virtual machine monitor for a computer with a segmented architecture. US Patent #6,397,242 (October 1998).

[11] EDWARDS, A., FISCHER, A., AND LAIN, A. Diverter: A new approach to networking within virtualized infrastructures. In WREN ’09: Proceedings of the ACM SIGCOMM Workshop: Research on Enterprise Networking (August 2009).

[12] ESMAEILZADEH, H., BLEM, E., ST. AMANT, R., SANKARALINGAM, K., AND BURGER, D. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (New York, NY, USA, 2011), ISCA ’11, ACM, pp. 365–376.

[13] FRASER, K., HAND, S., NEUGEBAUER, R., PRATT, I., WARFIELD, A., AND WILLIAMS, M. Safe hardware access with the Xen virtual machine monitor. In OASIS ’04: Proceedings of the 1st Workshop on Operating System and Architectural Support for the on demand IT Infrastructure (October 2004).

[14] GREENBERG, A., HAMILTON, J., MALTZ, D. A., AND PATEL, P. The cost of a cloud: Research problems in data center networks. SIGCOMM Computer Communication Review 39, 1 (2009), 68–73.

[15] INTEL. “Intel 1000 Core Chip”. http://goo.gl/5lpY8, 2010.

[16] INTEL CORPORATION. Intel 82599 10 GbE controller datasheet. http://download.intel.com/design/network/datashts/82599_datasheet.pdf, October 2011. Revision 2.72.

[17] KUMAR, S., RAJ, H., SCHWAN, K., AND GANEV, I. Re-architecting VMMs for multicore systems: The sidecore approach. In WIOSCA ’07: Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture (June 2007).

[18] LANDAU, A., BEN-YEHUDA, M., AND GORDON, A. SplitX: Split guest/hypervisor execution on multi-core. In WIOV ’11: Proceedings of the 4th Workshop on I/O Virtualization (May 2011).

[19] LUO, Y., MURRAY, E., AND FICARRA, T. Accelerated virtual switching with programmable NICs for scalable data center networking. In VISA ’10: Proceedings of the 2nd ACM SIGCOMM Workshop on Virtualized Infrastructure Systems and Architectures (September 2010).

[20] MANSLEY, K., LAW, G., RIDDOCH, D., BARZINI, G., TURTON, N., AND POPE, S. Getting 10 Gb/s from Xen: Safe and fast device access from unprivileged domains. In Proceedings of the Euro-Par Workshop on Parallel Processing (August 2007), pp. 224–233.

[21] MENON, A., SANTOS, J. R., TURNER, Y., JANAKIRAMAN, G., AND ZWAENEPOEL, W. Diagnosing performance overheads in the Xen virtual machine environment. In VEE ’05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (June 2005), pp. 13–23.

[22] PCI-SIG. Single Root I/O Virtualization. http://www.pcisig.com/specifications/iov/single_root.

[23] PELISSIER, J. VNTag 101. http://www.ieee802.org/1/files/public/docs2009/new-pelissier-vntag-seminar-0508.pdf, 2009.

[24] PFAFF, B., PETTIT, J., KOPONEN, T., AMIDON, K., CASADO, M., AND SHENKER, S. Extending networking into the virtualization layer. In HotNets-VIII: Proceedings of the Workshop on Hot Topics in Networks (October 2009).

[25] PORTERFIELD, A., FOWLER, R., MANDAL, A., AND LIM, M. Y. Empirical evaluation of multi-core memory concurrency. Tech. rep., RENCI, January 2009. www.renci.org/publications/techreports/TR-09-01.pdf.

[26] QUMRANET. KVM: Kernel-based virtualization driver. http://www.redhat.com/f/pdf/rhev/DOC-KVM.pdf.

[27] RAJ, H., AND SCHWAN, K. High performance and scalable I/O virtualization via self-virtualized devices. In HPDC ’07: Proceedings of the IEEE International Symposium on High Performance Distributed Computing (June 2007).

[28] RAM, K. K., MUDIGONDA, J., COX, A. L., RIXNER, S., RANGANATHAN, P., AND SANTOS, J. R. sNICh: Efficient last hop networking in the data center. In ANCS ’10: Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (October 2010), pp. 1–12.

[29] RAM, K. K., SANTOS, J. R., TURNER, Y., COX, A. L., AND RIXNER, S. Achieving 10 Gb/s using safe and transparent network interface virtualization. In VEE ’09: Proceedings of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (March 2009), pp. 61–70.

[30] RIZZO, L., AND LETTIERI, G. VALE, a switched Ethernet for virtual machines. In CoNEXT ’12: Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies (December 2012), pp. 61–72.

[31] SANTOS, J. R., TURNER, Y., JANAKIRAMAN, G., AND PRATT, I. Bridging the gap between software and hardware techniques for I/O virtualization. In ATC ’08: Proceedings of the USENIX Annual Technical Conference (June 2008), pp. 29–42.

[32] VMWARE, INC. VMware virtual networking concepts. http://www.vmware.com/files/pdf/virtual_networking_concepts.pdf, 2007.

[33] VMWARE, INC. VMware vSphere 4.1 networking performance. http://www.vmware.com/files/pdf/techpaper/Performance-Networking-vSphere4-1-WP.pdf, April 2011.

[34] WILLMANN, P., SHAFER, J., CARR, D., MENON, A., RIXNER, S., COX, A. L., AND ZWAENEPOEL, W. Concurrent direct network access for virtual machine monitors. In HPCA ’07: Proceedings of the 13th International Symposium on High-Performance Computer Architecture (February 2007).
