Recent advance in netmap/VALE(mSwitch)
Michio Honda, Felipe Huici (NEC Europe Ltd.)
Giuseppe Lettieri and Luigi Rizzo (Universita di Pisa)
Kernel/VM@Jimbo-cho, Japan on Dec. 8 2013
[email protected] / @michioh
Outline
• netmap API basics
  – Architecture
  – How to write apps
• VALE (mSwitch)
  – Architecture
  – System design
  – Evaluation
  – Use cases
NETMAP API BASICS
netmap overview
• A fast packet I/O mechanism between the NIC and user space
  – Removes unnecessary metadata (e.g., sk_buff) allocation
  – Amortizes system call costs; reduces or removes data copies
From http://info.iet.unipi.it/~luigi/netmap/
Performance
• Saturates a 10 Gbps pipe even at low CPU frequency
From http://info.iet.unipi.it/~luigi/netmap/
netmap API (initialization)
• open(“/dev/netmap”) returns a file descriptor
• ioctl(fd, NIOCREG, arg) puts an interface in netmap mode
• mmap(…, fd, 0) maps buffers and rings
From http://info.iet.unipi.it/~luigi/netmap/
netmap API (TX)
• TX
  – Fill up to avail buffers, starting from slot cur
  – ioctl(fd, NIOCTXSYNC) queues the packets
• poll() can be used for blocking I/O
From http://info.iet.unipi.it/~luigi/netmap/
netmap API (RX)
• RX
  – ioctl(fd, NIOCRXSYNC) reports newly received packets
  – Process up to avail buffers, starting from slot cur
• poll() can be used for blocking I/O
From http://info.iet.unipi.it/~luigi/netmap/
Other features
• Multi-queue support
  – One netmap ring is initialized per physical ring
    • e.g., different pthreads can be assigned to different netmap/physical rings
• Host stack support
  – The NIC is put into netmap mode, resetting its PHY
  – The host stack still sees the interface, and packets can be sent to/from the NIC via “software rings”
    • Either implicitly by the kernel or explicitly by the app
From http://info.iet.unipi.it/~luigi/netmap/
Implementation
• Available for FreeBSD and Linux
  – The Linux code is glued onto the FreeBSD one
• Common code
  – Control path, system call backends, memory allocator, etc.
• Device-specific code
  – Each supported driver implements a few functions:
    • nm_register(struct ifnet *ifp, int onoff)
      – Put the NIC into netmap mode; allocate netmap rings and slots
    • nm_txsync(struct ifnet *ifp, u_int ring_nr)
      – Flush out the packets from the netmap ring filled by the user
    • nm_rxsync(struct ifnet *ifp, u_int ring_nr)
      – Refill the netmap ring with received packets
Implementation (cont.)
• Small modifications in device drivers
From http://info.iet.unipi.it/~luigi/netmap/
VALE (MSWITCH): A NETMAP-BASED, VERY FAST SOFTWARE SWITCH
Software switch
• Switches packets between network interfaces
  – On a general-purpose OS and processor
  – Virtual and hardware interfaces
• 20 years ago
  – Prototyping; a low-performance alternative
• Now and in the near future
  – Replacement for hardware switches
  – Hosting virtual machines (incl. virtual network functions)
Performance of today’s software switch
• Forwarding packets from one 10 Gbps NIC to another
  – Xeon E5-1650 (3.8 GHz), 1 CPU core used
  – Below 1 Gbps for minimum-sized packets
[Figure: throughput (Gbps) vs. packet size (bytes, 64–1024) for the FreeBSD bridge and Open vSwitch]
Problems
1. Inefficient packet I/O mechanism
  – Today’s software switches use dynamically allocated, heavyweight metadata (e.g., sk_buff) designed for end systems
  – This should be simplified, since the packet is merely forwarded
  – For switches it is more important to process small packets efficiently
2. Inefficient packet switching algorithm
  – How to move packets from the source to the destination(s) efficiently?
  – Traditional way:
    • Lock a destination, send a single packet, then unlock the destination
    • Inefficient due to locking cost/contention
Problems (cont.)
3. Lack of flexibility in packet processing
  – How is a packet’s destination decided?
  – One could use a layer 2 learning bridge to decide a packet’s destination
  – One could use OpenFlow packet matching to do so
Solutions
1. “Inefficient packet I/O mechanisms”
  – Simple, minimalistic packet representation (the netmap API*)
    • No metadata allocation cost
    • Reduced cache pollution
2. “Inefficient packet switching”
  – Group multiple packets going to the same destination
  – Lock the destination only once per group of packets
* Netmap, a novel framework for fast packet I/O, http://info.iet.unipi.it/~luigi/netmap/, Luigi Rizzo, Università di Pisa
Bitmap-based forwarding algorithm
• The algorithm in the original VALE
• Supports unicast, multicast and broadcast
  – Get pointers to a batch of packets
  – Identify the destination(s) of each packet and represent them as a bitmap
  – Lock each destination, and send all the packets going there
• Problem
  – Scalability issue in the presence of many destination ports
VALE, a Virtual Local Ethernet, http://info.iet.unipi.it/~luigi/vale/, Luigi Rizzo, Giuseppe Lettieri, Università di Pisa
[Figure 3: Bitmap-based packet forwarding. Packets are labeled p0 to p4; for each packet, the destination(s) are identified and represented in a bitmap (one bit per possible destination port). The forwarder considers each destination port in turn, scanning the corresponding column of the bitmap to identify the packets bound to that port.]
[Figure 4: List-based packet forwarding. Packets are labeled p0 to p4; destination port indices are labeled d1 to d254 (254 means broadcast). The sender’s “next” field denotes the next packet in a batch; each receiver’s list has “head” and “tail” pointers. Once the destinations of all packets have been identified, the forwarder serves each destination by scanning that destination’s private list and the broadcast one.]
[…] Cache pollution might be significant, so we may revise this decision in the future.

Copies are instead the best option when switching packets from/to virtual ports: buffers cannot be shared between virtual ports for security reasons, and altering page mappings to transfer ownership of the buffers is immensely more expensive.

4.2 Packet Forwarding Algorithm
The original VALE switch operates on batches of packets to improve efficiency, and groups packets to the same destination before forwarding so that locking costs can be amortized. The grouping algorithm uses a bitmap (see Figure 3) to indicate packets’ destinations; this is an efficient way to support multicast, but has a forwarding complexity that is linear in the number of connected ports (even for unicast traffic and in the presence of idle ports). As such, this does not scale well for configurations with a large number of ports.

To reduce this complexity, in mSwitch we only allow unicast or broadcast packets (multicast is mapped to broadcast). This lets us replace the bitmap with N+1 lists, one per destination port plus one for broadcast traffic (Figure 4). In the forwarding phase, we only need to scan two lists (one for the current port, one for broadcast), which makes this a constant-time step, irrespective of the number of ports. Figures 8 and 9 show the performance gains.

4.3 Improving Output Parallelism
Once the list of packets destined to a given port has been identified, the packets must be copied to the destination port and made available on the ring. In the original VALE implementation, the sending thread locks the destination port for the duration of the entire operation; this is done for a batch of packets, not a single one, and so can take relatively long.

Instead, mSwitch improves parallelism by operating in two phases: a sender reserves (under lock²) a sufficiently large contiguous range of slots in the output queue, then releases the lock during the copy, allowing concurrent operation on the queue, and finally acquires the lock again to advance queue pointers. This latter phase also handles out-of-order completions of the copy phases, which may occur for many reasons (different batch or packet sizes, cache misses, page faults). With this new strategy, the queue is locked only for the short intervals needed to reserve slots and update queue pointers. As an additional side benefit, we can now tolerate page faults during the copy phase, which allows using userspace buffers as data sources. We provide evaluation results for this mechanism in Section 5.3.

4.4 Pluggable Packet Processing
VALE uses a hard-wired packet processing function that implements a learning Ethernet bridge. In mSwitch, each switch instance can configure its own packet processing function by loading a suitable kernel module which implements it. The packet processing function is called once for each packet in a batch, before the actual forwarding takes place, and the response is used to add the packet to the unicast or broadcast lists discussed in Section 4.2. The function receives a pointer to the packet’s buffer and its length, and should return the destination port (with special values indicating “broadcast” or “drop”). The function can perform arbitrary actions on the packet, including modifying its content or length, within the size of the buffer. Speed is not a […]

² The spinlock could be replaced by a lock-free scheme, but we doubt this would provide any measurable performance gain.
List-based forwarding algorithm
• Algorithm in the current VALE (mSwitch) • Support for unicast and broadcast – Make a linked-list for each destination
– Broadcast packets are���mapped into destination ���index 254
– Scan each destination,���and broadcast packets���are inserted in-order
Solutions (cont.)
3. “Lack of flexibility in packet processing”
  – Separate the switching fabric from packet processing
  – Switching fabric
    • Moves packets quickly
  – Packet processing
    • Decides packets’ destinations and tells the switching fabric
    • Returns:
      – The index of the destination port for unicast
      – NM_BDG_BROADCAST for broadcast
      – NM_BDG_NOPORT to drop the packet
  – By default, L2 learning is used
typedef u_int (*BDG_LOOKUP_T)(char *buf, u_int len, uint8_t *ring_nr, struct netmap_adapter *srcif);
VALE (mSwitch) architecture
• Packet forwarding (identifying packets’ destinations (packet processing) and copying packets to the destination ring) takes place in the sender’s context
  – The receiver just consumes the packets
[Figure: mSwitch architecture. User apps and VMs (app1/vm1 … appN/vmN) attach to virtual interfaces via the netmap API; the OS stack attaches via the socket API; the switching fabric with pluggable packet processing runs in the kernel, connecting NICs and virtual interfaces.]
Other features
• Indirect buffer support
  – A netmap slot can contain a pointer to the actual buffer
  – Useful to eliminate the data copy from a VM’s backend into a netmap slot
• Support for large packets
  – Multiple netmap slots (2048 bytes each by default) can be used to hold a single packet
Bare mSwitch performance
• NIC to NIC (10 Gbps)
• “Dummy” processing module
[Figure: throughput (Gbps) vs. packet size (bytes) for the dummy module, NIC to NIC, against the 10 Gbps line rate]
[Figure 5: Throughput between 10 Gb/s NICs and virtual ports for different packet sizes (64B, 128B, 256B) and CPU frequencies (1.3–3.8 GHz). Panels: (a) NIC to NIC, (b) NIC to virtual interface, (c) virtual interface to NIC. Solid lines are for a single NIC, dashed ones for two.]
[…] (Intel Xeon [email protected], 16GB of RAM) with a similar card. Unless otherwise stated, we run the CPU of the mSwitch server at 3.8 GHz. In terms of operating system, we rely on FreeBSD 10³ for most experiments, and Linux 3.9 for the Open vSwitch experiments in the next section. To generate and count packets we use pkt-gen, a fast generator that uses the netmap API and so can be plugged into mSwitch’s virtual ports. Throughout, we use Gb/s to mean Gigabits per second, Mp/s for millions of packets per second, and unless otherwise stated we use a batch size of 1024 packets.

5.1 Basic Performance
For the first experiments, and to derive a set of baseline performance figures, we implemented a dummy plug-in module. The idea is that the source port is configured to mark packets with a static destination port. From there, the packet goes to the dummy module, which returns immediately (thus giving us a base figure for how much it costs for packets to go through the entire switch’s I/O path), and then the switch sends the packet to the destination port.

We evaluate mSwitch’s throughput for different packet sizes and combinations of NICs and virtual ports. We further vary our CPU’s frequency by either using Turbo Boost to increase it or a sysctl to decrease it; this lets us shed light on CPU-bound bottlenecks.

To begin, we connect the two 10 Gb/s ports of the NIC to the switch, and have it forward packets from one to the other using a single CPU core (figure 5(a)). With this setup, we obtain 10 Gb/s line rate for all CPU frequencies for 256-byte packets or larger, and line rate for 128-byte ones starting at 2.6 GHz. Minimum-sized packets are more CPU-intensive, requiring us to enable Turbo Boost to reach 96% of line rate (6.9/7.1 Gb/s or 14.3/14.9 Mp/s). In a separate experiment we confirmed that this small deviation from line rate was a result of our lower-frequency receiver server.

For the next experiment we attach the two NIC ports and two virtual ports to the switch, and conduct NIC-to-virtual-port tests (10 Gb/s ones using one pair of NIC and virtual port, and 20 Gb/s ones with two pairs). In addition, we assign one CPU core per port pair. The results in figure 5(b) are similar to those in the NIC-to-NIC case, with line rate values for all packet sizes at 3.8 GHz. The graph also shows that mSwitch scales well with the number of ports and CPU cores: we achieve line rate for two 10 Gb/s ports for all packet sizes. Finally, figure 5(c) presents throughput numbers in the opposite direction, from virtual ports to NICs, with similar results.

³ mSwitch can also run in Linux.
[Figure 6: Forwarding performance (Gbps) between two virtual ports vs. packet size (60B–64KB) for 1, 2 and 3 CPU cores (and rings) per port.]
The previous set of experiments all involved NICs. To get an idea of mSwitch’s raw switching performance, we attach a pair of virtual ports to the switch and forward packets […]
Experiments are done with a Xeon E5-1650 CPU (6 cores, 3.8 GHz with Turbo Boost), 16GB DDR3 RAM (quad channel) and Intel X520-T2 10 Gbps NICs
Bare mSwitch performance
• Virtual port to virtual port
• “Dummy” processing module
Bare mSwitch performance
• “Dummy” packet processing module
• N virtual ports to N virtual ports
[Figure 7: Switching capacity (Gbps) with an increasing number of virtual ports (2–8), for 64B, 1514B and 64KB packets: (a) experiment topologies (unicast and broadcast), (b) unicast throughput, (c) broadcast throughput. For unicast, each src/dst port pair is assigned a single CPU core; for broadcast each port is given a core. For setups with more than 6 ports (our system has 6 cores) we assign cores in a round-robin fashion.]
[…] between them. We also leverage mSwitch’s ability to assign multiple packet rings to each port (and a CPU core to each of the rings) to further scale performance. Figure 6 shows throughput when assigning an increasing number of CPU cores to each port (up to 3 per port, at which point all 6 cores in our system are in use). We see a rather high rate of about 185.0 Gb/s for 1514-byte packets and 25.6 Gb/s (53.3 Mp/s) for minimum-sized ones. We also see a peak of 215.0 Gb/s for 64KB “packets”; given that we are not limited by memory bandwidth (our 1333 MHz quad-channel memory has a maximum theoretical rate of 10.6 GB/s per channel, so roughly 339 Gb/s in total), we suspect the limitation to be CPU frequency.

5.2 Switching Scalability
Having tested mSwitch’s performance with NICs attached, as well as between a pair of virtual ports, we now investigate how its switching capacity scales with additional numbers of virtual ports.

We begin by testing unicast traffic, as shown in figure 7(a) (top). We use an increasing number of port pairs, each pair consisting of a sender and a receiver. Each pair is handled by a single thread which we pin to a separate CPU core (as long as there are more cores than port pairs; when that is not the case, we pin more than one pair of ports to a CPU core in a round-robin fashion).

Figure 7(b) shows the results. We observe high switching capacity. For minimum-sized packets, mSwitch achieves rates of 26.2 Gb/s (54.5 Mp/s) with 6 ports, a rate that decreases slightly down to 22.5 Gb/s when we start sharing cores among ports (8 ports). For 1514-byte packets we see a maximum rate of 172 Gb/s with 4 ports, down to 136.5 Gb/s for 8 ports.

For the second test, we perform the same experiment but this time each sender transmits broadcast packets (figure 7(a), bottom). An mSwitch sender is slower than a receiver because packet switching and processing happen in the sender’s context. This experiment is thus akin to having more senders, meaning that the receivers should be less idle than in the previous experiment and thus cumulative throughput should be higher (assuming no other bottlenecks).

This is, in fact, what we observe (figure 7(c)): in this case, mSwitch yields a rate of 46 Gb/s (96 Mp/s) for 64-byte packets when using all 6 cores, going down to 38 Gb/s when sharing cores among 8 ports. 1514-byte packets result in a maximum rate of 205.4 Gb/s for 6 ports and 170.5 Gb/s for 8 ports. As expected, all of these figures are higher than in the unicast case.
[Figure 8: Packet forwarding throughput (Gbps) from a single source port to 1–5 destination ports using minimum-sized packets, comparing mSwitch’s forwarding algorithm (list) to VALE’s (bitmap).]
Next, we evaluate the cost of having a single source send packets to an increasing number of destination ports, up to 5 of them, at which point all 6 cores in our system are in use. Figure 8 shows aggregate throughput from all destina- […]
mSwitch’s Scalability
• A single virtual port to many virtual ports
  – Bitmap- vs. list-based algorithm
  – The list-based algorithm scales very well
[…]ments over the bitmap algorithm, which yields throughputs of 12.1 Gbps and 9.8 Gbps, respectively.

In the final baseline test we compare mSwitch’s algorithm to that of VALE in the presence of large numbers of ports. In order to remove effects arising from context switches and the number of cores in our system from the comparison, we emulate the destination ports as receive packet queues that automatically empty once packets are copied to them. This mechanism incurs most of the costs of forwarding packets to the destination ports without being bottlenecked by having to wait for the destination ports’ context/CPU core to run. We further use minimum-sized packets and a single CPU core.
[Figure 9: Comparison of mSwitch’s forwarding algorithm (list) to that of VALE (bitmap) in the presence of a large number of active destination ports (1–250; single sender, minimum-sized packets). As ports increase, each destination receives a smaller batch of packets in each round, reducing the efficiency of batch processing.]

Figure 9 clearly shows the advantage of the new algorithm: we see a linear decrease with increasing number of ports as opposed to an exponential one with the bitmap algorithm. As a point of comparison, for 60 ports the list algorithm yields a rate of 11.1 Gbps (23.2 Mpps) versus 3.0 Gbps (6.4 Mpps) for the bitmap one. This is a rather large improvement, especially if we consider that the list algorithm supports much larger numbers of ports (the bitmap algorithm supports up to 64 ports only; the graph shows only up to 60 for simplicity of illustration). In the presence of 250 destination ports mSwitch is still able to produce a rather respectable 7.2 Gbps forwarding rate.
5.3 Output Queue Management
We now evaluate the performance gains derived from using mSwitch’s parallelization in handling output queues. Recall from Section 4.3 that mSwitch allows concurrent copies to […]
We use minimum-sized packets with a single CPU core
Learning bridge performance
• mSwitch-learn: a pure learning-bridge processing module
  – Adds the cost of MAC-address hashing to packet processing
[Figure 12: Packet forwarding performance (Gbps) vs. packet size (64–1024 bytes) for different mSwitch modules: (a) layer 2 learning bridge (FreeBSD bridge vs. mSwitch-learn), (b) 3-tuple filter for user-space network stack support (mSwitch-3-tuple), (c) Open vSwitch (OVS vs. mSwitch-OVS). For the learning bridge and Open vSwitch modules we compare against their non-mSwitch versions (FreeBSD bridge and standard Open vSwitch, respectively).]
ment a filter module able to direct packets between mSwitchports based on the <dst IP, protocol type,dst port>
3-tuple. User-level processes (i.e., the stacks) request the useof a certain 3-tuple, and if the request does not collide withprevious ones, the module inserts the 3-tuple entry, alongwith a hash of it and the switch port number that the processis connected to, into its routing table. On packet arrival, themodule ensures that packets are IPv4 and that their check-sum is correct. If so, it retrieves the packet’s 3-tuple, andmatches a hash of it with the entries in the table. If there is amatch, the packet is forwarded to the corresponding switchport, otherwise it is dropped. The module also includes addi-tional filtering to make sure that stacks cannot spoof packetsand consists of about 200 lines of code in all, excludingprotocol-related data structures.
To be able to analyze the performance of the moduleseparately from that of any user-level stack that may makeuse of it, we run a test that forwards packets through themodule and between two 10 Gb/s ports; this functionalityrepresents the same costs in mSwitch and the module thatwould be incured if an actual stack were connected to a port.Figure 12(b) shows throughput results. We see a high rate of5.8 Gb/s (12.1 Mp/s or 81% of line rate) for 64-byte packets,and full line rate for larger packet sizes, values that are onlyslightly lower than those from the learning bridge module.
To take this one step further, we built a simple user-level TCP stack library along with an HTTP server built on top of it. Using these, and relying on mSwitch's 3-tuple module for back-end support, we were able to achieve HTTP rates of up to 8 Gb/s for 16 KB and higher fetch sizes, and approximately 100 K requests per second for short requests.
6.3 Open vSwitch

In our final use case, we attempt to see whether it is simple to port existing software packages to mSwitch, and whether
[Figure 13 omitted: the Open vSwitch datapath kernel module, spanning device-specific receive, common receive, packet matching, common send, and device-specific send between two NICs.]

Figure 13. Overview of Open vSwitch's datapath.
doing so provides a performance boost. To this end we target Open vSwitch, modifying the code of the version in the 3.9.0 Linux kernel in order to adapt it to mSwitch; we use the term mSwitch-OVS to refer to this mSwitch module.
In the end we modified 476 lines of code (LoC). Changes to the original files amount to only 59 LoC, which means that following future updates of Open vSwitch should be easy. 201 LoC add netmap-mode support to "internal-dev", the interface type Open vSwitch exposes to the host network stack. The rest of the modifications comprise control functions that instruct the mSwitch instance (e.g., attaching an interface) based on Open vSwitch's control commands, glue code between Open vSwitch's packet processing and mSwitch's lookup routine, and the lookup routine itself.
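The glue code mentioned above can be pictured as an mSwitch-style lookup callback that hands each frame to Open vSwitch's flow matching and translates the resulting vport into an mSwitch port index. Everything below is an assumption for illustration: the callback shape, `ovs_flow_lookup`, `ms_port_from_vport`, and the toy flow table do not reproduce the real mSwitch or Open vSwitch APIs.

```c
#include <stdint.h>
#include <stddef.h>

#define MS_DROP 0xffu                 /* illustrative "drop" port index */

/* Stand-in for an Open vSwitch flow entry (assumed for this sketch). */
struct ovs_flow { uint16_t out_vport; };

/* Toy flow table standing in for OVS's matcher: a single static flow
 * that matches frames whose first byte is 1 and sends them to vport 7.
 * A real port would call into the datapath's flow-table lookup. */
static struct ovs_flow flow7 = { .out_vport = 7 };

static struct ovs_flow *
ovs_flow_lookup(const uint8_t *frame, size_t len)
{
    return (len > 0 && frame[0] == 1) ? &flow7 : NULL;
}

/* Hypothetical mapping from an OVS vport number to an mSwitch port. */
static uint8_t
ms_port_from_vport(uint16_t vport)
{
    return (uint8_t)(vport & 0x3f);   /* toy 1:1 mapping */
}

/* The glue: an mSwitch-style lookup routine that returns, for each
 * frame, the destination switch port (or a drop indication). */
static uint8_t
msovs_lookup(const uint8_t *frame, size_t len)
{
    struct ovs_flow *f = ovs_flow_lookup(frame, len);
    if (f == NULL)
        return MS_DROP;               /* no flow: drop (or upcall) */
    return ms_port_from_vport(f->out_vport);
}
```

The design point this sketch tries to capture is that mSwitch keeps the fast-path fabric (rings, batching, copies) and delegates only the forwarding decision to the module, so the OVS port needs to supply little more than this decision function.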
As a short background, Figure 13 illustrates Open vSwitch's datapath. The datapath typically runs in the kernel, and includes one or more NICs or virtual interfaces as ports; it is, at this high level, similar to mSwitch's datapath (recall Figure 2). The datapath takes care of receiving packets on
Learning bridge
Page 27
Open vSwitch acceleration

• mSwitch-OVS: mSwitch with Open vSwitch's packet processing
Open vSwitch
Page 28
Conclusion
• Our contribution
– VALE (mSwitch): fast, modular software switch
– Very fast packet forwarding on bare metal
• > 200 Gbps between virtual ports (with 1500-byte packets and 3 CPU cores)
• Almost line rate using 1 CPU core and two 10 Gbps NICs
– Useful to implement various systems
• Very fast learning bridge
• Accelerates Open vSwitch by up to 2.6 times
– Small modifications, preserving the control interface
• Fast protocol multiplexer/demultiplexer for user-space protocol stacks
• Code (Linux, FreeBSD) is available at: – http://info.iet.unipi.it/~luigi/netmap/
Page 29