
White Paper

Qosmos* Deep Packet Inspection Characterization

Executive Summary

Packet classification is an essential part of most (if not all) network functions. It is the process of associating packets with identifiers by analyzing the layers of the protocol stack up to, but not including, the application layer (see Figure 1). Actions taken by a VNF are based on these identifiers. Since different applications can use the same protocol parameters from the perspective of the transport and network layer (i.e., same IP addresses, protocol IDs and ports), it is impossible to reliably distinguish between them with this type of packet classification.

In contrast, Deep Packet Inspection (DPI) is concerned with analyzing not only the headers up to the application layer, but also the application layer itself. The classification result is at a higher level of detail. It allows the implementation of detailed network analytics and fine-grained policies. Service providers are interested in this since it allows them to improve network performance and implement services to improve end-user quality of experience. More specifically, bandwidth costs can be reduced, and fine-grained congestion control can be applied. DPI is becoming increasingly relevant in NFV due to its applicability in many well-known network functions.

Figure 1. Layers of the protocol stack.


Table of Contents

Executive Summary
1 Performance Characterization Approach
2 Use-Case Details
  2.1 Background on DPI Workload
  2.2 Intuition on DPI Throughput Performance
  2.3 System Setup
  2.4 Traffic Profiles
    2.4.1 Traffic Profile 1
    2.4.2 Traffic Profile 2
    2.4.3 Traffic Profile 3
3 Test Results
  3.1 Results for Networks Characterized by Traffic Profile 1
  3.2 Results for Networks Characterized by Traffic Profile 2
  3.3 Results for Networks Characterized by Traffic Profile 3
4 Appendix: Hardware and Software Details

This report characterizes a VNF that performs DPI. The DPI capability is provided by Qosmos* through IxEngine.* Qosmos provides a broad product portfolio: a DPI engine as an SDK, DPI as a VNFc (VNF component), an L7 Classifier for Open vSwitch* and a Service classifier. This report focuses on the DPI engine as an SDK solution meant for integration, due to its broad applicability. For this, a prototype VNF was developed in which the SDK was used. The prototype analyzes packets (DPI path) while at the same time forwarding them through the network (data path). The VNF does not take any action on the classification result, but still performs all the necessary steps for the classification. One way the classification result could be used is in a VNF that implements traffic shaping: the classification result could determine which of the queues within the traffic-shaping algorithm will be used to buffer a given packet. A use-case for this is a VNF that prioritizes HTTP traffic over bittorrent.

The results in this report show DPI performance as a function of the number of cores assigned to DPI tasks. The bars refer to the highest link utilization for which the DPI cores were handling at least 99.999% of all packets. Figure 2 shows the DPI performance for a VNF with 1, 2, 3, 4 and 7 DPI cores (2, 4, 6, 8 and 14 hyper-threads respectively) on an Intel® Xeon® processor E5-2690 v3 with 8 x 10GbE ports for one specific traffic profile. The fact that the maximum link utilization is below 100% has to do with the constraints of the traffic parameters as detailed in Section 1. For other results under different network characteristics, refer to Section 3.¹


Figure 2. DPI classification performance for a VNF configured with 1, 2, 3, 4 and 7 DPI cores. The number of concurrent connections in steady state is set to 400K.²

1. All tests in this paper were conducted by Intel. Hardware configurations for all tests are detailed in the Appendix.
2. See Section 3.1 for full configuration details for these results.


1 Performance Characterization Approach

There are two approaches to characterizing a VNF by measuring its performance. The first is to place the VNF in a real production network; the second is to simulate the network around the VNF through the use of traffic generators. The second approach is preferred due to the control it offers and its low cost, and it is the one used in this report. The information in this section serves only as background: it is not required to interpret the results, but it may help in understanding the details.

Characterization is typically done by applying stateless load to a system under test (SUT) running the VNF workload. The constructed traffic contains packets based on a fixed set of template packets with randomized bit patterns or a range of values written at predetermined offsets. The transmission rate is fixed, possibly below line rate (e.g., at 85% of line rate), by repeatedly inserting periods of silence on the wire. A characterization report then details the rates at which the VNF is able to perform its functions for the provided traffic. Other characterization details, like latency, could also be reported.

DPI inspects packets in the context of flows. In many cases, it needs to inspect multiple packets before the classification is available. Clearly, repeatedly loading the SUT that runs the DPI workload with packets that do not carry any state (referred to as packet blasting) is not sufficient to characterize performance. Instead, the SUT needs to be loaded with traffic consisting of flows (referred to as flow blasting). The important parameters for a given traffic profile are the maximum setup rate (reached mainly during the initial ramp-up phase), the total number of concurrent connections and the total bandwidth. The traffic profile itself also influences the resource requirements for DPI.

The following sequence of packets demonstrates which packets need to be exchanged by the flow blaster to simulate a host visiting an HTTP webserver. A DNS request and a DNS reply are exchanged first. This is followed by a three-way TCP handshake to set up a connection to the HTTP server. The HTTP get-message is then sent by the initiator, and the HTTP reply containing the webpage is returned by the webserver. Finally, the TCP connection is torn down. The payload, speed and number of packets are determined by the emulated applications and the network characteristics.

Figure 3. Example of packets exchanged during DNS (left) and HTTP (right). Time flows from top to bottom. Slope shows network latency.

For most networking applications, like the example above, there tends to be more downstream traffic than upstream traffic (peer-to-peer being an exception). It is also for this reason that service providers support asymmetric SLAs (higher downstream bitrates, lower upstream bitrates). Because the load generated by the flow blasters is based on real traffic, it is asymmetric as well, so it is expected that the link capacities are not reached in most cases.

Practically, it is too time-consuming to measure VNF performance in all possible circumstances. Furthermore, measuring all possible data points would also result in many data points with little use. Therefore, the approach is to extract network applications and their characteristics (for example, the bitrates used by those applications) and the behavior of users (for example, the sequence of applications used) from a packet capture. This information then forms the templates to simulate users at runtime, taking into account the constraints of the configuration. Since the SUT influences the dynamics of the flows (i.e., it adds delay and possibly drops packets, causing TCP retransmits or other timeouts), simply replaying the pcap is not enough.

A traffic profile refers to all the characteristics of the traffic observed on the network. It defines the steady-state setup rate and the distribution of flows (both their payload and their bitrates). Furthermore, it also defines the maximum number of concurrent connections. The maximum setup rate is an additional parameter that is relevant during the simulation: it determines the duration of the ramp-up phase. After the traffic profile has been extracted from a pcap, the parameters are altered by the traffic generator to cover a broader range of networks for a more complete characterization of the SUT.
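The DNS-plus-HTTP exchange of Figure 3 can be sketched as an ordered event list that a stateful flow blaster replays (a minimal illustration; the function name and the simplified teardown are invented, not part of the actual test tools):

```python
# Hypothetical sketch of the packet sequence a flow blaster emits to
# simulate one host visiting an HTTP webserver (names are invented).

def http_visit_events():
    """Return the ordered (direction, message) events from Figure 3."""
    return [
        ("client->dns", "DNS query"),
        ("dns->client", "DNS reply"),
        ("client->server", "TCP SYN"),           # three-way handshake
        ("server->client", "TCP SYN/ACK"),
        ("client->server", "TCP ACK"),
        ("client->server", "HTTP GET"),
        ("server->client", "HTTP reply (page)"),
        ("client->server", "TCP FIN"),           # simplified teardown
        ("server->client", "TCP FIN/ACK"),
        ("client->server", "TCP ACK"),
    ]

events = http_visit_events()
```

Unlike a pcap replay, each event is generated at runtime, so timing and retransmits react to the SUT's behavior.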


Due to the interdependencies of all the parameters, the approach is to first numerically find the solution space boundaries for the maximum setup rate, the maximum number of concurrent connections and the per-connection bitrate. The first two parameters can be chosen arbitrarily. The last parameter is specified indirectly: a transformation function takes as input the distribution of connection bitrates and a scaling factor. The function scales all connection bitrates by the scaling factor unless a connection's bitrate is limited by the link speed. This means that the bitrate ratios are maintained whenever possible. As a consequence, the link utilization is increased. The link utilizations reported in this document are measured values and are the result of applying the transformation described above.
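A minimal sketch of the transformation function described above (the function name and example values are illustrative, not the actual generator code): each connection bitrate is multiplied by the scaling factor and capped at the link speed, so ratios are preserved whenever the cap does not apply.

```python
def scale_bitrates(bitrates_bps, factor, link_speed_bps):
    """Scale each connection bitrate by `factor`, capped at the link
    speed. Ratios between connections are preserved whenever the cap
    does not kick in; capped connections are limited by the link."""
    return [min(b * factor, link_speed_bps) for b in bitrates_bps]

# Example: scaling by 4 on a 10 Gbps link caps the 3 Gbps connection.
scaled = scale_bitrates([1e9, 3e9, 100e6], 4, 10e9)
```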

Figure 4 shows how the downstream bandwidth evolves over time for a maximum setup rate of 150K connections per second and the number of concurrent connections set to 800K. The first phase, P1, is the phase during which the number of concurrent connections for the test has not yet been reached. This phase will take at least "number of concurrent connections" divided by "maximum setup rate" seconds. It will take longer in most cases due to connections being terminated during the phase itself. The second phase, P2, is a 10-second period (determined experimentally) during which the bandwidth peaks for the following two reasons. First, the higher the setup rate, the more synchronized the connections are. This means that a higher setup rate will create a higher bandwidth peak. Second, on average, there is an increase in per-connection bitrate after the setup has completed, since the setup is an exchange of a few small packets without any real data, i.e., the three-way TCP handshake. The actual data transfer follows the setup (see the HTTP example from Figure 3). After P2, the bandwidth is in a steady state. This phase, P3, is the measurement phase for which the downstream bandwidth is reported in the results. It lasts for 120 seconds. The peak in P2 is the reason why the link utilization during steady state is lower than during the peak. The number of concurrent connections is shown in Figure 5. Note that the data shown in Figure 4 and Figure 5 is measured data.
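The lower bound on the duration of P1 follows directly from the two parameters; a quick check with the values used above:

```python
def min_ramp_up_seconds(concurrent_connections, max_setup_rate):
    """Lower bound on the P1 duration: connections cannot appear faster
    than the maximum setup rate, so reaching the target takes at least
    this long. Terminated connections must be replaced during P1, so
    the real duration is usually longer."""
    return concurrent_connections / max_setup_rate

# 800K concurrent connections at a 150K connections/s maximum setup rate:
p1_lower_bound = min_ramp_up_seconds(800_000, 150_000)  # ~5.33 seconds
```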

In summary, since DPI operates at the 4th layer and beyond in the protocol stack, the SUT must be loaded with stateful packet streams based on a traffic profile extracted from real networks. The three parameters being changed for a traffic profile are the maximum setup rate, maximum concurrent connections and the link utilization. This report details the number of cores on an Intel® Xeon® processor E5-2690 v3 required to handle at least 99.999% of packets during the steady state.

Figure 4. Bandwidth as a function of time during the test.

Figure 5. Number of concurrent connections as a function of time during the test.


2 Use-Case Details

2.1 Background on DPI Workload

Depending on the protocol and network parameters, classification is possible only after analyzing multiple packets. For example, an HTTP request may be segmented at the TCP layer due to the maximum segment size (MSS) used during the TCP session. The DPI engine will need to inspect the payload of multiple packets before it can detect which web page is being transferred. For most applications, the classification will be finished before the tenth packet of the stream. Note that in the HTTP example, the DPI engine will not need to see all the packets, since not all packets contain payload. Examples of this are the packets exchanged during the three-way handshake of a TCP connection and empty ACK packets (see Figure 3).
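The effect of segmentation can be illustrated with a toy model (invented names and byte counts; not the IxEngine API): payload accumulates across segments, empty packets add nothing, and the verdict only becomes available once enough application-layer bytes have been seen.

```python
def classify_after_segments(segments, needed_bytes):
    """Count how many packets must be seen before `needed_bytes` of
    application payload have accumulated. Empty packets (handshake,
    bare ACKs) are counted as seen but contribute no payload."""
    seen = 0
    collected = 0
    for payload in segments:
        seen += 1
        collected += len(payload)
        if collected >= needed_bytes:
            return seen
    return None  # not enough payload in this flow to classify

# SYN/SYN-ACK/ACK carry no payload; the GET is split across two segments.
packets = [b"", b"", b"", b"GET /index.ht", b"ml HTTP/1.1\r\n"]
result = classify_after_segments(packets, 20)
```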

The DPI functionality is provided through two components: an external flow table and the information extraction engine (IxEngine). The flow table associates packets with flows by using the five-tuple from the packet as a key into a hash table. It is also here that the offset to the fifth layer (see Figure 1) is determined. Together with some per-flow management data, the payload forms the input for the information extraction engine. The empty packets described above are not forwarded to the IxEngine. However, these packets still put pressure on the flow table, since they postpone the flow entry timeout.
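A toy version of this flow-table behavior (a hypothetical structure, not the Qosmos implementation) shows why even empty packets keep entries alive:

```python
class FlowTable:
    """Toy flow table: five-tuple -> expiry time. Every packet for a
    flow pushes the timeout out, even if it carries no payload."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.entries = {}  # five-tuple -> expiry timestamp

    def touch(self, five_tuple, now):
        # Create the entry on first sight; refresh expiry on every packet.
        self.entries[five_tuple] = now + self.timeout_s

    def expire(self, now):
        # Drop entries whose refreshed expiry has passed.
        self.entries = {k: t for k, t in self.entries.items() if t > now}

ft = FlowTable(timeout_s=30)
flow = ("10.0.0.1", "10.0.0.2", 6, 40000, 80)  # src, dst, proto, sport, dport
ft.touch(flow, now=0)    # data packet creates the entry
ft.touch(flow, now=25)   # an empty ACK still refreshes the timeout
ft.expire(now=40)        # entry survives: expiry was pushed out to 55
```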

2.2 Intuition on DPI Throughput Performance

To study the performance characterization of a DPI function, it is necessary to understand the amount of resources required to classify different types of traffic profiles. Resources in this case refers to CPU and memory. The most resource-intensive traffic to classify consists of many short-lived flows. The intuition behind this is that the DPI engine needs to inspect a larger fraction of the data within a short-lived flow before it has finished classification. In addition, the result is only useful for a relatively short duration of time. Conversely, the resources required for DPI per unit of data for long-lived flows are lower. These flows are sometimes referred to as mice flows and elephant flows, respectively. Note that traffic on a real network will be a combination of both short-lived and long-lived flows.
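This intuition can be made concrete with a rough cost model (the 2,000-byte classification threshold is an invented figure for illustration): if classification completes after roughly the first k bytes of a flow, the inspected fraction, and hence the DPI cost per byte, shrinks as flows get longer.

```python
def inspected_fraction(flow_bytes, inspect_first=2_000):
    """Fraction of a flow's bytes the DPI engine must inspect, assuming
    classification completes after the first `inspect_first` bytes
    (an invented figure, for illustration only)."""
    return min(inspect_first, flow_bytes) / flow_bytes

mouse = inspected_fraction(4_000)         # mouse flow: half is inspected
elephant = inspected_fraction(2_000_000)  # elephant flow: a tiny fraction
```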

2.3 System Setup

The test setup is shown by Figure 6. The client and server endpoints reside within the traffic generator systems; these are shown as clouds in the figure. Intel® Hyper-Threading Technology (Intel® HT Technology) provides two hardware threads per core. These hardware threads are referred to as logical cores. The box labeled "Mirror + Load dispatcher" and each box in the stack labeled "DPI" (three in the example shown) represent a logical core. Packets are forwarded between pairs of ports through which a set of servers and a set of clients are connected (data path). The packets are also mirrored and dispatched to the logical DPI cores (DPI path). The DPI throughput rate (in terms of offered Gbps and consumed Gbps) is measured at this point in the system. In total, there are four pairs of groups of servers and clients (shown in different colors in the figure). The system has been configured with four dual-port NICs based on the Intel® 82599 10 Gigabit Ethernet Controller, for a total of 8 ports. The number of logical DPI cores is always a multiple of two, since cores are allocated to the DPI workload in units of physical cores and Intel® Hyper-Threading Technology provides two threads per core. Only one of the two sockets available on the Intel® Server Board S2600WTT is used in this report.

The logical cores are interconnected through DPDK rings. The size of the rings connecting the load dispatcher and the DPI cores is set to 16384 to avoid packet loss due to the jitter inherently involved in the DPI workload (some packets take longer to process than others).


Figure 6. Test setup.


As noted in the appendix, the CPU used on the SUT is the Intel® Xeon® processor E5-2690 v3. To scale performance across multiple DPI cores, packets are dispatched based on the source and destination IP address and the source and destination port. This ensures that all packets belonging to the same flow are classified by the same logical DPI core. The net result of the load dispatchers is that all logical DPI cores receive approximately the same amount of traffic.
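A minimal sketch of such a dispatcher (the hash choice is illustrative, not the actual implementation): using an order-independent key makes the mapping symmetric, so upstream and downstream packets of one flow reach the same DPI core.

```python
def dispatch_core(src_ip, dst_ip, src_port, dst_port, n_cores):
    """Pick a DPI core for a packet. A frozenset key is unordered, so
    swapping source and destination yields the same hash; upstream and
    downstream packets of one flow therefore land on the same core."""
    key = frozenset([(src_ip, src_port), (dst_ip, dst_port)])
    return hash(key) % n_cores

# Both directions of the same flow map to the same core index.
up = dispatch_core("10.0.0.1", "93.184.216.34", 40000, 80, 6)
down = dispatch_core("93.184.216.34", "10.0.0.1", 80, 40000, 6)
```

With many flows, such a hash spreads the load approximately evenly over the DPI cores, which matches the behavior described above.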

There are two components running on the DPI cores. At packet reception, the 5-tuple (i.e., source/destination IP address, IP protocol and source/destination UDP/TCP port) is extracted and used to look up a flow in the flow manager. If there are missing packets within a stream, the packet is simply dropped without going through classification. The reason for this is that missing data would lead to a misclassification and to resources being wasted in the DPI engine. Note that determining whether payload is missing is not possible for UDP through the flow manager. After the packet has been handled by the flow manager, the payload together with the stream context is sent to the DPI engine, which performs the actual work. If this results in a successful classification, future packets from the same flow will not be sent to the IxEngine.
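Putting these steps together, the per-packet logic on a DPI core can be sketched as follows (hypothetical names; only the drop-on-gap and skip-after-classification rules from the text are modeled):

```python
def handle_packet(flow_state, pkt_seq, expected_seq, payload, classify):
    """Sketch of the DPI-core path: skip flows already classified,
    drop packets with missing preceding data, otherwise feed payload
    to the engine. Returns the action taken, for illustration."""
    if flow_state.get("classified"):
        return "forward-only"    # engine is no longer consulted
    if pkt_seq != expected_seq:
        return "dropped"         # missing data would cause misclassification
    label = classify(payload)    # stand-in for the IxEngine call
    if label is not None:
        flow_state["classified"] = label
    return "inspected"

state = {}
toy_classifier = lambda p: "http" if b"HTTP" in p else None
first = handle_packet(state, 0, 0, b"GET / HTTP/1.1", toy_classifier)
later = handle_packet(state, 1, 1, b"more data", toy_classifier)
```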

2.4 Traffic Profiles

As noted earlier in Section 1, due to the many parameters involved in traffic generation, the traffic parameters are extracted from a packet capture (pcap). These parameters are then used to recreate the traffic at run-time. Note that this is more complex than simply replaying the pcap. For the purposes of this document, three traffic profiles are considered.

2.4.1 Traffic Profile 1

The traffic consists of a mix of RTCP, bittorrent, SIP, RTSP, DNS, HTTPS, RTP, HTTP and other UDP flows. The cumulative distribution of upstream bitrates and downstream bitrates in units of bytes per second is shown by Figure 7. Figure 8 shows the distribution of flow payload and the distribution of the flow occurrences. Figure 9 shows the distribution of packet sizes.

Figure 7. Per stream bitrate (upstream and downstream).

Figure 8. Flow payload distribution (left) and flow occurrences distribution (right).


FLOW TYPE OCCURRENCES (%) PAYLOAD (%)

RTCP 0.14 ~0

bittorrent 3.27 22.07

SIP 14.19 0.14

RTSP 3.56 0.02

DNS 14.19 ~0

UDP (other) 0.14 ~0

SSL/HTTPS 11.91 1.34

RTP 13.59 1.65

HTTP 38.99 74.78

Table 1. Flow occurrences distribution and flow payload distributions.

Figure 9. Upstream and downstream packet size distribution.

PACKET SIZE UPSTREAM (%) DOWNSTREAM (%) OVERALL (%)

64 bytes 85.34 3.52 44.43

65-127 bytes 2.53 0.36 1.45

128-255 bytes 1.92 0.31 1.12

256-511 bytes 1.88 1.73 1.81

512-1023 bytes 2.69 1.46 2.08

1024+ bytes 5.64 92.60 49.12

Table 2. Upstream, downstream and overall packet size distributions.


2.4.2 Traffic Profile 2

The traffic consists of a mix of Bittorrent, HTTP, IMAP, POP3, SMTP, DNS, RTCP, RTP and SIP flows. The cumulative distribution of upstream bitrates and downstream bitrates in units of bytes per second is shown by Figure 10. Figure 11 shows the distribution of flow payload and the distribution of the flow occurrences. Figure 12 shows the distribution of packet sizes.

Figure 10. Per stream bitrate (upstream and downstream).

Figure 11. Flow payload distribution (left) and flow occurrences distribution (right).


FLOW TYPE OCCURRENCES (%) PAYLOAD (%)

Bittorrent 0.20 4.79

HTTP 6.30 70.03

IMAP 4.47 3.64

POP3 12.25 4.83

SMTP 15.07 8.46

DNS 60.25 0.17

RTCP 0.06 0.01

RTP 0.69 7.98

SIP 0.70 0.08

Table 3. Flow occurrences distribution and flow payload distributions.

Figure 12. Upstream and downstream packet size distribution.

PACKET SIZE UPSTREAM (%) DOWNSTREAM (%) OVERALL (%)

64 bytes 37.38 9.61 23.49

65-127 bytes 19.89 17.37 18.63

128-255 bytes 32.73 19.44 26.09

256-511 bytes 2.36 2.22 2.29

512-1023 bytes 0.00 0.66 0.33

1024+ bytes 7.64 50.71 29.17

Table 4. Upstream, downstream and overall packet size distributions.


2.4.3 Traffic Profile 3

The traffic consists of DNS, HTTP and HTTPS flows. The HTTP traffic can be categorized further into: Facebook,* Google,* Instagram,* iTunes,* Pandora,* YouTube,* and Other. The cumulative distribution of upstream bitrates and downstream bitrates in units of bytes per second is shown by Figure 13. Figure 14 shows the distribution of flow payload and the distribution of the flow occurrences. Figure 15 shows the distribution of packet sizes. This traffic profile represents wireless smartphone users from 2014.

Figure 13. Per stream bitrate (upstream and downstream).

Figure 14. Flow payload distribution (left) and flow occurrences distribution (right).


FLOW TYPE OCCURRENCES (%) PAYLOAD (%)

HTTP Other 4.19 28.01

HTTP Facebook 0.65 11.35

HTTP Google 23.10 5.74

HTTP Instagram 6.39 7.05

HTTP iTunes 4.85 10.22

HTTP Pandora 0.02 7.24

HTTP YouTube 0.36 12.56

HTTPS 8.22 17.64

DNS 52.23 0.19

Table 5. Flow occurrences distribution and flow payload distributions.

Figure 15. Upstream and downstream packet size distribution.

PACKET SIZE UPSTREAM (%) DOWNSTREAM (%) OVERALL (%)

64 bytes 81.87 3.94 42.91

65-127 bytes 6.17 2.24 4.20

128-255 bytes 1.37 0.21 0.79

256-511 bytes 1.10 0.98 1.04

512-1023 bytes 3.54 1.32 2.43

1024+ bytes 5.96 91.32 48.64

Table 6. Upstream, downstream and overall packet size distributions.


3 Test Results

The test results show the most resource-intensive configuration that a VNF with 1, 2, 3, 4 or 7 DPI cores (2, 4, 6, 8 or 14 hyper-threads) can sustain while the DPI cores handle at least 99.999% of the packets during the steady-state phase of the test (see P3 in Figure 4). For each measurement, the maximum (downstream) link utilization (i.e., the link utilization of four 10 Gbps ports) is shown. This is determined by the constraints between the test parameters (e.g., for traffic profile 1, 100K concurrent connections and a maximum setup rate of 4K, it is impossible to have a link utilization above 10%: increasing the link utilization further requires increasing the per-connection bandwidth, which shortens the flows and prevents the connections from staying active long enough to reach 100K concurrent connections). Only the downstream link utilization is shown since the link utilization in the upstream direction is lower.

3.1 Results for Networks Characterized by Traffic Profile 1

The results for a network simulated based on traffic profile 1 with 100K, 200K, 400K, 800K and 1M connections are shown in Figure 16 through Figure 20.

The results in Figure 16 can be interpreted as follows. For the traffic profile used during the test (see Section 2.4), one DPI core needs to be provisioned to handle the load, given that the maximum setup rate stays below 8K flows per second. If the flow setup rate is higher than 8K flows per second, one DPI core is only enough if the links are utilized to a lower degree (at 43.95% of the link capacity or, for a setup rate of 10K flows per second, 70.87% of the maximum possible link utilization). This equates to 17.58 Gbps (43.95% of the capacity of four 10 Gbps links) of downstream traffic. Similar interpretations can be made for the remaining figures.
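The quoted throughput follows directly from the utilization figure; as a quick check:

```python
# Quick check of the downstream throughput figure quoted above:
# 43.95% utilization of four 10 Gbps links.
n_links, link_gbps, utilization = 4, 10, 0.4395
downstream_gbps = n_links * link_gbps * utilization  # 17.58 Gbps
```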

As the number of concurrent connections increases in subsequent figures, more connections need to be tracked by the DPI cores. In addition, the maximum setup rate is maintained for a longer period of time. If the number of concurrent connections is equal to or higher than 400K, the maximum setup rate is more than 20K and the link utilization is higher than 43.66%, at least 4 cores are required to handle the load.

Another way to study the results is to compare link utilization while fixing the number of cores and the maximum setup rate. For example, with 1 core and a maximum setup rate of 40K, the link utilization is 42.55%, 15.97% and 11.59% for a network with 100K, 200K and 400K connections, respectively. For a network with more connections, the link utilization drops to 0%, which shows that 1 core is not capable of handling the load at zero loss as defined previously.

In Figure 19, error bars show the minimum and maximum measured zero-loss throughput for the configuration with 7 cores. The error bars for the other measurements have been omitted since the results were more stable.

Figure 16. Number of concurrent connections in steady state = 100K.


Figure 17. Number of concurrent connections in steady state = 200K.

Figure 18. Number of concurrent connections in steady state = 400K.


Figure 19. Number of concurrent connections in steady state = 800K.

Figure 20. Number of concurrent connections in steady state = 1M.


3.2 Results for Networks Characterized by Traffic Profile 2

This traffic profile is the most resource-intensive from a DPI point of view of the three traffic profiles considered in this document. At least 4 DPI cores are needed to sustain the maximum possible input rate for networks with the characteristics of traffic profile 2 and up to 100K connections, as shown by Figure 21. If rates around 40% of the maximum are acceptable for such networks, 3 DPI cores are sufficient. The remaining figures (Figure 22 through Figure 26) show how performance degrades for higher numbers of concurrent connections.

Figure 21. Number of concurrent connections in steady state = 100K.

Figure 22. Number of concurrent connections in steady state = 200K.


Figure 23. Number of concurrent connections in steady state = 400K.

Figure 24. Number of concurrent connections in steady state = 800K.


Figure 25. Number of concurrent connections in steady state = 1M.

Figure 26. Number of concurrent connections in steady state = 2M.


3.3 Results for Networks Characterized by Traffic Profile 3

The final traffic profile considered shows that, starting from two DPI cores, the maximum load can be sustained, as illustrated by Figure 27. The same is true for most setup rates when the total number of concurrent connections increases to 200K and 400K, as shown by Figure 28 and Figure 29. Starting from 800K concurrent connections, three DPI cores should be provisioned to handle the load under all circumstances. Provisioning more cores does not show any further benefit; these cores can be allocated to different tasks within the VNF.

Figure 27. Number of concurrent connections in steady state = 100K.

Figure 28. Number of concurrent connections in steady state = 200K.


Figure 29. Number of concurrent connections in steady state = 400K.

Figure 30. Number of concurrent connections in steady state = 800K.


Figure 31. Number of concurrent connections in steady state = 1M.

Figure 32. Number of concurrent connections in steady state = 2M.


4 Appendix: Hardware and Software Details

ITEM            DESCRIPTION                                         NOTES
Platform        Intel® Server Board S2600WT Family
Form factor     2U Rack Mountable
Processor(s)    Intel® Xeon® processor E5-2690 v3                   - 30720K L3 cache per CPU
                                                                    - 12 cores per CPU
                                                                    - 24 logical cores per CPU
                                                                    - 2.60 GHz
Memory          64 GB RAM (8x 8GB) per socket                       Quad channel DDR4-2133
BIOS            SE5C610.86B.01.01.0014.121820151719                 - Hyper-threading enabled
                                                                    - Hardware prefetching enabled
                                                                    - COD disabled
NICs            8x Intel® 82599 10 Gigabit Ethernet Controller      4x dual-port cards

Table 7. Hardware platform.
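As a worked example of what the memory configuration in Table 7 implies, the theoretical peak memory bandwidth per socket can be computed from the channel count and transfer rate. This back-of-the-envelope calculation is ours, not a figure from the paper, and assumes DDR4-2133 with the standard 64-bit (8-byte) channel width:

```python
# Theoretical peak memory bandwidth for quad-channel DDR4-2133.
# This estimate is ours; the white paper does not state this number.
channels = 4              # quad-channel configuration per socket
transfer_rate = 2133e6    # transfers per second (DDR4-2133)
bytes_per_transfer = 8    # 64-bit channel width

bandwidth_gb_s = channels * transfer_rate * bytes_per_transfer / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s per socket")  # ~68.3 GB/s
```

Actual achievable bandwidth is lower in practice; this bound is mainly useful as a sanity check when reasoning about DPI throughput ceilings.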

ITEM             DESCRIPTION                                                              NOTES
Host OS          Arch Linux*                                                              Kernel version: 4.1.6-1-ARCH
DPDK             IP stack acceleration                                                    Version 2.2.0
PROX             - SUT: mirroring function (to copy packets into the DPI path),           Version 0.28
                   load dispatcher, and flow manager implementation to deliver only
                   the packet payload to the DPI engine. Statistics collection.
                 - TS (2x): flow-based load generator simulating clients and servers.
                   Statistics collection. For the first traffic profile, only one
                   test system was used.
Qosmos IxEngine                                                                           IxEngine version 4.21, ProtoBundle 1.240

Table 8. Software components.

The test systems and the system under test are controlled from scripts provided as part of PROX in the helper/scripts/dpi directory. First, dpi1.py is run to collect the traffic profile boundaries. Then dpi2.py is run, which takes as input the traffic profile boundaries produced by dpi1.py.
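The two-phase measurement workflow driven by the PROX helper scripts can be sketched conceptually as follows. The function names, arguments, and return shapes here are hypothetical stand-ins illustrating the pattern of dpi1.py and dpi2.py; they are not the actual script interfaces:

```python
# Conceptual sketch of the two-phase DPI characterization workflow.
# phase1/phase2 are hypothetical stand-ins for the roles of dpi1.py
# and dpi2.py; they are NOT the real PROX script interfaces.

def phase1_find_boundaries(requested_rates):
    """Phase 1 (dpi1.py's role): probe each configuration to find the
    highest sustainable setup rate. Simulated here: assume each
    configuration sustains 80% of the requested rate."""
    return {cfg: int(rate * 0.8) for cfg, rate in requested_rates.items()}

def phase2_characterize(boundaries):
    """Phase 2 (dpi2.py's role): sweep within the boundaries found in
    phase 1 and record results (simulated as a simple listing here)."""
    return [(cfg, limit) for cfg, limit in sorted(boundaries.items())]

requested = {"profile2-100k": 1000, "profile2-200k": 900}
boundaries = phase1_find_boundaries(requested)
results = phase2_characterize(boundaries)
print(results)
```

The key point of the two-phase split is that the expensive characterization sweep in phase 2 only explores operating points that phase 1 has already shown to be reachable.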

Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Cost reduction scenarios described are intended as examples of how a given Intel- based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number.

© 2016 Intel Corporation. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others. 0916/DO/PDF 334874-001US