Leveraging SIMD Parallelism for Accelerating Network Applications

Hejing Li, KAIST ([email protected])
Juhyeng Han, KAIST ([email protected])
Dongsu Han, KAIST ([email protected])

APNet '20, August 3–4, 2020, Seoul, Republic of Korea. © 2020 Association for Computing Machinery. https://doi.org/10.1145/3411029.3411033

ABSTRACT
Software packet processing frameworks are critical components of modern network architectures, as their performance has a vital impact on the quality of network services. Motivated by the increasing number and capability of advanced vector instructions in recent mainstream CPUs, this paper explores a new parallel processing design and implementation of data structures and algorithms that are frequently used to build network applications. In particular, we propose effective SIMD optimization techniques for the bloom filter and the Open vSwitch megaflow cache. Our design reduces memory access latency via careful prefetching and a new design that meets the needs of fast data-consuming instructions. Our evaluation shows performance improvements of up to 162% for the bloom filter and 48% for Open vSwitch compared to their scalar versions.

CCS CONCEPTS
• Networks → Cloud computing; • Computing methodologies → Parallel computing methodologies;

1 INTRODUCTION
Hash tables are fast data structures for manipulating tables with a large number of entries. Due to their simplicity and efficiency, hash tables and their variants are fundamental components of several network applications, such as the tuple space search algorithm in virtual switches [22, 26] and hash-based pattern matching in IDS applications [11, 25].

In a high-speed network environment equipped with 10/40/100 GbE network cards, such software packet classification applications face over 14 million minimum-sized packets per second. Excessive hash function calls during the packet classification process place a heavy computational load on the CPU and have a significant impact on the application's performance.

Meanwhile, architectural advancements have been made in recent mainstream CPUs. Not only have core counts and clock frequencies grown, but CPUs have also adopted greater single-instruction-multiple-data (SIMD) capabilities, including wider SIMD registers and more advanced vector instructions [2, 3]. SIMD technology has been evolving since its early years and shows no signs of stopping. In Intel CPUs, for example, the first-generation SIMD technology (MMX) uses only 64-bit registers and provides only integer operations. Since then, SIMD instructions have been extended to SSE (128-bit), AVX/AVX2 (256-bit), and most recently AVX-512, which supports 512-bit register operations.
In addition to the extensions in register size, more practical features such as floating-point operations, non-contiguous memory loads (gather), and stores (scatter) have been added to the latest versions of the ISA [2, 3].

Effectively utilizing SIMD capabilities can drastically improve the performance of tasks running on CPUs. Various vectorized designs have been proposed to accelerate database operations [23, 24, 28, 29], multimedia processing [9], and other applications that exhibit data parallelism [19, 21]. However, converting scalar code with control flow into vectorized code is not trivial, because the data packed into a vector are executed with the same instruction regardless of which branches are taken for each element. Conditional branches must therefore be substituted by data flow, that is, a series of arithmetic operations with a specific operand for each data element in each branch. Moreover, parallel operations often come at the expense of increased memory access latency. Specifically, since a data object resides in a contiguous memory block, scalar code benefits from spatial locality, whereas loading data for parallel processing often involves several discontiguous memory accesses.


One cache miss among these accesses postpones the data being loaded into the vector register and thus degrades performance.

We argue that SIMD vectorization is more than a parallel algorithm: it also requires proper memory access latency hiding and a corresponding data structure conversion. The array-of-structures form of data used in scalar code forces vectorized code to perform discontiguous memory accesses when loading several data elements into a vector register. Instead, the structure-of-arrays form, which places the same field of every element in one array, is preferred, because an instruction that loads a contiguous memory block is much cheaper than one that loads from discontiguous memory addresses [3].
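To make the layout difference concrete, the following C sketch (with hypothetical structure and field names, not taken from the paper) contrasts the two forms: loading sixteen hash fields from the array-of-structures layout requires a gather over strided addresses, while the structure-of-arrays layout allows one contiguous 512-bit load.

```c
#include <immintrin.h>

#define BATCH 16

/* Array of structures: fields of one element are adjacent,
 * but the same field of different elements is strided. */
struct pkt_aos { unsigned int hash; unsigned int len; };

/* Structure of arrays: the same field of all elements is contiguous. */
struct pkt_soa {
    unsigned int hash[BATCH];
    unsigned int len[BATCH];
};

__m512i load_hashes_aos(const struct pkt_aos *batch)
{
    /* AoS: hash fields are 8 bytes apart, so a gather is needed
     * (indices are in units of 4 bytes: 0, 2, 4, ...). */
    __m512i idx = _mm512_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14,
                                    16, 18, 20, 22, 24, 26, 28, 30);
    return _mm512_i32gather_epi32(idx, batch, 4);
}

__m512i load_hashes_soa(const struct pkt_soa *batch)
{
    /* SoA: one contiguous 64-byte load fills the whole vector. */
    return _mm512_loadu_si512(batch->hash);
}
```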

In this paper we show two use cases of SIMD acceleration: a vectorized bloom filter [24] and the Open vSwitch megaflow cache [22]. The algorithm of the existing vectorized bloom filter [24] is designed for a 256-bit ISA, which can process eight 32-bit keys. To assign an operand to each lane based on the boolean result of a conditional branch, it refers to a table with 2^8 entries, where each entry contains a set of operands for the vector. The table grows exponentially to 2^16 entries under a naïve extension to the 512-bit ISA, which prevents its performance from scaling with the size of the registers used by the algorithm. To extend the algorithm to the 512-bit ISA without this penalty, we exploit the new instructions of the state-of-the-art SIMD technology, AVX-512, and adopt careful data prefetching and loop unrolling optimizations to hide memory access latency. Moreover, we propose a new data structure for the packet batch representation in the Open vSwitch megaflow cache to avoid discontiguous memory accesses.

Our evaluation shows that our bloom filter lookup throughput on synthesized inputs outperforms the existing vectorized bloom filter by up to 162%. For the end-to-end throughput of OvS with the microflow cache disabled, the vectorized Open vSwitch outperforms its scalar implementation by up to 48%.

In addition, we show that a SIMD-aware data structure is mandatory for Open vSwitch acceleration by comparing against a vectorized implementation that lacks the new data structure.

2 OPPORTUNITY AND CHALLENGES
To improve the performance of network applications, many previous studies explore the design space of parallelism relying on the massively parallel processing power of GPUs [13, 15, 17, 27]. Some research reports that other hardware accelerators, such as integrated GPUs (APUs), also help improve performance [12].

Although these approaches are effective, involving hardware accelerators incurs extra expense for devices and energy consumption.

In contrast, SIMD technology is widely adopted on mainstream CPUs. Recent Intel CPUs support a 512-bit instruction set that can pack sixteen 32-bit integers into one vector. Although the degree of SIMD parallelization is less aggressive than that of GPUs, it benefits in other respects. First, it is cost-effective, because it does not incur the cost of buying and maintaining new hardware. Second, data is processed on the CPU, so there is no additional latency from transferring data to and from accelerators. Previous work reports that it takes ~15 µs to send one byte of data to and from a GPU [20], which is intolerable for some latency-critical applications [16]. Without the data transfer latency overhead, SIMD technology on the CPU is more attractive for parallelization.

In leveraging SIMD, two considerations must be made to adapt the original program to use vector instructions. We present the two and demonstrate them through two case studies in this paper.

Control flow to data flow conversion: SIMD vectorization requires an algorithmic change to the original program, which differs from vectorization on GPUs. In GPU programming, code called a kernel is executed on a streaming multiprocessor. It follows the SIMT (single instruction, multiple threads) computation model, in which the cores of a streaming multiprocessor share the same program counter and execute the same instruction. Control divergence is managed by the hardware, which turns threads on and off. Thus, kernel code can be written in the same way as scalar code executed on a CPU.

On the contrary, a SIMD program must handle the control divergence of each data element in a vector within the program itself. A conditional branch becomes a comparison operation that generates a bitmask, whose set bits indicate the lanes taking the branch. The operations of each branch are then applied selectively to the lanes selected by the bitmask; lanes in the same vector that do not require the operation waste vector computation. Converting control flow to data flow while maintaining high utilization of the vector is not straightforward, as the sketch below illustrates.
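As a minimal illustration of this conversion (our own sketch, not code from the paper), the scalar branch below becomes a comparison that produces a __mmask16, after which both branch bodies are computed and blended per lane:

```c
#include <immintrin.h>

/* Scalar form with a branch:
 *   for (i = 0; i < 16; i++)
 *       out[i] = (in[i] > t) ? in[i] - t : in[i] + t;
 */
void branch_as_dataflow(const int *in, int *out, int t)
{
    __m512i v  = _mm512_loadu_si512(in);
    __m512i vt = _mm512_set1_epi32(t);
    /* The comparison yields a per-lane bitmask instead of a jump. */
    __mmask16 gt = _mm512_cmpgt_epi32_mask(v, vt);
    /* Both branch bodies are computed for all lanes... */
    __m512i sub = _mm512_sub_epi32(v, vt);
    __m512i add = _mm512_add_epi32(v, vt);
    /* ...and the mask selects the right result per lane. */
    _mm512_storeu_si512(out, _mm512_mask_blend_epi32(gt, add, sub));
}
```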


Memory access latency hiding: A 512-bit SIMD instruction can consume sixteen 32-bit data elements at once, and feeding data quickly enough to prevent the instruction from stalling is crucial for high performance. Scalar code accesses one data structure at a time, and once that structure is loaded into the cache, it can take advantage of spatial locality. However, loading data onto a vector register accesses several addresses spread across an array of structures, and the instruction cannot retire until all the data is fetched from memory. The chance that a vector operation benefits from spatial locality is thus far lower than for scalar code. Because of this, naïvely converting the branches in the algorithm to data flow while keeping the existing array-of-structures data layout might slow the program down. Memory access latency hiding [16] therefore becomes much more important and must be considered along with the vectorization algorithm to effectively utilize SIMD instructions.

3 DESIGN AND IMPLEMENTATION

3.1 Bloom Filter
A bloom filter is a probabilistic data structure designed to test whether an element is a member of a set. Multiple hash functions are used to index each element. To insert an element into the set, each indexed bit is set to one in the bloom filter, a fixed-length bit array. To query an element for membership, the hash values of the key are computed and all the corresponding bits in the bloom filter are tested. If all the bits are set, the element is highly likely to belong to the set; if any tested bit is zero, the element is definitely not in the set. Its simplicity and space efficiency make the bloom filter and its variants widely used in network applications [7, 8] and database systems [11].

We start with a brief introduction to the scalar bloom filter algorithm and then describe our implementation details step by step. The hash function used in the bloom filter is MurmurHash3, well known as a fast and uniform algorithm; we input the key with different seed values to obtain several independent hash values from a single hash algorithm. The keys, in this case 32-bit integer values, are prepared as an input array. In the scalar implementation, one key is processed at a time, and the bits at the offsets determined by its hash values are tested in order. Once an unset bit is found, the key is discarded and the next key is processed. For keys whose lookup succeeds, the key and its payload are stored in the output array. A minimal scalar sketch follows.
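The scalar lookup loop matching this description might look as follows (names and signatures are ours; the paper does not give code):

```c
#include <stddef.h>
#include <stdint.h>

uint32_t murmur3_32(uint32_t key, uint32_t seed);  /* scalar hash */

/* Test each key against k filter bits; emit keys that pass all tests. */
size_t bf_lookup_scalar(const uint32_t *filter, uint32_t nbits,
                        const uint32_t *keys, size_t n,
                        const uint32_t *seeds, int k, uint32_t *out)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n; i++) {
        int pass = 1;
        for (int j = 0; j < k; j++) {
            uint32_t h = murmur3_32(keys[i], seeds[j]) % nbits;
            if (!(filter[h / 32] & (1u << (h % 32)))) {
                pass = 0;  /* one zero bit: definitely not in the set */
                break;
            }
        }
        if (pass)
            out[n_out++] = keys[i];  /* probably in the set */
    }
    return n_out;
}
```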

Vectorization Algorithm: To rewrite the bloom filter as a vectorized program, the control flow must be converted to data flow. Following the previous vectorized bloom filter [24], the data flow of our program can be divided into three parts: hashing, gather-and-test, and vector update. We describe each component below.

Sixteen keys are loaded into the 512-bit vector at the start of each iteration. Initially, each key's hash seed index points to the first seed value.

Figure 1: AVX-512 vectorized bloom filter overview

Figure 2: Expand load and compress store intrinsic function operations

The hash seed values are loaded according to the index values and fed to the vectorized MurmurHash3 function together with the keys vector. The vectorized hash function outputs sixteen 32-bit hash values in a 512-bit vector.
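A sketch of such a vectorized hash stage is shown below. It applies the standard MurmurHash3 32-bit mixing steps to sixteen 4-byte keys at once; the function name and the assumption of exactly one 32-bit key per lane are ours.

```c
#include <immintrin.h>

/* MurmurHash3 (32-bit, one 4-byte key per lane) for 16 lanes at once. */
__m512i murmur3_x16(__m512i keys, __m512i seeds)
{
    const __m512i c1 = _mm512_set1_epi32((int)0xcc9e2d51);
    const __m512i c2 = _mm512_set1_epi32((int)0x1b873593);

    __m512i k = _mm512_mullo_epi32(keys, c1);
    k = _mm512_rol_epi32(k, 15);
    k = _mm512_mullo_epi32(k, c2);

    __m512i h = _mm512_xor_si512(seeds, k);
    h = _mm512_rol_epi32(h, 13);
    h = _mm512_add_epi32(_mm512_mullo_epi32(h, _mm512_set1_epi32(5)),
                         _mm512_set1_epi32((int)0xe6546b64));

    /* Finalization: h ^= len (4 bytes), then the fmix32 avalanche. */
    h = _mm512_xor_si512(h, _mm512_set1_epi32(4));
    h = _mm512_xor_si512(h, _mm512_srli_epi32(h, 16));
    h = _mm512_mullo_epi32(h, _mm512_set1_epi32((int)0x85ebca6b));
    h = _mm512_xor_si512(h, _mm512_srli_epi32(h, 13));
    h = _mm512_mullo_epi32(h, _mm512_set1_epi32((int)0xc2b2ae35));
    return _mm512_xor_si512(h, _mm512_srli_epi32(h, 16));
}
```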

The 32-bit filter words containing the bits to be tested are then gathered from memory; the offset of each word is its hash value divided by 32. The bit is tested for each key by an AND operation with a mask vector in which only the tested bit of each lane is set. The test returns a 16-bit bitmask indicating the lanes of the failed keys. In the last stage, the hash seed index vector is updated to point to each lane's next seed value. Keys that have already reached the last seed value have had all their bits tested with none found zero; these keys are regarded as passed and are stored in the output array.
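The gather-and-test stage could be sketched as follows (our illustration; `filter` is assumed to be the bloom filter viewed as an array of 32-bit words):

```c
#include <immintrin.h>

/* Gather the filter words addressed by 16 hash values and test one
 * bit per lane; the returned mask has a bit set for each FAILED key,
 * i.e., a key whose tested filter bit is zero. */
__mmask16 gather_and_test(const unsigned int *filter, __m512i hash)
{
    __m512i word_idx = _mm512_srli_epi32(hash, 5);                    /* hash / 32 */
    __m512i bit_pos  = _mm512_and_si512(hash, _mm512_set1_epi32(31)); /* hash % 32 */
    __m512i words    = _mm512_i32gather_epi32(word_idx, filter, 4);
    __m512i bit_mask = _mm512_sllv_epi32(_mm512_set1_epi32(1), bit_pos);
    return _mm512_testn_epi32_mask(words, bit_mask);
}
```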

In the next iteration, the keys that failed or passed all the hash functions are removed from the keys vector, and new keys are loaded into those lanes. Prior work [24] permutes the keys vector to separate the keys that should remain from those that should be removed: a set of permutation indices gives each lane's new position after the permutation, and these sets are stored in a table accessed with the 8-bit test result as the offset. We instead use the latest AVX-512 instructions to remove the permutation step. The key instruction is expand-load, shown in Figure 2, which loads contiguous values from memory into the specific vector lanes indicated by a bitmask.
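A sketch of this refill step using expand-load is shown below; the surrounding bookkeeping (names, the `done` mask convention) is our assumption:

```c
#include <immintrin.h>
#include <stddef.h>

/* Refill aborted lanes with fresh keys. `done` marks lanes whose keys
 * failed a test or passed all tests; expand-load pulls the next
 * popcount(done) keys from the input array into exactly those lanes,
 * so no permutation table is needed. */
__m512i refill_keys(__m512i keys, __mmask16 done,
                    const unsigned int *input, size_t *pos)
{
    keys = _mm512_mask_expandloadu_epi32(keys, done, input + *pos);
    *pos += (size_t)_mm_popcnt_u32((unsigned int)done);
    return keys;
}
```

The symmetric compress-store intrinsic (_mm512_mask_compressstoreu_epi32) can likewise write only the passing lanes contiguously into the output array.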


Figure 3: Multi-way loop unrolling

Figure 1 shows a possible situation in the second iteration. The aborted-keys bitmask gives the positions of the keys that failed, or succeeded in all tests, during the previous iteration; those lanes are overwritten with new keys from the input array. The loop iterates until fewer keys remain than were aborted in the previous iteration, and the remaining keys are processed serially.

Vector Loop Unrolling: Modern processors feature advanced out-of-order execution to avoid pipeline stalls. Loop unrolling increases the number of independent instructions in the loop body, which increases the opportunity to benefit from this feature and improves performance. The previous work [24] implemented loop unrolling with a maximum factor of 2, processing two vectors of keys in the loop body. We extend the prior work [24] by introducing a multi-way loop unrolling implementation.

It divides the input keys array into a number of partitions equal to the unrolling factor U. In the loop body, U vectors load input keys starting from the beginning of their respective partitions, as shown in Figure 3. In each subsequent iteration, each vector loads new keys from its own partition. The for-loop ends when one of the vectors reaches the end of its partition, and the rest of the keys in the input array are handled serially. In this way, the unrolling factor can be adjusted to any number to obtain the best performance.

Bloom Filter Prefetch: Because the filter-word gather is a random memory access, it causes frequent cache misses and significant performance degradation once the bloom filter size exceeds the cache sizes. Having several vectors in flight in the loop body gives a further chance for acceleration through memory access latency hiding: we prefetch the filter words for the first hash value vector, then compute the hash values of the second keys vector and prefetch its filter words, and so on. When all the vectors in the loop body have finished hash computation and prefetching, the loop body proceeds to the remaining gather-and-test procedure. Interleaving hash computation and memory accesses across several in-flight vectors improves lookup performance for large bloom filters.
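Putting the pieces together, one loop-body iteration with unrolling factor U and software prefetching might look like the following sketch, reusing the murmur3_x16 and gather_and_test helpers sketched above (the two-phase split is the point; the exact bookkeeping is elided):

```c
#include <immintrin.h>

#define U 4  /* unrolling factor; tuned empirically */

__m512i   murmur3_x16(__m512i keys, __m512i seeds);
__mmask16 gather_and_test(const unsigned int *filter, __m512i hash);

/* One unrolled loop-body iteration: phase 1 hashes each of the U
 * vectors and prefetches its filter words; phase 2 runs the gathers
 * and tests once the prefetches have had time to complete. */
void bloom_loop_body(const unsigned int *filter,
                     __m512i keys[U], __m512i seeds[U])
{
    __m512i hash[U];

    for (int u = 0; u < U; u++) {               /* phase 1 */
        hash[u] = murmur3_x16(keys[u], seeds[u]);
        unsigned int idx[16];
        _mm512_storeu_si512(idx, _mm512_srli_epi32(hash[u], 5));
        for (int i = 0; i < 16; i++)
            _mm_prefetch((const char *)&filter[idx[i]], _MM_HINT_T0);
    }

    for (int u = 0; u < U; u++) {               /* phase 2 */
        __mmask16 failed = gather_and_test(filter, hash[u]);
        (void)failed;  /* seed-index update and lane refill elided */
    }
}
```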

Functions                       CPU%
netdev_flow_key_hash_in_mask    48.67%
cmap_find_batch                 18.48%
dpcls_rule_matches_key           6.54%
total                           73.69%

Table 1: OvS profiling data shows that megaflow cache lookup is a bottleneck. Tested with the average number of traversed subtables per packet set to 30, where each subtable contains 2000 rules.

3.2 Open vSwitch
The performance of OvS packet classification is critical to the quality of network services. The megaflow cache, a flow cache module in the packet classification pipeline, must handle most of the packets when it faces a large number of flows. The profiling data from stress-testing the megaflow cache in Table 1 shows that the major bottleneck in megaflow cache lookup is computing the hash values for the packets. Our goal is to vectorize this with SIMD instructions.

Existing Data Structure and Algorithm: The OvS megaflow cache adopts the tuple space search algorithm to perform packet classification. A tuple, or subtable, is a collection of classification rules that match the same combination of fields with the same wildcard. For each subtable, the wildcard mask is applied to all packets remaining in the batch: only the fields the subtable cares about are ANDed with the mask value and composed into the hash key for computing the signature.

The data structure used in this algorithm is the miniflow, a compact representation of the packet header information. OvS defines a huge flow data structure that enumerates all the header fields of every protocol OvS supports; because of its wide coverage, it is fairly large, and representing a packet header with it is sparse. The miniflow was proposed to minimize the packet batch memory footprint and the number of accessed cache lines. It has a bitmap in which each bit represents one uint64_t value in struct flow: a zero bit indicates that the corresponding 64-bit value in struct flow is zero, while a one bit indicates a possibly nonzero value. The nonzero 64-bit values follow the bitmap, as the sketch below illustrates.
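A simplified sketch of this bitmap-plus-values scheme is given below; the real OvS struct miniflow uses a multi-word flowmap rather than a single 64-bit map, so this illustrates the indexing rule, not OvS's actual definition:

```c
#include <stdint.h>

/* Simplified miniflow: bit i of `map` says whether 64-bit unit i of
 * struct flow is nonzero; nonzero units are packed into `values`. */
struct mini {
    uint64_t map;
    uint64_t values[8];
};

/* Value of flow unit `pos`: absent bits read as zero; present values
 * are indexed by the number of 1-bits below `pos`. */
uint64_t mini_get(const struct mini *mf, unsigned int pos)
{
    uint64_t bit = UINT64_C(1) << pos;
    if (!(mf->map & bit))
        return 0;
    return mf->values[__builtin_popcountll(mf->map & (bit - 1))];
}
```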

In addition to the packets, a subtable mask is also represented as a miniflow, whose values are masks with 1-bits set on the bits to match. The subtable lookup over the miniflows of the packets and the subtable mask proceeds as follows. The header fields of the packet indicated by the subtable miniflow bitmap are extracted. Starting from the rightmost bit of the mask bitmap, the bit of the packet bitmap at the same position is examined. If the bit in the packet bitmap is zero, the extracted value is zero.


Figure 4: Existing and new data structure memory layout (existing vs. transposed packet batch layout)

Otherwise, if the bit in the packet bitmap is one, the packet value is retrieved from its value array; the value's index equals the number of 1-bits to the right of the current mask bit. The subtable mask values are applied to the extracted packet values to wipe out the don't-care bits, and the result is fed to the hash function.

Limitation of a Naïve SIMD Implementation: Naïvely converting the control flow to data flow with the existing data structure and manipulation algorithm, so as to process several packets in one vector, degrades performance in practice, even though the operations that make up most of this process, such as arithmetic operations (plus and minus) and bitwise operations (AND), can be parallelized with SIMD vector operations.

The fundamental problem is the array-of-structures form of the packet batch data structure, shown in Figure 4. Loading the data of multiple packets requires accessing multiple discontiguous memory addresses, and the probability that all of these accesses hit the cache is far lower than that of a single access. As a result, the overhead of loading data into the SIMD vector offsets the benefit of vectorized hash computation.

To remove the memory access overhead in the vectorized program, we propose a new data structure, the transposed form of the existing one, together with a corresponding vectorized algorithm.

Transposed Data Structure: To retrieve packet data from one contiguous memory block, the structure-of-arrays form is preferred. We therefore propose the transposed miniflow data structure shown in Figure 4. The first 64-bit values of the packets in the batch (at most 32 packets) are arranged into two arrays: we divide each 64-bit value into a higher 32-bit half and a lower 32-bit half, pack all the lower halves of the 32 packets into one array, and pack all the higher halves into the other. The reason for this split is that the hash function internally processes a 64-bit value by hashing the lower 32 bits first and then the higher 32 bits. By separating the two halves into different arrays, we can directly load the lower or higher 32-bit values instead of loading whole 64-bit values and performing extra vector operations to separate them, as sketched below.
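A sketch of the transposed layout and of how one 64-bit field of sixteen packets is consumed is shown below (structure and helper names are ours; hash_add_x16 stands in for one vectorized hash step):

```c
#include <immintrin.h>

#define NETDEV_MAX_BURST 32  /* OvS batch size */

/* Transposed batch for one 64-bit miniflow value: the low and high
 * 32-bit halves of all 32 packets each form a contiguous array. */
struct transposed_value {
    unsigned int lo[NETDEV_MAX_BURST];
    unsigned int hi[NETDEV_MAX_BURST];
};

__m512i hash_add_x16(__m512i hashes, __m512i data);  /* one hash step */

/* Hash this field for packets [base, base + 16): two contiguous
 * 64-byte loads replace a 64-bit gather plus lane splitting. */
__m512i hash_value_x16(const struct transposed_value *t, int base,
                       __m512i hashes)
{
    __m512i lo = _mm512_loadu_si512(&t->lo[base]);
    __m512i hi = _mm512_loadu_si512(&t->hi[base]);
    hashes = hash_add_x16(hashes, lo);  /* low halves hashed first */
    return   hash_add_x16(hashes, hi);  /* then the high halves */
}
```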

The transposed miniflow is generated in addition to the normal miniflow: when the packet header is parsed and a value is stored in the miniflow, we copy the value into the transposed miniflow structure.

Vectorized Algorithm: With the transposed data structure, loading the packet values does not touch multiple memory addresses; it loads from one contiguous block. A 512-bit vector containing sixteen 32-bit values is passed to the vectorized hash function, and 16 hash values are produced per iteration. SIMD instructions are effective when the vector is full of data, but an individual vector instruction has lower per-cycle throughput than its scalar counterpart, so processing only a few data elements in a vector register costs more time than serial processing. Thus, we set a threshold for vector processing and execute the SIMD code only when the number of packets remaining in the batch is larger than the threshold. A micro-benchmark determined the threshold, which is 8 in our case.
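The dispatch logic implied by this threshold can be sketched as follows (function and structure names are illustrative; the threshold of 8 is the value from the micro-benchmark):

```c
#define SIMD_THRESHOLD 8  /* from the micro-benchmark above */

struct pkt_batch { int n_left; /* ... packet data ... */ };

void avx512_hash_batch(struct pkt_batch *b);         /* 16 lanes/iter */
void scalar_hash_packet(struct pkt_batch *b, int i); /* CRC path */

void compute_signatures(struct pkt_batch *b)
{
    if (b->n_left > SIMD_THRESHOLD) {
        avx512_hash_batch(b);
    } else {
        for (int i = 0; i < b->n_left; i++)
            scalar_hash_packet(b, i);
    }
}
```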

In the vectorized algorithm, we assume that the network traffic has a majority packet type, such as TCP, UDP, or a tunneling protocol. Packets of the same protocol have the same miniflow bitmap, which ensures that the values in each array of the transposed miniflow belong to the same header field. For exceptional packets whose miniflow bitmap differs from the majority, different field values are stored in the array. Since an array in the transposed miniflow structure is assumed to hold values of the same field when computing the signature, an exceptional packet gets a wrong signature value. This results in a megaflow cache miss, and the packet is passed to the costly OpenFlow table lookup, which is undesirable. To prevent exceptional packets from missing the megaflow cache, we mark them in the batch before the megaflow cache lookup. For each subtable, when the number of packets is larger than the threshold, we first process the batch with the AVX-512 vectorized code and then do extra serial computation only for the exceptional packets.

4 EVALUATION

4.1 Bloom Filter
We evaluate the performance of the AVX-512 vectorized bloom filter, which we call HPBF (highly vectorized bloom filter). The evaluation emphasizes that naïvely adopting SIMD instructions leads to performance below what could actually be achieved.

Figure 5: Bloom filter throughput with varying SIMD vector size implementations (throughput in Mops vs. bloom filter size in bytes; Baseline (256-bit), AVX512 (naïve), and HPBF)

Figure 6: Bloom filter throughput with varying numbers of loop unrolling (Baseline (256+LU2), HPBF, and HPBF+LU2 through HPBF+LU5)

Figure 7: Bloom filter throughput at large sizes (128 MB and 256 MB; Baseline (256+LU2), AVX512 (naïve), HPBF, HPBF+LU4, and HPBF+LU4+PREFETCH)

We show the impact that data flow conversion using new instructions and memory access latency hiding techniques have on performance in this case study.

The bloom filter lookup experiment is executed on a platform with an Intel Xeon Silver 4114 at 2.2 GHz. The CPU has a 32 KB L1 data cache, a 1 MB L2 cache, and a 10 MB L3 cache on chip. We randomly generate 1M keys following a uniform distribution and measure the lookup throughput in millions of operations per second. The number of elements in the filter is set to 1/10 of the filter bits, with 3 hash functions, to obtain a moderate false positive rate. Error bars show a 95% confidence interval.

First, we show the performance improvement of HPBF, which extends the vector size from 256 to 512 bits and adopts expand-load/compress-store instructions in the data flow instead of the permutation table. We use only a single core of the CPU for all experiments. Figure 5 shows that HPBF outperforms the AVX/AVX2 (256-bit) baseline by up to 94%. We test bloom filter sizes that fit within the L1, L2, and L3 caches, as well as one that exceeds all cache levels. The throughput and the speedup rate decrease as the bloom filter grows, since memory accesses come to dominate performance. The naïve AVX-512 implementation, which only extends the vector size while retaining the same algorithm, shows little improvement due to its exponentially enlarged permutation table.

Figure 6 shows how loop unrolling affects the performance of the parallelized algorithm. The baseline is 256-bit vectors with factor-2 loop unrolling. Loop unrolling improves throughput drastically, by up to 162% compared to the baseline. The throughput grows as the loop unrolling factor increases up to a certain point, then degrades with further unrolling, because aggressive loop unrolling can incur register thrashing. We also find that the unrolling factor at the performance peak increases with the filter size.

Figure 8: Time cost (ns) for computing signature values for a batch of packets with a varying number of packets left in the batch (SERIAL_MURMUR, SERIAL_CRC, and AVX512_MURMUR)


Figure 9: End-to-end throughput for TCP packet traffic (throughput in Mbit/s and speedup rate vs. average number of traversed subtables per packet; serial-murmur, serial-crc, avx512-murmur-naïve, and avx512-murmur-transpose)

Figure 7 shows the impact of prefetching on top of loop unrolling when the bloom filter is large. Loop unrolling hides memory access latency by relying on the advanced out-of-order execution capabilities of modern CPUs, and we find that prefetching can further hide this latency: it outperforms the baseline, 256-bit with factor-2 loop unrolling, by up to 65%, and the loop unrolling implementation without prefetching by 25%.

4.2 Open vSwitch
We use two servers with Intel Xeon Gold 6128 CPUs at 3.40 GHz, one for running OvS and the other for running the software packet generator MoonGen [10].


The two machines are connected by dual-port 10 Gbps NICs. We generate minimum-size 64-byte TCP packets at line rate using the MoonGen packet generator. We run OvS v2.11.1 with DPDK, with the PMD thread pinned to an isolated core. The OvS-DPDK bridge is set up with rules that match all packets and switch them to the other port. We turn off the Exact Match Cache (EMC) to benchmark the performance of the standalone megaflow cache. This can be regarded as the worst-case performance, with a random and large number of short-lived connections whose packets frequently miss the EMC.

Figure 8 shows the time cost for producing signature values for one packet batch in a subtable. For the serial implementations, the time cost is proportional to the number of packets in the batch that have not yet found a matching rule. In contrast, the AVX-512 vectorized megaflow cache runs two iterations no matter how many packets are left. The reason is that AVX-512 processes 16 consecutive packets in the batch; it does not pick out the remaining packets to pack the vector, since recording the packets' indices in the batch and gathering their data from discontiguous memory would cost extra processing time. To make sure all positions in the batch are processed, we run two iterations regardless of whether fewer or more than 16 packets remain, so the AVX-512 vectorized megaflow cache has the same processing time across the x-axis. We also compare with the CRC hash algorithm, an alternative hash function in OvS that uses the CRC intrinsic functions of the SSE ISA. We find that when fewer than 8 packets remain in the batch, serial processing with CRC is faster than AVX-512, so we set the threshold value to 8 according to this result.

We evaluate the end-to-end throughput of the vectorized OvS with varying numbers of subtable traversals per packet. Figure 9 shows that the AVX-512 implementation with the transposed data structure outperforms all serial implementations when the number of subtables is larger than 5, and the speedup rate keeps growing, up to 17% compared to the CRC hash and 48% compared to the murmur hash function.

5 RELATED WORK AND FUTURE WORK

SIMD parallelism has been studied for several database systems. Zhou et al. [29] proposed vectorized database operations including sequential scans, aggregation, and index operations. Balkesen et al. explored SIMD implementations of sort-merge join and radix-hash join [5]. There are also parallel sorting algorithms for SIMD processors [6, 14]. Lang et al. presented an algorithm that addresses control flow divergence for

data-parallel query pipelines [18]. Polychroniou and Ross proposed the vectorized bloom filter [24] to efficiently utilize vector lanes.

SIMD capabilities are also adopted in various network applications and frameworks. DPDK [1] has vector PMDs as well as a hash table library that optimizes packet I/O and comparison operations using SSE/AVX instructions. VPP [4] also makes use of SIMD instructions in its processing. However, neither of them applies SIMD to code with control flow.

In the future, we will adapt our bloom filter design

to real-world network applications and also apply vectorized packet classification to other network applications.

6 CONCLUSION
In this paper, we proposed effective ways of leveraging SIMD parallelism to improve the performance of the data structures and algorithms used in network applications. Not only the conversion of an algorithm from control flow to data flow, but also memory access latency hiding and SIMD-aware data structure design must be considered. We showed their impact through two case studies: an AVX-512 vectorized bloom filter and the vectorized OvS megaflow cache. The performance of bloom filter lookup improved by up to 162%, and that of OvS by up to 48%.

7 ACKNOWLEDGMENT
This work is supported by the Institute of Civil Military Technology Cooperation Center (ICMTC) funded by the Korea government (MOTIE & DAPA) [18-CM-SW09]; and by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R-20160222-002755, Cloud based Security Intelligence Technology Development for the Customized Security Service Provisioning).

REFERENCES
[1] Data Plane Development Kit. https://www.dpdk.org/.

[2] Intel 64 and IA-32 Architectures Software Developer's Manual. https://software.intel.com/sites/default/files/managed/a4/60/325383-sdm-vol-2abcd.pdf.
[3] Intel Intrinsics Guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#.
[4] The Vector Packet Processor (VPP). https://fd.io/docs/vpp/master/whatisvpp/index.html.
[5] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. Proceedings of the VLDB Endowment, 7(1):85–96, 2013.
[6] Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, and Pradeep Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. Proceedings of the VLDB Endowment, 1(2):1313–1324, 2008.
[7] Byungkwon Choi, Jongwook Chae, Muhammad Jamshed, Kyoungsoo Park, and Dongsu Han. DFC: Accelerating string pattern matching for network applications. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 551–565, 2016.
[8] Sarang Dharmapurikar, Praveen Krishnamurthy, Todd Sproull, and John Lockwood. Deep packet inspection using parallel bloom filters. In 11th Symposium on High Performance Interconnects, 2003. Proceedings., pages 44–51. IEEE, 2003.
[9] Keith Diefendorff, Pradeep K. Dubey, Ron Hochsprung, and Hunter Scales. AltiVec extension to PowerPC accelerates media processing. IEEE Micro, 20(2):85–95, 2000.
[10] Paul Emmerich, Sebastian Gallenmüller, Daniel Raumer, Florian Wohlfart, and Georg Carle. MoonGen: A scriptable high-speed packet generator. In Proceedings of the 2015 Internet Measurement Conference, pages 275–287. ACM, 2015.
[11] Shahabeddin Geravand and Mahmood Ahmadi. Bloom filter applications in network security: A state-of-the-art survey. Computer Networks, 57(18):4047–4064, 2013.
[12] Younghwan Go, Muhammad Asim Jamshed, YoungGyoun Moon, Changho Hwang, and KyoungSoo Park. APUNet: Revitalizing GPU as packet processing accelerator. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 83–96, 2017.
[13] Sangjin Han, Keon Jang, KyoungSoo Park, and Sue Moon. PacketShader: A GPU-accelerated software router. ACM SIGCOMM Computer Communication Review, 41(4):195–206, 2011.
[14] Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu, and Toshio Nakatani. AA-sort: A new parallel sorting algorithm for multi-core SIMD processors. In 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007), pages 189–198. IEEE, 2007.
[15] Keon Jang, Sangjin Han, Seungyeop Han, Sue B. Moon, and KyoungSoo Park. SSLShader: Cheap SSL acceleration with commodity processors. In NSDI, pages 1–14, 2011.
[16] Anuj Kalia, Dong Zhou, Michael Kaminsky, and David G. Andersen. Raising the bar for using GPUs in software packet processing. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 409–423, 2015.
[17] Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, and Sue Moon. NBA (network balancing act): A high-performance packet processing framework for heterogeneous processors. In Proceedings of the Tenth European Conference on Computer Systems, page 22. ACM, 2015.
[18] Harald Lang, Linnea Passing, Andreas Kipf, Peter Boncz, Thomas Neumann, and Alfons Kemper. Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines. The VLDB Journal, 29(2):757–774, 2020.
[19] Samuel Larsen, Rodric Rabbah, and Saman Amarasinghe. Exploiting vector parallelism in software pipelined loops. In 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 11 pp. IEEE, 2005.
[20] Daniel Lustig and Margaret Martonosi. Reducing GPU offload latency via fine-grained CPU-GPU synchronization. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pages 354–365. IEEE, 2013.
[21] Gaurav Mitra, Beau Johnston, Alistair P. Rendell, Eric McCreath, and Jun Zhou. Use of SIMD vector operations to accelerate application code performance on low-powered ARM and Intel platforms. In 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, pages 1107–1116. IEEE, 2013.
[22] Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Joe Stringer, Pravin Shelar, et al. The design and implementation of Open vSwitch. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 117–130, 2015.
[23] Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1493–1508. ACM, 2015.
[24] Orestis Polychroniou and Kenneth A. Ross. Vectorized bloom filters for advanced SIMD processors. In Proceedings of the Tenth International Workshop on Data Management on New Hardware, page 6. ACM, 2014.
[25] Haoyu Song, Sarang Dharmapurikar, Jonathan Turner, and John Lockwood. Fast hash table lookup using extended bloom filter: An aid to network processing. In ACM SIGCOMM Computer Communication Review, volume 35, pages 181–192. ACM, 2005.
[26] Venkatachary Srinivasan, Subhash Suri, and George Varghese. Packet classification using tuple space search. In ACM SIGCOMM Computer Communication Review, volume 29, pages 135–146. ACM, 1999.
[27] Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, and Sotiris Ioannidis. GASPP: A GPU-accelerated stateful packet processing framework. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 321–332, 2014.
[28] Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment, 2(1):385–394, 2009.
[29] Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 145–156. ACM, 2002.
