
Deliverable 5.7: Final Report on the Universal Node Architecture, Orchestration and Performance Evaluation

Dissemination level: PU
Version: 0.1
Due date: 30.06.2016
Version date: 30.07.2016

This project is co-funded by the European Union


Document information

Editor: Gergely Pongrácz (Ericsson)

Contributors: Hagen Woesner (BISDN), Fulvio Risso (POLITO), Ivano Cerrato (POLITO), Jon Matias (EHU), Vinicio Vercellone (TI), Kostas Pentikousis (TP)

Reviewers: Kostas Pentikousis (TP)

Coordinator: Dr. András Császár
Ericsson Magyarország Kommunikációs Rendszerek Kft. (ETH)
Könyves Kálmán körút 11/B épület
1097 Budapest, Hungary
Fax: +36 (1) 437-7467
Email: [email protected]

Project funding: 7th Framework Programme, FP7-ICT-2013-11, Collaborative project, Grant Agreement No. 619609

Legal Disclaimer

The information in this document is provided "as is", and no guarantee or warranty is given that the information is fit for any particular purpose. The above referenced consortium members shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials subject to any liability which is mandatory due to applicable law.

© 2013 - 2016 by UNIFY Consortium


Executive Summary

This document summarizes the final results of the Universal Node research and development carried out in the context of the UNIFY project.

The Universal Node comprises several components, including the UN Orchestrator, the software flow switch, a range of compute execution environments and network virtualization facilities at L2 and L3, as well as the UN northbound interface, which enables recursive, domain-oriented orchestration and information sharing for node description, capabilities and resource discovery.

In the course of the project the Universal Node has been designed, developed, and showcased at several venues. Many original design decisions proved correct, while others were revisited based on implementation experience, as per the original project plan. This deliverable presents the final Universal Node architecture and orchestration mechanisms, summarizes the use cases considered, and reports on the modelling and empirical performance evaluation results.


Contents

1 Introduction
2 Universal Node: Architecture and Orchestration
  2.1 Preliminaries and Definitions
  2.2 Operating Context
  2.3 Design Objectives
    2.3.1 Domain-oriented Orchestration
    2.3.2 Network Abstraction
    2.3.3 Compute Abstraction
    2.3.4 Joint Optimization of the Service Graph Network and Compute
    2.3.5 NF Direct Connections
    2.3.6 High Performance
    2.3.7 Small Footprint
    2.3.8 Native Network Functions Support
  2.4 Universal Node Architecture
    2.4.1 un-orchestrator Northbound Interface
    2.4.2 Node Resource Manager
    2.4.3 Network Manager
    2.4.4 Traffic Steering
    2.4.5 Compute Manager
    2.4.6 Native Network Functions Support
    2.4.7 VNF resolver
    2.4.8 Internal Bus
    2.4.9 Monitoring Manager Plugin
  2.5 UN Data Plane Design Considerations
    2.5.1 VNF Communication
    2.5.2 Ericsson Research Flow Switch
  2.6 VNF Repository
  2.7 Supported Platforms
  2.8 UN and the ETSI NFV Architecture
  2.9 UN-based Elastic Router
  2.10 Related Work
3 Universal Node Use Cases
  3.1 Elastic Router
  3.2 Broadband Network Gateway
    3.2.1 QinQ to GRE Tunneling and Routing
    3.2.2 Built-in NAT Functionality
4 UN Modelling
  4.1 Considerations for Elementary Operations
    4.1.1 Data transfer between memory and the NIC
    4.1.2 Computing checksums
    4.1.3 Hash lookup
    4.1.4 Longest Prefix Match lookup
  4.2 Modelled use cases
    4.2.1 Network Address Translation (NAT)
    4.2.2 Border Network Gateway
    4.2.3 IPSec ESP protocol
  4.3 Mapping to Intel Xeon Sandy Bridge
    4.3.1 The Sandy Bridge architecture
    4.3.2 Fine tuning EOs
    4.3.3 NAT performance estimation
    4.3.4 BNG performance estimation
    4.3.5 IPSec ESP performance estimation
  4.4 Mapping to Intel Atom
    4.4.1 Changed EOs
    4.4.2 NAT performance estimation
    4.4.3 BNG performance estimation
    4.4.4 IPSec ESP performance estimation
  4.5 Mapping to Cavium ThunderX CN88XX
    4.5.1 Hardware architecture and specifications
    4.5.2 Elementary Operations on ThunderX
    4.5.3 NAT performance estimation
    4.5.4 BNG performance estimation
    4.5.5 IPSec ESP performance estimation
5 Performance Evaluation
  5.1 Broadband Network Gateway
    5.1.1 QinQ to GRE pipeline
    5.1.2 NAT implementation and validation
    5.1.3 BNG validation
6 Conclusions


List of Figures

1 Network functions and services example
2 NF deployment on a carrier network equipped with Universal Nodes
3 Main features of the Universal Node
4 High-level view of the Universal Node
5 Detailed architecture of the Universal Node
6 Excerpt of Network Function - Forwarding Graph (NF-FG)
7 Example of vlan endpoint in the NF-FG
8 Example of GRE-tunnel endpoint in the NF-FG
9 un-orchestrator new NF-FG deployment message sequence diagram
10 Example of splitting traffic steering rules
11 Traffic crossing network functions: (a) the service graph; (b) its implementation on the UN
12 Implementing direct channels between VMs in OvS
13 ETSI NFV architecture
14 Mapping UN components on the ETSI NFV architecture
15 System view of the Elastic Router prototype
16 Broadband Network Gateway with external NAT: end-to-end scenario
17 Broadband Network Gateway with external NAT: pipeline setup
18 Broadband Network Gateway application with built-in NAT
19 NAT UpLink model
20 NAT DownLink model
21 BNG UpLink model
22 BNG DownLink model
23 IPSec ESP model
24 Simplified pipeline of Sandy Bridge
25 AES-256 performance estimation on Intel Xeon
26 SHA-256 performance estimation on Intel Xeon
27 AES-256 performance estimation on Intel Atom
28 SHA-256 performance estimation on Intel Atom
29 Cavium ThunderX Coherent Memory Interconnect
30 Measurement setup for the simplified BNG setup
31 Xeon NAT UpLink estimation
32 Xeon NAT DownLink estimation
33 Atom NAT UpLink estimation
34 Atom NAT DownLink estimation
35 Xeon BNG UpLink estimation
36 Xeon BNG DownLink estimation
37 Atom BNG UpLink estimation
38 Atom BNG DownLink estimation


1 Introduction

The goal of this deliverable is to report the final results of the project achievements on the Universal Node line of work. The UNIFY project set out to address the "rigid network control and infrastructure limits" that hinder today's carrier network operators from flexibly specifying and deploying new network services. In this effort, state-of-the-art orchestration in the management and control planes and high-performance data plane functionality are a top priority. Flexibility to employ different hardware and virtualization solutions, based on a variety of factors ranging from capital and operational costs (CAPEX, OPEX) to end-to-end service delivery objectives and considerations of widespread infrastructure programmability, is of great importance as well. Addressing software and hardware heterogeneity across the carrier network infrastructure was also high on our research agenda.

The UNIFY project addresses the full network spectrum from home and enterprise networks through aggregation and core networks to data centres, creating a unified, programmable production environment in which virtual network functions (VNFs) can be instantiated and chained together to create complete network services. The Universal Node plays a central role in this environment, as it is able to run several VNFs on multiple different architectures at high performance.

Essentially, the Universal Node combines packet forwarding and network function operations in a "single box". Throughout the project we experimented with various such boxes in which VNFs are designed and implemented to operate in a virtualized environment. In practice, we employed "lightweight" VNFs in containers (LXC/Docker) as well as more "heavyweight" virtualization approaches based on KVM/Xen. Moreover, we also explored the use of native network functions that, when designed with multi-tenancy in mind, can also be included under the control of the overarching and domain-specific orchestrator, thus enabling us to deliver on the key premise of the project that the same mechanisms are applied across the entire network.

To address the challenges in this area, WP5 designed, implemented, and showcased on several occasions the UN Orchestrator (un-orchestrator), which plays the pivotal role of unifying the northbound interface of the Universal Node. The un-orchestrator provides network and compute abstractions, facilitates domain-oriented orchestration, and enables joint optimization of the service graph compute and network resources. In practice the un-orchestrator has been used on a very wide range of architectures, from high-end Intel servers to commercial Wi-Fi routers used as customer premises equipment (CPEs) to very low-cost ARM-based switches.

Below the un-orchestrator, a foundational component of the Universal Node is a software flow switch, which is able to direct the traffic between the VNFs as well as provide network functions such as routing, tunneling or firewalling.

From a data plane perspective, high performance is a prerequisite for the adoption of the Universal Node in production environments. There are several hardware architectures that can be used for running network functions, from standard Intel x86 CPUs and increasingly popular ARM CPUs to specialized, custom-designed Network Processing Units (NPUs), and different VNFs might perform better on one architecture than on others. This deliverable therefore complements Deliverable D5.6 [34], which presented component performance evaluation results, by reporting complete service pipeline performance evaluation results.

The remainder of this deliverable is organized as follows. We revisit the UN design objectives and architecture in Section 2 and detail how our implementation experience forged certain research directions. We then summarize in Section 3 the use cases considered in evaluating the Universal Node, namely the Elastic Router use case from a control and management plane point of view and the Broadband Network Gateway (BNG) use case from a data plane performance point of view. We model the Universal Node in Section 4, present empirical network service pipeline performance evaluation results in Section 5, and conclude this document in Section 6.


Figure 1: Network functions and services example

2 Universal Node: Architecture and Orchestration

This section presents the Universal Node system architecture and details how orchestration is implemented. In the following subsection we introduce the key concepts and terms that we will use in the remainder of this deliverable. We then present the Universal Node design objectives and architecture before delving into the data plane design considerations. We conclude this section with a comparison to ETSI NFV and other related work.

2.1 Preliminaries and Definitions

A network function (NF), or simply function, is a functional block or application that either operates on (e.g., reads, manipulates) traffic in transit, or acts as the origin/final destination of said traffic. Without loss of generality, and essentially for illustration purposes, we will often refer to "firewall" and "NAT" as examples of the first class of functions in the remainder of this section. The second class includes, for example, applications related to the "traditional" cloud computing world, such as the private storage depicted in Figure 1.

A service graph is a set of functions suitably interconnected to implement a specific service (e.g., a comprehensive security solution). As illustrated in Figure 1, links between functions can potentially be marked with the traffic that has to cross such connections: this feature allows us to differentiate how various types of traffic are treated.

As can be surmised from Figure 1, each function may, in turn, be defined as a service graph. As an example, consider the NF security in Figure 1(a), which is in fact a service graph composed of a firewall and an ad blocker, as shown in Figure 1(b).

We define an infrastructure node as a physical machine (which can range, e.g., from a high-volume standard server to a customer premises equipment (CPE) WiFi router) that can execute NFs in one or more execution environments, and which implements the paths among them as described in the service graph.

A capability represents a characteristic of the infrastructure node which can be employed, for example, by an overarching orchestrator to select the best node on which (part of) the service graph has to be instantiated.

We distinguish between two types of capabilities: functional and infrastructure. The former indicates that the infrastructure node is able to implement specific functions, e.g., because they are implemented with a software/hardware component available on the node itself, or because they are available as software images that can be executed on demand. Examples of functional capabilities include firewall and NAT, possibly with some specific attributes such as 3 ports, NICs with a line rate of 1 Gb/s, support for IPv4 or IPv6, etc.

A functional capability does not specify how a function is implemented. The same capability may in fact be implemented in a variety of ways, for example, in a virtual machine (VM) or as a process running natively on the node and taking advantage of any available hardware acceleration facilities, such as for encryption/decryption. The selection of the most suitable implementation for the task at hand is left to the infrastructure node. An infrastructure capability is instead a low-level characteristic of the infrastructure node, such as its CPU architecture, the possibility to execute KVM-based VMs or Docker containers, and so on.

A resource represents the available amount of a hardware component of the infrastructure node that can be used to execute NFs, such as available RAM (e.g., 4 GB), CPU (e.g., CPU 1: 100%, CPU 2: 25%), hardware acceleration, and so on. Typically this information is taken into account by the overarching orchestrator while selecting the best infrastructure node for NF deployment.
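As an illustration of these definitions, the following Python-dict sketch shows how a node might advertise one functional capability, its infrastructure capabilities, and its resources; all field names here are hypothetical and are not taken from the project's NF-FG or YANG schemas:

# One functional capability (a firewall with its attributes), the node's
# infrastructure capabilities, and its free resources.
node_description = {
    "functional-capabilities": [
        {
            "type": "firewall",                 # WHAT the node can do,
            "ports": 3,                         # with which attributes;
            "nic-line-rate": "1Gb/s",           # note that nothing says
            "ip-versions": ["IPv4", "IPv6"],    # HOW it is implemented
        }
    ],
    "infrastructure-capabilities": ["x86_64", "kvm", "docker"],
    "resources": {
        "ram-available": "4GB",
        "cpu": {"cpu1": "100%", "cpu2": "25%"},
    },
}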

2.2 Operating Context

Figure 2 depicts a telecom operator network consisting of several heterogeneous infrastructure nodes that can be used to execute NFs. As mentioned above, infrastructure nodes range from a CPE at the subscriber home, typically based on low-cost hardware, to high-end servers in the operator data centers. Each infrastructure node is actually a Universal Node that receives instructions from an overarching orchestrator (or a hierarchy of orchestrators) and executes all operations required to make the service fully functional.

Moreover, the operator network can also include other modules (e.g., a GUI used to define the desired service graph) that interact with each other, with the overarching orchestrator and potentially even with the UNs, in order to make the service deployment possible.

Note that this deliverable focuses on the UN. The description of the overarching orchestrator and other modules that are part of the operator NFV architecture is out of scope; for instance, the role of the overarching orchestrator can be played by existing prototypes such as FROG [24] or ESCAPE [21][25].

2.3 Design Objectives

This section summarizes the design objectives of the UN (illustrated in Figure 3), focusing primarily on the unique characteristics of this platform when compared to earlier proposals in the NFV literature, both with respect to academic work and industry initiatives.

2.3.1 Domain-oriented Orchestration



Figure 2: NF deployment on a carrier network equipped with Universal Nodes


Figure 3: Main features of the Universal Node. On the control plane side: domain-oriented orchestration that exports functional capabilities (e.g., NAT) and not only infrastructure parameters (e.g., CPU, RAM), so that upper layers can ask for a given function (e.g., firewall) rather than for a VM; automatic and transparent selection of the best implementation for a function; network abstraction (support for multiple virtual switches); compute abstraction (support for multiple execution environments); joint optimization of the service graph network and compute; and a small footprint that allows the UN to run on domestic CPEs. On the data plane side: high performance (support for the Ericsson Research Flow Switch), support for hardware accelerators, NF direct connections, and Native Network Functions support.

The Universal Node can interact with overarching orchestrators operating with two different models: domain-oriented orchestration based on functional capabilities, and what we will refer to as "legacy" orchestration, akin to what is currently used in the cloud computing/data center world, based on infrastructure capabilities and available resources. More specifically, the former model abstracts each infrastructure node, which we will refer to as a "domain", with a set of functional capabilities, thus hiding from the overarching orchestrator the internal details of each domain, such as the amount of available resources or the way in which a NF is actually implemented.

Consequently each UN exposes functional capabilities, in addition to information such as the amount of CPU/RAM available on the node and infrastructure capabilities such as the possibility to execute KVM-based virtual machines (needed to be compatible with legacy orchestrators).

This enables upper orchestration layers to request a NF such as a firewall instead of prescribing a "VM implementing a firewall". This has the following advantages: (i) a UN can support different NF implementations and employ them transparently to the upper orchestration layers; and (ii) a UN can flexibly select the best implementation for a given request and specific NF, according to the current state of the UN per se (i.e., number of CPU cores currently available, possibility to exploit hardware acceleration and so on), and then operate the NF in the most efficient way.

2.3.2 Network Abstraction

The UN control plane can interact with different virtual switches (vSwitches) in order to implement paths between functions, each of which may have characteristics appropriate for a specific UN deployment. For instance, a vSwitch optimized to exploit hardware acceleration available on the specific node is well suited when such hardware component(s) exist, while it may not be the best choice in cases where the UN is deployed on a high-volume standard server.


2.3.3 Compute Abstraction

The UN control plane is able to interact with different execution engines and therefore execute NFs in different ways, depending for example on the availability of some hardware accelerator or specific software libraries on the infrastructure node, and the amount of available resources.

2.3.4 Joint Optimization of the Service Graph Network and Compute

The service to be implemented by the UN is specified according to the Network Functions-Forwarding Graph (NF-FG) formalism detailed in Section 2.4.1, which describes both the compute and networking aspects of the service, i.e., the required functions and the interconnections between them. This way, each UN maintains the complete overview of the entire service, and can thus optimize, for instance, the binding between NFs and CPU cores by considering the way NFs are interconnected with each other in the service graph.

2.3.5 NF Direct Connections

When traffic steering is implemented through Open vSwitch (OvS), and NFs are DPDK [12] processes executed in VMs, the UN should be able to optimize packet transfer between NFs by bypassing the vSwitch when the service specifies point-to-point connections among them.

2.3.6 High Performance

The UN supports the Ericsson Research Flow Switch [44], which implements a high-speed OpenFlow pipeline that can be configured through OpenFlow 1.3. The two main optimizations ERFS uses are real-time, per-table optimization of classification/lookup, and just-in-time building of the packet processing code.

The reason for developing an entirely new OpenFlow pipeline was that OpenFlow as a protocol seemed generic enough to meet the requirements of several network functions, yet OpenFlow software switch implementations seemed to lack high performance. The community also seemed to think that although OpenFlow software switches are true masterpieces of genericity, this genericity usually comes at a high performance cost. With the design and development of ERFS we argued that this is not a must, but merely a design decision, or rather a flaw, in current OpenFlow pipelines. ERFS uses a novel switch architecture that capitalizes on the observation that OpenFlow pipelines are sufficiently structured to admit efficient machine code representations, constructed out of simple packet processing and classification templates. The resulting specialized data paths were shown to give major gains over flow-caching-based alternatives, with several times higher raw packet rates, much smaller latency, and, perhaps most importantly, robust and predictable performance even with widely varying, or outright malicious, workloads. The proposed switch architecture easily scales to hundreds of flow tables and hundreds of thousands of traffic flows, while supporting fast-path updates at similar or higher intensity.

2.3.7 Small Footprint

A small memory footprint enables service deployment on resource-limited hardware already available at the edge of the network, e.g., on residential CPEs, and not only on high-volume standard servers available in data centers. This is a design feature that in a real-world deployment will enable an overarching orchestrator to optimize NF placement and scheduling. As an illustrative example, NFs required to be close(r) to the end users (e.g., secure tunnel termination, low-latency functions, etc.) can be instantiated directly on the CPE, while other functions of the same service (e.g., network address translation) can be executed in a data center.

2.3.8 Native Network Functions Support

Virtualized engines typically exploited to execute NFs (e.g., VMs, Docker containers) are quite demanding in terms of resources (e.g., memory, CPU); therefore they may not be the most appropriate solution in case of low-cost, resource-constrained devices such as CPEs. Such devices typically run a Linux-based operating system that includes a number of software modules (e.g., iptables) that can be used to implement NFs executed directly on the host, in order to provide services such as those that can be executed in VMs/containers (e.g., firewall), but with reduced overhead. Furthermore, CPEs may include some hardware components (e.g., a crypto hardware accelerator, an integrated L2 switch) that can be exploited to implement NFs as well. The UN exploits these native (software and hardware) modules available in a CPE to implement Native Network Functions (NNFs).

Both the small footprint of the UN components and the support for NNFs enable the operator to deploy the service graph on CPEs, which are then integrated in the overall NFV carrier architecture, as per the UNIFY vision of a "unified production environment". In turn, this gives an overarching orchestrator the possibility to optimize the scheduling of functions by starting those that need to be close to the end users (e.g., IPsec termination, low-latency functions, etc.) directly on the user CPE, while other functions of the same service (e.g., the NAT) can be executed in a remote data center.

2.4 Universal Node Architecture

Figure 4 illustrates a high-level view of the UN control and data planes. The control plane includes the Universal Node orchestrator (un-orchestrator), which handles the service graph lifecycle, and the VNF repository, which is employed by the un-orchestrator to select the VNF image implementing a specific function. The VNF repository may also be deployed on another server and can be contacted by multiple UNs. The data plane includes, for example, a vSwitch that manages the traffic paths between the NFs, and a number of compute engines that can execute NFs implemented with different technologies (e.g., virtual machines, containers, or natively).

The un-orchestrator, illustrated in Figure 5, is the main component of the UN control plane, as it implements most of the features described in Section 2.3. It orchestrates compute and network resources within the Universal Node and manages the complete lifecycle of the virtual execution environment(s) and networking primitives. The un-orchestrator receives commands through a northbound interface and takes care of implementing them on the infrastructure node; it relies on the VNF resolver to select the best available VNF implementation that matches the service request.

The modules of the un-orchestrator are described in the remainder of this section, together with the un-orchestrator northbound interface, which is used to interact with the upper layers of the orchestration architecture.



Figure 4: High-level view of the Universal Node

2.4.1 un-orchestrator Northbound Interface

The un-orchestrator interacts with the upper layers of the orchestration architecture (see Section 2.2) through a bidirectional northbound interface. Specifically, the un-orchestrator receives a service graph described according to the NF-FG formalism, and exports information using an OpenConfig-derived YANG module describing the UN domain.

As shown in Figure 5, create, read, update and destroy (CRUD) commands related to the NF-FG are received through a REST API, while the UN description is exported through DoubleDecker (DD), which, among other things, supports a publish/subscribe model. For more details on DD see Deliverable D4.4 [29], Section 6.

The overarching orchestrator knows exactly the UN entity (e.g., the IP address of the un-orchestrator) on which (part of) the service has to be created/updated/deleted, and hence a REST interface is appropriate for this functionality. On the other hand, by employing DD the un-orchestrator does not need to keep track of the consumers of the description it exports, as there may be several entities interested in such information. All UNs publish their description through DD using a specific topic, enabling all entities interested in such information to subscribe to that topic. In this case, a pertinent example is the above-mentioned orchestrator, which can use the UN descriptions to select the node where the service has to be deployed.
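For illustration, the following Python sketch drives such a CRUD interface over HTTP; the base URL, port and path are hypothetical placeholders rather than the exact REST schema of the un-orchestrator prototype:

import json
import urllib.request

# Hypothetical address of a un-orchestrator REST server; the actual
# host, port and path depend on the deployment.
BASE = "http://un-orchestrator.example:8080/NF-FG"

def deploy_nffg(graph_id, nffg):
    # Create or update a graph (the C/U of CRUD) with an HTTP PUT.
    req = urllib.request.Request(
        "%s/%s" % (BASE, graph_id),
        data=json.dumps(nffg).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT")
    urllib.request.urlopen(req)

def delete_nffg(graph_id):
    # Destroy a deployed graph (the D of CRUD).
    req = urllib.request.Request("%s/%s" % (BASE, graph_id), method="DELETE")
    urllib.request.urlopen(req)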

As depicted in Figure 5, the REST server interacts with the security manager, a module that manages authentication and checks the permissions of the entities that can send commands to the un-orchestrator. For instance, only the telecom operator may be allowed to deploy NF-FGs, while end users may only be authorized to read the current service graph operating on their own Internet connection.

The NF-FG formalism describes the service to be instantiated on the Universal Node both in terms of compute abstractions (i.e., functions composing the service) and network abstractions (i.e., traffic steering rules) [24]. Furthermore, it may contain annotations expressed using the MEASURE language (described in [29], Section 3) that specify which parameters of the NF-FG should be monitored. As shown in Figure 6, each NF-FG consists of three main parts, which we explain next.

First, the VNFs part lists the functions that compose the service. In particular, the NF-FG may require a function without specifying how it must be implemented (e.g., firewall in the figure).



Figure 5: Detailed architecture of the Universal Node


In this case the proper image is selected by the un-orchestrator through its interaction with the VNF resolver. However, the NF-FG can also request a specific implementation of a function, by specifying a template that describes it in terms of, e.g., the image to be executed, the number of CPU cores needed, the technology to be used to implement the virtual network interface cards (vNICs), and so on.

The endpoints section describes the ingress/egress points of the traffic in the (part of the) service deployed on the UN. Endpoints can be used both in the match and in the action part of the traffic steering rules. In general, we have considered L1, L2 and L3 endpoints. In the prototypical realization we implemented three specific endpoints, namely interface, vlan and GRE-tunnel, as typical representatives of the corresponding L1-L3 endpoints. While interface endpoints (Figure 6) correspond to physical interfaces of the UN, a vlan endpoint only includes traffic associated with a physical interface belonging to a specific VLAN. This means that the UN guarantees that only traffic with a specific VLAN ID arrives from this endpoint (e.g., VLAN ID 25 in Figure 7), and that all traffic inserted on such an endpoint is tagged (by the UN itself) with the proper VLAN ID.

The GRE-tunnel endpoint (an example of which is shown in Figure 8) represents the termination of a GRE tunnel. The UN guarantees that only the traffic encapsulated in a specific GRE tunnel enters from this endpoint, and that all traffic exiting from such an endpoint will be encapsulated in such a GRE tunnel. Both the vlan and GRE-tunnel endpoints can be used to properly steer traffic between parts of the same service that are deployed on different infrastructure nodes.

Finally, the big-switch section in Figure 6 describes the interconnections between NF ports and endpoints. We opted to use semantics akin to OpenFlow, whereby each connection is characterized by: (i) a priority; (ii) a match on an endpoint/VNF port and potentially on protocol fields (e.g., IP source); and (iii) an action that forwards packets through a specific endpoint/VNF port and potentially modifies the packet content (e.g., decreases the IPv4 TTL).

The UN description published by the un-orchestrator via DD includes both network and compute aspects of the infrastructure node. From the network point of view, a UN is abstracted as a "big switch" with a set of endpoints, each one characterized by the following (optional) information: (i) a neighbor domain, i.e., the identifier of another infrastructure node (e.g., a UN) with which there is a direct connection; (ii) an IP address; (iii) support for VLAN traffic, in which case the endpoint indicates the available VLAN IDs; and (iv) support for GRE tunnels.

From the compute point of view, the UN instead exports functional and infrastructure-level capabilities, as well as available resources, as mentioned earlier.
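A possible shape of this exported description is sketched below as a Python dict purely for illustration; the actual export is the OpenConfig-derived YANG module mentioned above, and these key names are hypothetical:

# "Big switch" view of a UN with one endpoint carrying the four optional
# fields (i)-(iv), plus the compute-side capabilities and resources.
un_description = {
    "network": {
        "endpoints": [
            {
                "id": "ep1",
                "neighbor-domain": "UN-2",   # (i) directly connected node
                "ip-address": "10.0.0.1",    # (ii)
                "vlan-ids": [25, 26, 27],    # (iii) available VLAN IDs
                "gre-support": True,         # (iv)
            }
        ]
    },
    "compute": {
        "functional-capabilities": ["firewall", "NAT"],
        "infrastructure-capabilities": ["kvm", "docker"],
        "resources": {"ram-available": "4GB"},
    },
}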

2.4.2 Node Resource Manager

The node resource manager is the main module of the un-orchestrator, as it handles the commands received through the REST API, and exports the UN description both at node boot time and at every node reconfiguration.

More specifically, when the un-orchestrator receives a command to create a new NF-FG, the node resource manager (i) interacts with the VNF resolver in order to select the most appropriate VNF image for each function that is part of the service and that is not explicitly associated with an image in the NF-FG; (ii) configures the vSwitch to create a new Logical Switch Instance (LSI) and the ports required to connect it to the VNFs to be deployed; (iii) deploys and starts the selected VNFs; and (iv) configures the forwarding table(s) of the LSI(s) according to the traffic steering rules to be implemented.
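The following Python-style sketch condenses steps (i)-(iv); all object and method names are hypothetical stand-ins for the VNF resolver, network manager and compute manager interfaces described later in this section:

def deploy_nffg(nffg, vnf_resolver, network_mgr, compute_mgr):
    # (i) choose an implementation for every NF without an explicit image
    images = {nf.id: vnf_resolver.select_implementation(nf)
              for nf in nffg.vnfs if nf.image is None}
    # (ii) create the graph-LSI and the VNF ports on the vSwitch
    lsi = network_mgr.create_lsi()
    ports = {nf.id: [network_mgr.create_port(lsi, p.technology) for p in nf.ports]
             for nf in nffg.vnfs}
    # (iii) deploy and start the selected VNFs
    for nf in nffg.vnfs:
        vnf = compute_mgr.create_vnf(ports[nf.id], images.get(nf.id, nf.image))
        compute_mgr.start_vnf(vnf)
    # (iv) install the traffic steering rules in the LSI forwarding tables
    for rule in nffg.big_switch.flow_rules:
        network_mgr.create_ts_rule(lsi, rule)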


{
  "nf-fg": {
    "id": "0x1",
    "name": "NF-FG",
    "VNFs": [
      {
        "id": "0x1",
        "name": "firewall",
        "ports": [
          {
            "id": "0xa",
            "name": "internal port"
          },
          ....
        ]
      },
      ....
    ],
    "endpoints": [
      {
        "id": "0x1",
        "type": "interface",
        "interface": {
          "if-name": "eth1"
        }
      }
      ....
    ],
    "big-switch": {
      "flow-rules": [
        {
          "id": "0x1",
          "priority": 1,
          "match": {
            "port_in": "endpoint:0x1",
            ....
          },
          "actions": [
            {
              "output_to_port": "vnf:0x1:0xa",
              ....
            }
          ]
        },
        ....
      ]
    }
  }
}

Figure 6: Excerpt of Network Function - Forwarding Graph (NF-FG)


{
  "id": "0x2",
  "type": "vlan",
  "vlan": {
    "vlan-id": "25",
    "if-name": "eth1"
  }
}

Figure 7: Example of vlan endpoint in the NF-FG

{
  "id": "0x3",
  "type": "gre-tunnel",
  "gre-tunnel": {
    "local-ip": "10.0.0.1",
    "remote-ip": "10.0.0.2",
    "gre-key": "0x1"
  }
}

Figure 8: Example of GRE-tunnel endpoint in the NF-FG

Similarly, the node resource manager takes care of updating or destroying a graph when the corresponding commands are received.

Figure 5 shows that the un-orchestrator includes, among other modules, the network manager and the compute manager, both of which are employed by the node resource manager to interact respectively with the vSwitch and the execution engines, in order to create the proper paths and start the proper VNFs.

The VNF resolver interacts with the VNF repository and selects the best implementation for the required functions, according to parameters such as the amount of compute and memory resources available on the UN. Finally, in the Universal Node system architecture the VNF scheduler optimizes the VNF/CPU core(s) binding(s) by taking into consideration information such as how a VNF interacts with the rest of the NF-FG.

Figure 9 presents the message sequence diagram for the case where the un-orchestrator deploys a new NF-FG.

2.4.3 Network Manager

The network manager is the un-orchestrator module that handles the networking part of the service graph. It interacts with several vSwitches (with possibly different characteristics) in order to create the virtual network infrastructure that implements the paths described in the NF-FG, and which supports the parallel instantiation of multiple service graphs.

As shown in Figure 5, the network manager implements the network paths between VNF ports and endpoints through multiple LSI layers. The foundation LSI-0 is overlaid by a set of LSIs (graph-LSI-N, where N ≥ 1), each one in charge of implementing the paths between the VNFs of a different graph. LSI-0 is created at boot time and dispatches the traffic from the UN physical interfaces to the graph-LSIs, while the additional LSIs (each one created when a new NF-FG has to be deployed) implement the traffic steering paths between the VNFs and GRE-tunnel endpoints that belong to that graph.



Figure 9: un-orchestrator new NF-FG deployment message sequence diagram


Function                                    Description
lsi *createLSI()                            Create a new LSI
void destroyLSI(lsi)                        Delete a specific LSI
list<*link> connectLSIs(lsi1, lsi2, N)      Create N virtual links between two LSIs
void destroyVlink(link)                     Destroy a virtual link between two LSIs
port *createPort(lsi, technology)           Create a (VNF) port with a specific technology on an LSI
void destroyPort(port)                      Destroy a specific (VNF) port
endpoint *createEndpoint(lsi, description)  Create an endpoint on an LSI, according to a specific description
void destroyEndpoint(endpoint)              Delete a specific endpoint

Table 1: Management plane interface

Function                        Description
void createTSRule(lsi, rule)    Send a traffic steering rule to the LSI
void deleteTSRule(lsi, rule)    Remove a traffic steering rule from the LSI

Table 2: Control plane interface

In fact, while physical interfaces are connected to LSI-0, VNF ports and GRE-tunnel endpoints are connected to the associated graph-LSI.

Setting up the paths described in the NF-FG requires the network manager to interact with the vSwitch to: (i) create a new graph-LSI, through the management plane interface, with the required virtual ports that will later be attached to the VNFs (the technology of the virtual ports depends on the VNF image selected; for instance, in the case of a DPDK-enabled process, dpdkr ports must be used to interconnect the vSwitch with said VNF); and (ii) program, through the control plane interface, the forwarding tables of LSI-0 and of the new graph-LSI in order to realize traffic steering.

The management plane interface enables the network manager to manage different vSwitches (without knowing anything about the switching technology) through the set of primitives listed in Table 1 and implemented by each technology-specific driver. Basically, these primitives enable the network manager to (i) create/destroy a graph-LSI, (ii) create/destroy a port that will then be connected to a VNF, (iii) create/destroy virtual links between two graph-LSIs, and (iv) create/destroy endpoints.

Currently, our prototype supports Open vSwitch (OvS) [45], the eXtensible DataPath daemon (xDPd) [19] and the Ericsson Research Flow Switch (ERFS), but further vSwitches can be supported by writing the corresponding driver implementing the primitives of Table 1.
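As a sketch of what such a driver contract can look like, the management plane primitives of Table 1 are rendered below as a Python abstract base class; the names merely transliterate the table and are not the prototype's actual interface:

from abc import ABC, abstractmethod

class VSwitchDriver(ABC):
    # Management plane interface of Table 1; one subclass per vSwitch
    # (OvS, xDPd, ERFS, ...).

    @abstractmethod
    def create_lsi(self): ...                         # lsi *createLSI()

    @abstractmethod
    def destroy_lsi(self, lsi): ...                   # void destroyLSI(lsi)

    @abstractmethod
    def connect_lsis(self, lsi1, lsi2, n): ...        # N virtual links between LSIs

    @abstractmethod
    def destroy_vlink(self, link): ...

    @abstractmethod
    def create_port(self, lsi, technology): ...       # e.g., technology="dpdkr"

    @abstractmethod
    def destroy_port(self, port): ...

    @abstractmethod
    def create_endpoint(self, lsi, description): ...  # e.g., a GRE-tunnel endpoint

    @abstractmethod
    def destroy_endpoint(self, endpoint): ...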

Similarly, the control plane interface enables the network manager to configure the forwarding table(s) of the LSIs by means of multiple technologies (e.g., OpenFlow, eBPF), while at the same time hiding from the network manager the actual technology used. The primitives defined by the control plane interface are listed in Table 2 and must be implemented by the technology-specific controllers in order to enable the network manager to insert/remove traffic steering rules in/from a corresponding LSI.

As shown in Figure 5, a different technology-specific controller is created for each LSI, and controls the forwarding table of the LSI itself. In the current prototype we have opted to support OpenFlow for traffic steering; thus, as shown in the figure, each technology-dependent controller is actually an OpenFlow controller, and OpenFlow flowmod messages are used to set up the traffic steering rules.
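A minimal sketch of such a per-LSI controller is shown below; the flowmod construction is schematic (a real controller would go through an OpenFlow 1.3 library), and all names are hypothetical:

class OpenFlowController:
    # Per-LSI controller implementing the Table 2 primitives with OpenFlow.

    def __init__(self, lsi, of_connection):
        self.lsi = lsi                 # the LSI whose table we program
        self.conn = of_connection      # OpenFlow channel to the vSwitch

    def create_ts_rule(self, rule):
        # Translate an NF-FG traffic steering rule into a flowmod that
        # adds a flow entry (match -> actions) to the LSI's table.
        self.conn.send_flowmod(command="ADD",
                               priority=rule.priority,
                               match=rule.match,
                               actions=rule.actions)

    def delete_ts_rule(self, rule):
        # Remove the corresponding flow entry from the LSI's table.
        self.conn.send_flowmod(command="DELETE", match=rule.match)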

2.4.4 Traffic Steering

The two-level hierarchy of LSIs requires that, for each new NF-FG to be deployed, the network manager: (i) connects the new graph-LSI with LSI-0 through an adequate number of virtual links; and (ii) starting from the big-switch section of the NF-FG, originates two sets of traffic steering rules to be installed in LSI-0 and in the new graph-LSI, respectively.


MATCH \ ACTION: output to         Interface/VLAN endpoint         VNF port / GRE-tunnel endpoint
Interface/VLAN endpoint           only LSI-0 (no vlink needed)    LSI-0 and graph-LSI
VNF port / GRE-tunnel endpoint    LSI-0 and graph-LSI             only graph-LSI (no vlink needed)

Table 3: LSIs involved in implementing traffic steering rules with a specific match/action


According to Table 3, some traffic steering rules can be implemented on a single LSI, because both the match and the action involve ports/endpoints that are connected to the same LSI, while other rules must be split into a rule for LSI-0 and a rule for the graph-LSI. While the former do not require the use of virtual links, as they keep traffic local to one LSI, the latter need virtual links to transfer the packets between LSI-0 and the graph-LSI.

The number of virtual links used to connect LSI-0 with the graph-LSI is calculated as follows. A different virtual link is required for each endpoint/VNF port that appears in the action of rules involving both LSIs (Table 3), and such a virtual link will be used to transfer from one LSI to the other all traffic that must be sent on that specific endpoint/VNF port. Two rules that forward packets through the same VNF port originate a single virtual link.

This is shown in Algorithm 1, where lines #7 and #10 associate each virtual link with a particular endpoint/VNF port. In order to minimize the number of virtual links, the same virtual link can be used both to transfer (i) from LSI-0 to the graph-LSI all traffic towards a specific VNF port/GRE-tunnel endpoint attached to such an LSI and, (ii) from the graph-LSI to LSI-0, all traffic towards a specific interface or VLAN endpoint.

Algorithm 1 Virtual link creation
 1: procedure createVlinks(first_id, nffg, lsi0, graphLsi)
 2:   vlink_to_lsi0 ← vlink_to_graphlsi ← first_id
 3:   association ← ∅
 4:   for all r ∈ nffg.bigswitch do
 5:     if vlink_needed[r.match.port][r.action.out] and association[r.action.out] ∈ ∅ then
 6:       if r.action.out ∈ vnf_port or r.action.out ∈ gre_tunnel then
 7:         association[r.action.out] ← vlink_to_graphlsi
 8:         vlink_to_graphlsi ← vlink_to_graphlsi + 1
 9:       else if r.action.out ∈ interface_endpoint or r.action.out ∈ vlan_endpoint then
10:         association[r.action.out] ← vlink_to_lsi0
11:         vlink_to_lsi0 ← vlink_to_lsi0 + 1
12:       end if
13:     end if
14:   end for
15:   N ← max(vlink_to_lsi0, vlink_to_graphlsi) − first_id
16:   return connectLSIs(lsi0, graphLsi, N)

Algorithm 2 shows how the network manager properly splits the rules, after the involved VNF ports/endpoints have been associated with one of the virtual links just created.

As shown in lines #4-#12 of the pseudocode, rules whose output port is connected to the graph-LSI generate two LSI-specific rules as follows. The match of the LSI-0 rule corresponds to the match of the original rule (line #7), while the action differs from the original action only in the output port field: it forwards packets on the virtual link that transfers to the graph-LSI all packets towards the original port (line #8). Consequently (lines #11-#12), the rule for the graph-LSI just matches the proper virtual link and forwards traffic on the output port of the original rule.


Algorithm 2 Traffic steering rules creation
 1: procedure splitRules(association, nffg)
 2:   for all r ∈ nffg.bigswitch do
 3:     if vlink_needed[r.match.port][r.action.out] then
 4:       if r.action.out ∈ vnf_port or r.action.out ∈ gre_tunnel then
 5:         {The rule brings traffic from LSI-0 to the graph-LSI}
 6:         {Create the rule for LSI-0}
 7:         rule-LSI0.match ← r.match
 8:         rule-LSI0.action.out ← association[r.action.out]
 9:         rule-LSI0.action.other ← r.action.other
10:         {Create the rule for the graph-LSI}
11:         rule-graphLSI.match.port ← association[r.action.out]
12:         rule-graphLSI.action.out ← r.action.out
13:       else if r.action.out ∈ interface_endpoint or r.action.out ∈ vlan_endpoint then
14:         {The rule brings traffic from the graph-LSI to LSI-0}
15:         {Create the rule for LSI-0}
16:         rule-LSI0.match.port ← association[r.action.out]
17:         rule-LSI0.action.out ← r.action.out
18:         {Create the rule for the graph-LSI}
19:         rule-graphLSI.match ← r.match
20:         rule-graphLSI.action.out ← association[r.action.out]
21:         rule-graphLSI.action.other ← r.action.other
22:       end if
23:     end if
24:   end for

This behavior can be observed in the first and third rules of Figure 10, where port 1 of the NAT is associated with the virtual link vlink1.

Lines #13-#21 show the splitting of rules whose action sends packets on a port connected to LSI-0. Unlike in the previous case, now it is the match of the graph-LSI rule that corresponds to the match of the original rule (line #19), and it is the action of the rule on that LSI that equals the original action except for the output port field, which corresponds to the virtual link that brings to LSI-0 all the packets towards the original output port.

Lines #16-#17 of Algorithm 2 show that the rule created for LSI-0 just matches the proper virtual link. An example of this procedure is shown in the second row of Figure 10, where the virtual link used to transfer traffic to eth1 is the same one used to bring traffic to the port of the NAT.

Compute Manager

The compute manager interacts with the computing engines and handles the NF lifecycle (e.g., create, update, destroy a NF), including the operations needed to attach NF ports already created on the vSwitch (by the network manager) to the NF itself. The compute manager module can interact with different execution engines, and can thus manage NFs based on different technologies, through the compute interface defined in Table 4.

As shown in Figure 5, this abstraction is implemented by a set of drivers, each one in charge of a specific execution environment technology. Currently, our prototype supports the QEMU/KVM hypervisor, Docker containers, processes based on the DPDK framework [12], and native network functions. Multiple technologies are supported at the same time. For instance, the un-orchestrator can deploy a service including a first NF executed in a Docker container and a second NF running inside a VM.


Rule in the NF-FG: Match: port eth0, ip_src=10.0.0.1 / Action: push VLAN 0x25, output to port nat1
Rule on LSI-0: Match: port eth0, ip_src=10.0.0.1 / Action: push VLAN 0x25, output to vlink1
Rule on LSI-graph1: Match: port vlink1 / Action: output to port nat1

Rule in the NF-FG: Match: port nat2, ip_dst=10.0.0.2 / Action: pop VLAN, output to port eth1
Rule on LSI-graph1: Match: port nat2, ip_dst=10.0.0.2 / Action: pop VLAN, output to vlink1
Rule on LSI-0: Match: port vlink1 / Action: output to port eth1

Rule in the NF-FG: Match: port eth1 / Action: output to port nat1
Rule on LSI-0: Match: port eth1 / Action: output to vlink1
Rule on LSI-graph1: Match: port vlink1 / Action: output to port nat1

Figure 10: Example of splitting traffic steering rules.

Function | Description
vnf *createVNF(ports, other parameters) | Allocate the resources needed by a VNF; create a local copy of / download the VNF image
void destroyVNF(vnf) | Release the resources allocated to the VNF
void startVNF(vnf) | Start a VNF previously created
void stopVNF(vnf) | Stop a VNF, without deallocating resources
void updateVNF(vnf, ...) | Update a running VNF (e.g., remove/add network interfaces)
void pause(vnf) | Suspend the execution of the VNF (e.g., for a possible migration)

Table 4: Compute interface
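As an illustration of this abstraction, the following minimal sketch (in Python, with hypothetical class and method names loosely mirroring Table 4; the actual drivers are not reproduced here) shows the shape that a per-technology driver takes:

    from abc import ABC, abstractmethod

    class ComputeDriver(ABC):
        # Hypothetical base class: one concrete subclass per execution
        # environment (KVM, Docker, DPDK process, NNF scripts).

        @abstractmethod
        def create_vnf(self, ports, **params):
            """Allocate resources; copy/download the VNF image; return a handle."""

        @abstractmethod
        def destroy_vnf(self, vnf):
            """Release the resources allocated to the VNF."""

        @abstractmethod
        def start_vnf(self, vnf):
            """Start a VNF previously created."""

        @abstractmethod
        def stop_vnf(self, vnf):
            """Stop a VNF without deallocating its resources."""

        @abstractmethod
        def update_vnf(self, vnf, **changes):
            """Update a running VNF (e.g., add/remove network interfaces)."""

        @abstractmethod
        def pause_vnf(self, vnf):
            """Suspend execution (e.g., in view of a migration)."""

The compute manager can then hold one driver instance per supported technology and dispatch each NF to the driver named in its template.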


Native Network Functions Support

The integration of NNFs in an NFV platform using UNs requires identifying the differences between VNFs and NNFs, and deriving a set of constraints for the un-orchestrator when it selects the implementation of the required function.

One difference is that some NNFs do not support multiple concurrent instances, while in principle it is always possible to instantiate multiple parallel instances of the same, e.g., VM image if sufficient resources are available on the infrastructure node. In order to be reused in the same or different service graphs at the same time, the NNF must be implemented to support multi-tenancy, and this must be declared in the associated template.

Another difference between NNFs and VNFs is that each NNF can be associated, through the template, with a number of dependencies that might refer either to hardware devices available on the UN or to software packages (e.g., executables, binaries) that are already installed and that are required by the NNF to operate. To be more precise, VNFs are also associated with one dependency, e.g., VMs require the KVM hypervisor while containers require the Docker engine. However, while a VNF does not require anything more than the proper execution engine, NNFs may require particular hardware and software components.

For each NF in the NF-FG, the un-orchestrator decides whether to deploy it as a VNF or an NNF based on its knowledge of the node capability set, the available NNFs and their characteristics (e.g., whether they are multi-tenant or not), their status (e.g., already used to implement another NF-FG) and their dependencies. When an NNF should be used, the compute manager handles such a function through the proper driver, which implements the interface given in Table 4.

In our prototype realization, for the case of NNFs this driver runs a collection of bash scripts that control the basic lifecycle (create, update, etc.) of the NNF itself, and that are specific to such a function. The NNF driver starts the NNF in a new network namespace, to provide a basic form of isolation, and configures the NNF with a predefined configuration script. Support for a dynamic configuration mechanism able to translate a generic NF configuration provided by the un-orchestrator into commands appropriate for the specific NNF is part of future work.

Launching an NNF, hence a script running on the bare hardware, offers fewer protection guarantees than starting software in a VM or in a Docker container, which can leverage the additional protection shield provided by the hypervisor or the Docker execution engine. For instance, little protection exists to limit the resources used by NNFs, e.g., in terms of CPU/memory consumption or the number of occupied CPU cores. Although the impact of the above problems could be limited by turning on additional Linux mechanisms such as cgroups, this complicates the solution to the point where other alternatives may be more appealing, such as replacing the NNF with a Docker-based implementation. In a broader sense, right now no protection exists that prevents a VNF, which is expected to provide a given service (e.g., firewall), from behaving differently, e.g., launching an attack toward a remote host; the current solution is simply to trust the creator of the application or the entity (e.g., app marketplace owner) that sells it. Therefore, although we acknowledge that the problem of determining whether a NF is malicious is emphasized in the case of NNFs because of their inferior degree of isolation, we feel that the problem is rather general and requires a more generic solution that guarantees, a priori, the trustworthiness of the VNF.

The above issues may be partially solved by techniques defined in WP4; for instance, the VeriGraph tool ([29], section 2) checks whether the service graph satisfies certain properties by applying formal verification techniques to models of the VNFs themselves. However, this technique assumes that the model represents the


real behavior of the VNF, hence it is not able to guarantee that the VNF code is compliant with the model.

A possible direction for future investigation in this respect could consist in integrating remote attestation

techniques [55] in our execution environment, exploiting an external machine to verify the correctness of

the running software.

VNF resolver

The VNF resolver is the component of the un-orchestrator that interacts with the VNF repository in order to select the VNF images to be instantiated. The VNF resolver is used in cases where the NF-FG indicates only some features of the required functions (e.g., a firewall with 3 ports). In this case, the VNF resolver selects the best image through the following steps. First, the VNF resolver requests from the VNF repository the templates of all VNFs implementing the function (e.g., firewall). Subsequently, the VNF resolver selects the best VNF that, according to the template, satisfies the constraints/attributes associated with the function in the NF-FG (e.g., 3 ports and support for 1 Gb/s), matches an execution environment supported by the UN, and requires resource levels (e.g., RAM, CPU) available on the UN.

Notably, the selected VNF may consist of a single image, or it can be a new service graph composed of a number of other functions arbitrarily connected. In other words, the template may describe the NF as another NF-FG. In this case, the VNF resolver recursively repeats the operations described above for each function that is part of the new sub-NF-FG, until all required VNF images have been selected.
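To make the selection steps concrete, here is a minimal sketch (hypothetical Python attribute and method names; the real repository API is not shown here) of the filtering and recursion just described:

    def resolve_vnf(repo, node, function, constraints):
        # Step 1: all templates implementing the function (e.g., "firewall").
        templates = repo.get_templates(function)
        # Step 2: keep templates satisfying the NF-FG constraints, running in
        # an execution environment the UN supports, and fitting its resources.
        ok = [t for t in templates
              if t.ports >= constraints.get("ports", 0)
              and t.throughput >= constraints.get("throughput", 0)
              and t.environment in node.environments
              and t.cpu <= node.free_cpu and t.memory <= node.free_memory]
        if not ok:
            raise LookupError("no suitable image for " + function)
        best = min(ok, key=lambda t: (t.cpu, t.memory))  # e.g., cheapest fit
        # Step 3: a template may itself describe a sub-NF-FG; recurse until
        # all required images have been selected.
        if best.is_graph:
            return [resolve_vnf(repo, node, f.name, f.constraints)
                    for f in best.functions]
        return best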

Internal Bus

As shown in Figure 5, the UN includes an internal message bus implemented through DoubleDecker (DD). Although in the picture only the un-orchestrator and the monitoring manager plugin (Section 2.4.9) are connected to such a bus, NFs may be connected to the DD bus as well, for instance in order to receive monitoring alarms or configuration parameters.

In the current version of the prototype, the DoubleDecker broker and the other modules attached to the DoubleDecker network are connected to each other by means of the docker0 bridge, which is created by the Docker daemon when it starts. Hence, only VNFs running in Docker containers can currently be connected to the bus; a more generic approach that supports all NF types is left as future work.

Monitoring Manager Plugin

The monitoring manager plugin (MMP) (developed in WP4 and described in [29], section 3.3.1) is in charge of managing the modules that (i) measure some metrics of the deployed service (e.g., CPU/memory consumed by VNFs), and (ii) generate alarms when specific events occur or thresholds are crossed. The MMP can be configured to measure specific metrics through an instruction string specified in the NF-FG and written according to the MEASURE monitoring language (described in [29], section 3).

After the deployment of the NF-FG, the MMP instantiates and configures the proper monitoring functions (MFs) that actually monitor the required metrics and generate alarms on the UN internal bus, such as Google cAdvisor [13] and Ramon (described in [29], section 4).

The MMP, in addition to starting and configuring the proper MFs, receives alarms through DD, aggregates the received information as required by the MEASURE instructions, and propagates them again on the DD


bus. This way, the aggregated information can reach the interested NFs. For instance, a NF may exploit the

monitoring results to require an update of the NF-FG, so that it can properly react to the received event.

UN Data Plane Design Considerations

This section describes the main features of the data plane of the Universal Node, namely: (i) the possibility to directly interconnect VMs, leaving the virtual switch out of the picture, in case the service to be deployed satisfies some constraints; and (ii) the Ericsson Research Flow Switch, which targets high-speed packet networks.

Before detailing these aspects of the UN data plane, it is worth mentioning that such features, although implemented in the UN, are orthogonal to the UN itself. In fact, they are almost entirely compartmentalized and implemented in the virtual switch, and require little support from the orchestrator.

VNF Communication

Figure 11 illustrates a service graph and one of its possible implementations on the UN: all NFs are executed inside virtual machines interconnected through the vSwitch. According to the figure, the service graph can specify two types of connections between network functions: point-to-point (p-2-p in the following) and point-to-multipoint. While the latter actually requires the vSwitch in order to classify and send traffic to the proper next VNF, the former could be implemented with a direct communication path, hence taking the vSwitch out of that portion of the data plane. This may result in higher throughput and lower latency, as well as in lower resource consumption thanks to the CPU cycles saved by avoiding a further pass through the vSwitch.

Figure 11: Traffic crossing network functions: (a) the service graph; (b) its implementation on the UN.

We extended OvS, and more specifically the DPDK-based version of OvS, so that it can transparently and dynamically accelerate packet exchanges between two VMs by creating a direct connection between them in case of p-2-p links. Transparency in this context refers to the possibility for an application to exploit direct communication without any knowledge of this optimization, and for an OpenFlow controller to attach to the OvS as usual. Dynamicity refers to the capability to either create a direct VM-to-VM channel or return to a traditional VM-to-vSwitch-to-VM implementation on the fly, based on the runtime analysis of


the graph that is being instantiated or modified.

Our framework to create direct connections between VNFs has been designed to operate with VNFs implemented as DPDK-based applications executed inside VMs and connected to the OvS forwarding engine through dpdkr ports. The forwarding engine handles packets according to the content of its forwarding table, which can be configured with OpenFlow flowmods. dpdkr ports are implemented using shared memory and are exposed to the VM through ivshmem devices; moreover, applications access dpdkr ports using a poll mode driver (PMD).

As shown in Figure 12, we modified the dpdkr port to include a standard channel connected to the OvS forwarding engine, and an optional bypass channel that is directly connected to another VM. The PMD has been modified too, so that the same instance can handle both channels and expose them as a single dpdkr port to applications, which are not aware of the actual implementation of that port. Moreover, a new p-2-p link detector module has been added to OvS, which analyses each flowmod received by the vSwitch in order to dynamically detect the creation/destruction of a p-2-p link between two dpdkr ports.
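As an illustration, a flowmod-level detection could look like the following sketch (hypothetical rule attributes; the actual OvS extension is not reproduced here): a link qualifies as p-2-p when a rule matches only on the ingress dpdkr port, its only action is an output to another dpdkr port, and no other rule involves either port.

    def detect_p2p_links(flow_table, dpdkr_ports):
        # Count how many rules touch each dpdkr port (ingress or egress).
        uses = {p: 0 for p in dpdkr_ports}
        for r in flow_table:
            for p in (r.in_port, r.out_port):
                if p in uses:
                    uses[p] += 1
        links = set()
        for r in flow_table:
            if (r.in_port in uses and r.out_port in uses
                    and r.match_is_only_in_port and not r.extra_actions
                    and uses[r.in_port] == 1 and uses[r.out_port] == 1):
                links.add((r.in_port, r.out_port))
        return links

Running this after each flowmod makes it possible to notice both the appearance and the disappearance of a p-2-p link, triggering the setup or teardown of the corresponding bypass channel.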


Figure 12: Implementing direct channels between VMs in OvS.

When the VM is created (e.g., by the un-orchestrator), it is connected to dpdkr ports that have only the standard channel. When the vSwitch detects the creation of a p-2-p link between two VMs, it creates a new pair of dpdkr bypass channels mapped on the same piece of memory, shared by both the communicating VMs. This way, the two VMs are directly connected and able to exchange packets without the intervention of the OvS forwarding engine.

The two new bypass channels must be plugged into the proper VMs, and assigned to the proper PMD instance. Since OvS does not know which VM is attached to a specific port (it just knows ports and the rules used to forward packets among them), for these operations the vSwitch has to rely on the un-orchestrator, which receives requests from OvS to: (i) plug the bypass channel (as an ivshmem device) into the VM by interacting with QEMU; and (ii) configure the PMD instance to send/receive packets through the bypass channel, by means of a control channel based on a virtio-serial device.

Notably, even for a dpdkr port involved in a p-2-p link, the PMD can still receive packets from the standard channel. In fact, it may happen that the OpenFlow controller sends to OvS an OpenFlow packet-out message containing a packet that must be sent through such a port; in this case, OvS uses the standard channel to provide the packet to the VM.

Finally, when the p-2-p link detector recognizes that a p-2-p link does not exist anymore, the bypass


channel is removed from the involved VMs, and the proper PMD instances are configured to use only the standard channel.

To maintain compatibility with external entities such as the OpenFlow controller, OvS exposes the two (standard and bypass) channels as a single (standard) dpdkr port, so that such entities can continue to issue commands involving dpdkr ports as they usually do (e.g., get statistics, turn them on/off), without noticing any change in their actual implementation.

In order to export statistics related to ports and flows implementing a p-2-p link, the PMD has been extended so that, each time a packet is sent through the bypass channel, it increases the counters associated with that OpenFlow rule and port, which are stored in a shared memory area. When OvS needs to export statistics, it reads the proper values from that shared memory. The vSwitch is in fact not able to count statistics related to p-2-p links by itself, as it is not involved in moving packets flowing through these connections.

Ericsson Research Flow Switch

The Ericsson Research Flow Switch (ERFS) [44]1 is an optimized OpenFlow pipeline built directly to support the OpenFlow 1.3 specification, without any legacy support for earlier versions. High performance is the main focus, and the following areas were selected for optimization:

• lookup / classification: in ERFS the lookup algorithm is selected optimally on a per-table basis. This means that, based on the rules incoming to a given table, a real-time decision takes place that tries to find the best possible classification algorithm for that table.

• tunneling: tunnels are supported inside the pipeline with experimenter actions. This is more efficient than port- or VNF-based tunnel handling, especially with the just-in-time (JIT) code generator (see next bullet).

• JIT: ERFS uses just-in-time linked code for executing parsing, (some) lookup algorithms and the actions that are to be executed in the given table or group. This means that the packet processing code is generated dynamically from code templates for the most widely used OpenFlow actions, regardless of whether they are mandatory, optional or experimental.

ERFS supports the mandatory OF v1.3 structures and commands, some optional and also some experimental features and extensions. Some of the most important OF features and extensions are listed below:

• stateful and stateless load balancing with select groups; behavior and key fields are configurable

• rate limiting with meters (1 band, drop-only)

• matches and set-field on all fields, including metadata and tunnel-id

• built-in tunneling support: QinQ, GRE, VxLAN, MPLS, GTP (push/pop)

• NFV support with high-performance, dynamically created virtual ports: ivshmem, vhost-user, kni

• support for logical switch instances (LSIs) and zero-overhead inter-switch connections; each LSI has its own OpenFlow port, allowing multiple controllers

1 The text of this section originates from [34], and is reported in this deliverable for completeness.


• defragmentation handling with special virtual port

• 3rd party plugin (shared library) support for experimental actions, instructions, and messages plus

context and memory management support for these 3rd party plugins

VNF Repository

The VNF repository is the module that contains the template and the image for a number of network functions. The VNF template describes a specific NF image in terms of functionality implemented (e.g., firewall, NAT), amount of physical resources required on the node in order to execute such an image (e.g., CPU, memory), required execution environment (e.g., KVM hypervisor, Docker engine, etc.), number of virtual interfaces and associated technology, and more. The VNF image changes instead according to the technology implementing the NF. For instance, it is the VM disk in case of virtual machines, a set of scripts in case of NNFs, and so on.

As described in Section 5, the un-orchestrator can interact with one or more VNF repositories, which can be installed either on the UN or on a remote server, in order to retrieve the template of suitable virtual and/or native NFs implementing a specific function. Particularly, the un-orchestrator contacts the VNF repository both when it has to export the functional capabilities of the UN, and when the NF-FG requires the instantiation of a NF without any template associated with it.

Supported Platforms

The portability of the UN has been proved by installing the software on multiple hardware platforms with very different characteristics. For instance, a domestic home gateway (Netgear R6300v2, with 800 MHz dual-core ARM Cortex A9 CPU, 128 MB flash and 256 MB RAM, 4 GbE LAN ports, IEEE 802.11 b/g/n 2.4 GHz and IEEE 802.11 a/n/ac 5.0 GHz radios, and 1 GbE WAN port) has been demonstrated to execute the UN compiled for the OpenWrt architecture. Another example is the use of the un-orchestrator on CarOS, an embedded Linux distribution targeting carrier networks, running on a Banana Pi R1, a single-board computer with an A20 ARM Cortex-A7 dual-core CPU, 1 GB RAM and 5 GbE ports, which are connected to a configurable switch chip [34].

Given the limited hardware capabilities of these boxes, at the time of this writing the un-orchestrator was able to launch only NNFs, with the standard OvS used as the vSwitch. In addition, service access points such as tunnels (e.g., GRE) and VLANs were supported, hence making it possible to connect the UN to external domains and to create complex services requiring the stitching of multiple sub-graphs spanning multiple domains.

Two professional CPEs were used as well. First, we targeted the Freescale Hawkeye HK-0910 (Freescale QorIQ T1040, 1.2 GHz (four e5500 cores), 64 MB NOR Flash, 2 GB RAM DDR3L-1600), featuring also IPsec and L2 switch hardware acceleration. The software environment was based on the Linux Yocto project, which uses recipes to assemble the required packages and create the software image that will be executed on the hardware platform. The overall software setup was very similar to the previous boxes, hence only NNFs were enabled, although the platform should also support virtualization. In particular, the IPsec NNF was able to control the hardware accelerator, and was thus able to provide IPsec transport at wire speed.

The second professional CPE was a Tiesse Imola 5, with a hardware platform based on an Ikanos single-core CPU (Ikanos Fusiv Core Vx185, single-core MIPS 34Kc V5.4 CPU at 500 MHz, 256 MB RAM, 256 MB flash memory, with xDSL acceleration) supporting OpenWrt for the running software image. However, we had to


Figure 13: ETSI NFV architecture

use an old version of the Linux kernel (3.10.49) because it was the latest version supported by the Ikanos drivers needed to control the xDSL interface. This prevented us from creating GRE tunnels in Open vSwitch due to a known kernel version incompatibility; in this case the UN was able to create VLANs only toward external domains.

In addition to the aforementioned limited-resource boxes, the UN was tested on several standard Intel servers, ranging from single-CPU i7 machines to dual-processor Xeon platforms, using the standard compilation chain documented in the UN repository and based on Ubuntu 14.04 LTS. All features, including the several supported vSwitches, were turned on and tested.

UN and the ETSI NFV Architecture

The European Telecommunications Standards Institute (ETSI) has proposed a reference model for the NFV architecture [28] (Figure 13) that defines the main functional blocks, associated reference points, and descriptors (e.g., service descriptors, VNF descriptors, etc.). The main blocks include an orchestrator (NFVO) that handles the lifecycle of the services, a (set of) VNF manager(s) (VNFM) in charge of managing one or more VNFs (i.e., start, stop, configuration, scale in, scale out), and a Virtual Infrastructure Manager (VIM) that actually implements the commands triggered by the NFVO and/or the VNFMs on the physical infrastructure. Finally, the NFV infrastructure (NFVI) includes both physical and virtual (hypervisor, vSwitch, etc.) resources and actually executes the service.

At first look, the UN is an ETSI infrastructure node, as it includes both a vSwitch and a set of execution engines with its own VIM that can be driven by an external NFVO sitting on top of many UN nodes. However, viewed in this way, much of our proposal's potential is left untapped by such an external orchestrator. In fact,



Figure 14: Mapping UN components on the ETSI NFV architecture

as shown in Figure 14, which maps the functional blocks and reference points of the ETSI NFV architecture to the UN components, our node also includes an NFVO and a number of VNFMs.

Notably, the NFVO inside the UN does not preclude the existence of one (or more) orchestrators on top of many UN nodes, although this orchestrator hierarchy is not explicitly defined in the ETSI architecture. The NFVO inside the UN can benefit the entire telecom operator architecture, as it can further optimize service deployment, e.g., by selecting the best available implementation for a NF, and at the same time provide more scalability, since an external orchestrator does not need to know intricate details, e.g., the VNF images actually implementing a NF, which are instead only known by the un-orchestrator.

While the un-orchestrator implements the NFVO, the VIM, and the NFVI, the VNFMs are not actually part of the UN architecture. Such VNFMs are in fact VNFs themselves that are actually part of the NF-FG; this is the reason why the dashed box in the picture is extended to include both the VNFs and the VNFMs. It is worth mentioning that, from the un-orchestrator point of view, there is no difference between data plane VNFs and VNFs implementing the VNFM functionality, as both types are run in some execution environment and connected to the vSwitch with a number of vNICs.

Finally, the NF-FG represents an implementation of the ETSI network service descriptor (NSD), as it describes the service as a whole, i.e., both the compute (NFs) and network (links) parts.


Figure 15: System view of the Elastic Router prototype

UN-based Elastic Router

The elastic router use case (described in [37], section 2.5.1, and in [48]) demonstrates the deployment process of an NF-FG on the UN, in conjunction with monitoring and troubleshooting features. Marked with yellow boxes in Figure 15, the deployed NF-FG consists of an elastic router control app and a number of elastic router OvS instances. During the service lifetime, monitoring tools trigger the elastic router control app to scale out and scale in the elastic router OvS instances (i.e., OvS VNFs), while troubleshooting tools enable efficient debugging of the elastic router.

The remainder of this section details the parts of the UN that have been exploited, or even added, to the architecture described in Section 2.4 in order to support the elastic router use case.

As shown in Figure 15, ESCAPE [25, 54] realizes the service layer and the global orchestration layer, where the initial NF-FG (consisting of the elastic router control app and one elastic router OvS) is generated. ESCAPE sends the NF-FG to the UN according to the XML-based format defined in WP3. A virtualizer module has been added on top of the un-orchestrator: first, it receives the NF-FG commands through its northbound interface, based on the virtualizer library defined in UNIFY ([21], section 2) that implements the official NF-FG specification. Second, the virtualizer converts those commands into the NF-FG formalism


natively supported by the un-orchestrator (Section 2.4.1). Finally, through its southbound API, the virtualizer sends the equivalent commands to the un-orchestrator. It is worth pointing out that the virtualizer northbound interface implements both the Sl-Or and the Cf-Or reference points of the UNIFY architecture, as it allows the un-orchestrator to receive commands both from ESCAPE and from the elastic router control app. The latter in fact needs to interact with the un-orchestrator in order to scale up/down the number of elastic router OvS instances deployed.

All components instantiated on the UN to implement the elastic router use case, i.e., the VNFs (elastic router control app and elastic router OvS) and the monitoring functions (the aggregator, Ramon, and cAdvisor), are deployed as Docker containers and connected to the DD bus by means of the docker0 bridge created by the Docker daemon (Section 2.4.8). Specifically, in the elastic router use case, the following MFs can be deployed by the monitoring manager according to the metrics indicated in the MEASURE string: Google cAdvisor [13] and Ramon (described in [29], section 4). The former collects resource usage measurements (e.g., CPU and memory) of individual VNFs, while the latter implements scalable congestion detection based on the analysis of the traffic rate distribution on individual links at different time scales.

Related Work

Based on the ETSI architecture, both industry and academia have introduced several prototypes and proofs of concept (PoCs) to deploy network services and functions. However, while the UN defines the software architecture of an infrastructure node that also includes orchestration and VIM functionalities, in addition to a number of VNFMs that can be instantiated as VNFs and are part of a service graph (see Section 2.8), most of the earlier work defines the architecture of the NFVO that sits on top of many infrastructure nodes and deploys network services through OpenStack [16]. In these cases, OpenStack acts as a VIM, exploited by the NFVO to properly instantiate the service on the OpenStack physical infrastructure.

Proposals employing OpenStack as a VIM and NFVI include OpenStack Tacker [17], which implements an NFVO and a generic VNFM and uses the Topology and Orchestration Specification for Cloud Applications (TOSCA) [18], an implementation of the ETSI descriptors, as the formalism to describe the several aspects of the service to be deployed. Open Baton [14] defines an NFVO and a generic VNFM and can be installed on top of existing cloud infrastructures like OpenStack. OpenMANO [15] implements instead its own VIM (in addition to an NFVO sitting on top of many infrastructure nodes), although it supports OpenStack as well.

It is worth noting that these proposals are orthogonal to our work on the Universal Node, as they mainly

operate on top of the infrastructure nodes and can be extended to interact with the un-orchestrator

instead of the OpenStack environment.

Cloud4NFV [52, 53] is instead a platform for managing network services in a cloud environment, which covers both the NFVO and the VIM defined in the ETSI architecture. The latter includes both a data center controller (OpenStack) and a WAN controller (OpenDaylight) to interconnect parts of the service deployed in different data centers. OpenStack and OpenDaylight are used as VIM by vConductor [51], while SONATA [27] can support different VIMs by means of adapters and recursion at the orchestration layer. Recursion is of course considered in UNIFY, in which different orchestrators can sit on top of different infrastructures [25, 54], [24].

Compared to the UN, OpenStack presents several limitations when used as VIM and NFVI. For instance, OpenStack does not have the concept of a network service, which means that each VNF is just seen as a VM (or container) to be executed independently from the others on one of the servers forming an OpenStack cluster. Hence, VMs are allocated to servers/CPU cores without taking into consideration the connections between VNFs in the service to be deployed, which may result in poor performance for the whole service. An attempt to introduce the concept of service in OpenStack is described in [41], which also highlights how hard so-called network-aware scheduling is to implement in such an environment. Moreover, OpenStack is not suitable for resource-constrained environments such as CPEs. In fact, the deployment of NFs on a CPE would require such a device to be a Nova compute node, which requires a lot of resources and does not support the deployment of NNFs. Instead, OpenStack can be used to implement a virtual CPE such as in [26], in which the CPE functionalities are implemented as VNFs actually deployed in an OpenStack-based data center.

The literature on NFV also includes some proposals for software architectures of network infrastructure nodes that, similarly to the UN, can run network functions. NetVM [35] is a platform designed to efficiently transfer packets between NFs running inside virtual machines. NetVM defines its own virtual switch based on the DPDK framework, which can transfer packets with zero-copy between trusted VMs, while a copy is required to transfer packets between untrusted VMs. It is worth mentioning that existing network applications are not supported by NetVM, as they must use a library that hides the communication with the NetVM framework. The NetVM architecture includes a NetVM manager that can talk with an overarching orchestrator by means of a message-based protocol similar to OpenFlow, although no further information is provided in [35]. OpenNetVM [56] is instead a platform built on top of NetVM, which executes NFs within Docker containers and provides a high-level abstraction to compose NFs in service chains, control the packet flows, and manage NF resources. Although OpenNetVM presents some similarities with the UN, it is still at an early stage and little information about it is available.

Further, [42] presents a platform for high-performance NFV. Particularly, it uses the VALE [47] vSwitch to provide packets to ClickOS virtual machines, i.e., Xen-based [20] virtual machines executing a Click [40] program on top of a minimal operating system. As this node software architecture does not define a high-level API that can be exploited by an orchestrator to instantiate VNFs, the platform is just an implementation of the NFVI.

nf.io [22] is a platform that uses the Linux file system as an interface to express NFV management and orchestration operations, acting as an API towards the NFVO. Particularly, it defines the semantics of files and directory structures to perform operations such as VNF deployment, configuration, chaining and monitoring. Similarly to the UN, nf.io can execute NFs as processes on a physical machine, in VMs, and in Docker and LXC containers. Forwarding rules can be configured either with the iptables Linux facility or with OpenFlow for traffic paths implemented using OvS.

As the UN is well suited to run also in resource-constrained environments such as CPEs, the tethered Linux CPE [49] is also considered related work. However, such a platform has the limitation of being able to run only NFs implemented as eBPF programs loaded into the Linux kernel.

Universal Node Use Cases

Elastic Router

The Elastic Router use case is described in detail in Deliverable D3.3 [37]. We presented in Section 2.9 above the support required by this use case from the un-orchestrator; see also Figure 15.


Evidently, the Elastic Router use case focuses primarily on the control and management planes. As such, we did not proceed with separate measurements, since the Elastic Router use case is identical from a data plane perspective to the "L3 routing" case described in Deliverable D5.6 [34], section 3.1.3.

Broadband Network Gateway

The UN Broadband Network Gateway (BNG) use case is quite complex in its entirety. To address it comprehensively, while still being able to evaluate UN performance in a tractable manner, we created different setups, as explained below.

QinQ to GRE Tunneling and Routing

We defined a simplified BNG pipeline which terminates QinQ tunnels on the user side and converts them into GRE tunnels, which are terminated in a carrier-grade NAT network element. The network setup is illustrated in Figure 16, while the internal hybrid pipeline is depicted in Figure 17.

Figure 16: Broadband Network Gateway with external NAT: end-to-end scenario


Figure 17: Broadband Network Gateway with external NAT: pipeline setup

We will not delve into further details in this deliverable, as this setup was already described in Deliverable D5.4 [?], Section 4.5. Here we focus instead on the performance evaluation results reported in Section 5.

Built-in NAT Functionality

In this case the BNG pipeline contains the NAT as a VNF. The full BNG pipeline can be seen in Figure 18. After a packet is received, its header is parsed. The relevant fields are the source and destination MAC addresses, the Ethertype, the destination IP and the TTL, 19 bytes in total.

In the uplink case, the Ethertype is checked to determine whether the packet is an ARP packet or not. If the packet is IPv4, a NAT function changes its addresses accordingly, depending on whether it belongs to uplink or downlink traffic. After Network Address Translation, simple routing is performed on the packet: the next hop is determined through a longest prefix match (LPM) lookup, and after changing the L2 addresses and the TTL, the packet is transmitted.

In the case of downlink traffic, before the route lookup is performed, for example, traffic policing (an OpenFlow meter) can be applied, which is also based on the IP address of the user.


Figure 18: Broadband Network Gateway application with built-in NAT
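For illustration, the uplink path just described can be summarized by the following sketch (hypothetical helper objects; downlink additionally applies policing before the route lookup):

    ETH_ARP, ETH_IPV4 = 0x0806, 0x0800

    def bng_uplink(hdr, nat, routes, handle_arp, transmit):
        # `hdr` exposes the parsed 19 bytes of header fields; `nat` and
        # `routes` wrap the NAT and LPM lookup tables (illustrative objects).
        if hdr.ethertype == ETH_ARP:            # ARP leaves the fast path
            return handle_arp(hdr)
        if hdr.ethertype != ETH_IPV4:           # only IPv4 continues
            return None
        nat.translate_uplink(hdr)               # rewrite src IP/port + checksum
        nh = routes.lpm_lookup(hdr.dst_ip)      # longest prefix match
        hdr.ttl -= 1                            # routing: TTL and L2 rewrite
        hdr.src_mac, hdr.dst_mac = nh.src_mac, nh.dst_mac
        return transmit(hdr, nh.port)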

UN Modelling

Modelling results are used to compare the performance of real-life implementations to the theoretical maximum performance on different hardware platforms. A performance model of a basic switching scenario on an Intel Xeon CPU was presented in the paper Cross-Platform Estimation of Network Function Performance [50]. The basic approach was to determine Elementary Operations (EOs), which can be used as building blocks for specific scenarios. The cost of these EOs can be determined on different architectures, and therefore the performance of VNFs modelled by the operations can be estimated. This section is the continuation of that previous work. First, some operations are reconsidered and two new EOs are presented. Then Network Address Translation, the Broadband Network Gateway, and the IPSec ESP protocol are modelled as network functions. Finally, the elementary operations as well as the VNF models are mapped to the Intel Xeon, Intel Atom, and Cavium ThunderX CN88XX architectures, and the theoretical performance of these network functions is estimated.

Considerations for Elementary Operations

Data transfer between memory and the NIC

The previous mem_I/O elementary operation assumed that the CPU has the main role in transferring the data from the NIC to the central memory. In this case, even the simplest packet processing operations would take a large number of core cycles. This is the more traditional way of moving data, and is usually done by the kernel with interrupt handling. However, DMA-like mechanisms such as the Direct Cache Access used by DPDK enable a more efficient way of moving the data from and to the NIC. This way the CPU does not have to care about how the packets are moved to memory, and can therefore focus only on the packet processing. DPDK's DCA mechanism moves the packets directly to the L3 cache. The performance models should also consider this case, since this functionality is available on consumer hardware and provides high performance. Different architectures such as NPUs can also support DMA, which helps the processing units to focus only on the packet processing. Considering different I/O modes makes the model modular, and when estimating performance on specific hardware, the approach that best fits the given architecture can be used.

Note that DMA-like operations may use shared resources, which can become a bottleneck (such as the L3 cache). On traditional systems this is usually not a problem, but in highly parallel environments (tens or hundreds of cores) this also has to be examined.

Computing checksums

The previous checksum EO described a correct way to calculate a checksum on arbitrarily sized data. However, when it comes to packet processing, computing the checksum over the whole packet can be avoided in most cases. Layer 4 checksums (in TCP, and optionally in UDP) are calculated over the whole payload and some L3/L4 header fields, but when only a few bytes change in the relevant data, the checksum can be computed incrementally from the previous correct value. The incremental update is described in RFCs 1141 and 1624. This way, the checksum computation can be very fast compared to the case when it is recalculated over the whole packet, especially with large packet sizes. A new EO can be defined for incremental checksum computation: inc_cksum(V16n, V32n), where the parameters are the number of changed 16-bit and 32-bit values, respectively.
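To make the incremental update concrete, the sketch below implements the RFC 1624 formula HC' = ~(~HC + ~m + m') in Python, using 16-bit one's-complement arithmetic; a changed 32-bit field (e.g., an IP address) is handled as two 16-bit updates, which is what the V16n/V32n parameters count:

    def add1c(a, b):
        # 16-bit one's-complement addition with end-around carry
        s = a + b
        return (s & 0xFFFF) + (s >> 16)

    def inc_cksum_update(old_cksum, old_word, new_word):
        # RFC 1624: HC' = ~(~HC + ~m + m'), all values are 16-bit words
        hc = ~old_cksum & 0xFFFF
        hc = add1c(hc, ~old_word & 0xFFFF)
        hc = add1c(hc, new_word)
        return ~hc & 0xFFFF

The cost of this update is a handful of arithmetic instructions per changed word, independent of the packet length, which is why it is modelled as its own EO.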

Hash lookup

Although the hash algorithm presented previously is efficient, several CPUs implement such algorithms in hardware, which makes execution much faster. This functionality is available in most new processing units; for example, hardware CRC32 hashing was introduced on Intel CPUs in 2008, with the Nehalem microarchitecture [36]. CRC hashing is also efficient when implemented in software, and is used in a wide range of applications. The output of the CRC needs to be converted to a suitable index into the array which stores the values, otherwise the array size would be too large. In practice, simple methods can be used to convert the output of the CRC to a proper index, such as using the last N bits of the CRC if the array size is a power of 2. If there is a collision, a perfect hash can still be created with a larger table, but this is a computational overhead only when a new entry is added. The model should also consider these solutions and, when it comes to mapping EOs to hardware, use the best available solution for implementing hash lookups.
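For instance, a lookup-table index can be derived from a CRC32 of the key by keeping its last N bits, as described above; a small sketch using Python's zlib (the key layout and table size are illustrative):

    import zlib

    TABLE_BITS = 20                      # table of 2^20 slots (power of 2)
    TABLE_MASK = (1 << TABLE_BITS) - 1

    def nat_index(src_ip, src_port):
        # Hash the 6-byte flow key with CRC32 and keep the last N bits,
        # mirroring the power-of-2 indexing trick described above.
        key = src_ip.to_bytes(4, "big") + src_port.to_bytes(2, "big")
        return zlib.crc32(key) & TABLE_MASK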

Longest Prefix Match lookup

In this section a new Elementary Operation is presented for the Longest Prefix Match (LPM) lookup. One of the most common operations in network functions is routing. For a simple router, it is essential that, for a given destination IP address, the lookup for the next hop is performed as fast as possible. Even complicated use cases such as gateways with many different functionalities almost always contain routing in their processing pipeline.

IP address matching in routing is performed by a longest prefix match (LPM) lookup. This means that a


forwarding rule should be applied to a packet when the packet belongs to the rule's network and the network prefix is the longest among all such rules. This can be achieved by defining the priority of the rules by the length of their network prefix, and applying the first matching rule to a packet based on the destination IP.

There are several different algorithms for finding the first matching rule. Intel's Data Plane Development Kit (DPDK [12]) currently uses the DIR-24-8 algorithm for performing LPM lookups [33]. One of the main advantages of this algorithm is that in most cases it is able to find the first matching rule with only one memory access. This speed, however, comes at the cost of space: because of the redundant storage of the rules, this method can be quite memory-intensive, depending on the number, and more importantly on the prefix length, of the routing rules. Still, the very fast lookup this algorithm provides heavily outweighs this space constraint.
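A minimal sketch of the DIR-24-8 lookup structure (illustrative, not the DPDK implementation): tbl24 is indexed by the top 24 bits of the address and usually resolves the next hop directly; entries flagged as extended point into a 256-entry tbl8 group indexed by the last 8 bits.

    EXT_FLAG = 1 << 31     # entry points into tbl8 rather than giving a next hop

    def dir24_8_lookup(tbl24, tbl8, ip):
        entry = tbl24[ip >> 8]                    # top 24 bits: usually enough
        if entry & EXT_FLAG:                      # prefix longer than /24
            group = entry ^ EXT_FLAG              # clear the flag bit
            entry = tbl8[group * 256 + (ip & 0xFF)]
        return entry                              # next-hop id (0 = no route)

Inserting a route with prefix length up to 24 fills every tbl24 slot covered by the prefix, which is exactly the redundant storage, and the memory cost, mentioned above.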

Modelled use cases

Having described the required elementary operations, the following sections model more complex use cases.

Network Address Translation (NAT)

This use case models a full-cone NAT, a quite simple network function. After receiving a packet, we need to check whether it is part of the uplink (from internal to external network) or downlink (from external to internal) traffic. We assume that in both cases the information required to forward the packet is available, i.e., a similar packet was already forwarded and an entry is present in the NAT lookup tables. If the traffic is uplink, the source IP and source TCP/UDP port are hashed, and based on the hash value an array access is made, which yields the new source port for the packet. In the case of downlink traffic, only one array access is needed, based on the destination port. The result of this lookup is the new destination IP and port for the packet. Once we have the required information, the proper IP and TCP/UDP fields are changed according to the rules of NAT, and a new checksum is calculated. In both cases, a 32-bit field (IP address) and a 16-bit field (Layer 4 port) are changed in the headers. The uplink and downlink models can be seen in Figure 19 and Figure 20.

mem_I/O(6,0)

parse(4) //Src. IP

parse(2) //Src. port

hash_lookup()

array_access()

encapsulation(4) //Src. IP

encapsulation(2) //Src. port

inc_cksum(1,2) //IP + L4

mem_I/O(6,0)

Figure 19: NAT UpLink model

mem_I/O(6,0)

parse(4) //Dst. IP

parse(2) //Dst. port

array_access()

encapsulation(4) //Dst. IP

encapsulation(2) //Dst. port

inc_cksum(1,2) //IP + L4

mem_I/O(6,0)

Figure 20: NAT DownLink model


Broadband Network Gateway

The BNG use case has been described in Section 3.2. It contains several elementary operations and (for the IPv4 case) the NAT function that was described above. The uplink and downlink EO models can be seen in Figure 21 and Figure 22.

mem_I/O(19,0)

parse(2) //Ethertype

NAT()

parse(4) //Dst IP

lpm()

parse(1) //TTL

parse(6) //Src. MAC

parse(6) //Dst. MAC

decrease(1) //TTL

encapsulation(1) //TTL

encapsulation(6) //Src. MAC

encapsulation(6) //Dst. MAC

mem_I/O(19,0)

Figure 21: BNG UpLink model

mem_I/O(17,0)

NAT()

parse(4) //Dst IP

lpm() (+ opt. shaping )

parse(1) //TTL

parse(6) //Src. MAC

parse(6) //Dst. MAC

decrease(1) //TTL

encapsulation(1) //TTL

encapsulation(6) //Src. MAC

encapsulation(6) //Dst. MAC

mem_I/O(17,0)

Figure 22: BNG DownLink model

IPSec ESP protocol

In this section a model for the IPSec Encapsulating Security Payload (ESP) protocol is presented [39]. The ESP protocol is able to provide confidentiality, connectionless integrity and data-origin authentication. ESP operates directly on top of IP. There are several cryptographic algorithms which can be used for providing security. One of the standard solutions is to encrypt the packet data with a sufficiently strong encryption algorithm, and apply a Message Authentication Code (MAC) at the end of the message. The encryption provides confidentiality, and the MAC is responsible for integrity and origin authentication. For encryption, AES-256 in CBC block chaining mode is recommended in IPSec ESP [30, 32]. For message authentication, HMAC-SHA-256-128 is allowed [46, 31, 38]. Our model incorporates these cryptographic functions to provide the IPSec ESP protocol. The model for this operation is quite simple, as can be seen in Figure 23.

Note that IPSec ESP mode requires the whole packet to be processed. Therefore, during the mem_I/O operation, it is usually beneficial to move the whole packet content into the L1 cache if available. The implementation of the cryptographic algorithms depends heavily on the target architecture. Several processors, such as Intel and ARMv8 ones, have hardware-implemented cryptographic primitives. These are naturally faster than the software-implemented versions, and this has to be considered when mapping the model to the specific hardware.


mem_I/O(n,0)

AES-256-CBC(n)

HMAC-SHA-256-128(n)

mem_I/O(n,0)

Figure 23: IPSec ESP model

Mapping to Intel Xeon Sandy Bridge

In this section the model elements are mapped to an Intel Xeon E5-2630 CPU [1]. First a brief introduction is given to the Sandy Bridge architecture. Then the EO costs are determined, and finally the performance model is compared to real measurement values.

The Sandy Bridge architecture

The Intel Sandy Bridge architecture was introduced in 2011 as a further development of the Nehalem design. The architecture contains improved branch prediction, new vector instructions, 256-bit floating point units and out-of-order execution with micro-operations. The Sandy Bridge architecture is based on a 32 nm manufacturing process, while its successor, the Ivy Bridge processors, use a 22 nm process. Sandy Bridge CPUs have 2-8 cores (with possible hyper-threading), 32 kB L1i and L1d caches and 256 kB L2 cache (both instruction and data).

Figure 24: Simplified pipeline of Sandy Bridge

The simplified overview of the Sandy Bridge pipeline can be seen in Figure 24. The pipeline is similar to earlier architectures, but one of the main changes was the addition of a micro-operation cache to the instruction decoders; several improvements were also made to other units. The standard CPU instructions are decoded into micro-ops by the decoder units. These micro-ops can be executed on one of six ports, based on their type. For example, ALU instructions can be executed on three different ports (Port 0, Port 1 and Port 5), memory load instructions on Ports 2 and 3, and memory store operations on Port 4.


Some instructions can only be issued to a specific port, which can slow down the execution in certain circumstances. The pipelining factor of a CPU depends heavily on the port characteristics, i.e., how many ports are able to execute specific instructions. However, it is also important that the ports are utilized as efficiently as possible. In order to achieve this, Intel uses out-of-order execution in Sandy Bridge CPUs. This means that the out-of-order engine detects dependency chains in the instructions to be executed, and issues the instructions to different ports in an optimal way. The three major components of the out-of-order engine are the Renamer (moving instructions to the execution core), the Scheduler (queuing and dispatching instructions to ports) and the Retirement (retiring instructions and handling faults).

The interested reader can find more information about the Sandy Bridge architecture in the Intel 64 and IA-32 Architectures Optimization Reference Manual [2]. From the modelling perspective, the important parts of the architecture are the ports and their execution capabilities. By knowing which instruction can be executed on which port, the attainable pipelining factor can be determined, and a performance estimate can be given.

Fine tuning EOs

Data transfer

After the packet is in the L3 cache, the fields required for the processing must be moved to the L1/L2 cache. Accessing the L3 cache takes around 28 cycles. Reading a cache line into the L1/L2 cache takes around 5 cycles, whereas writing a cache line back needs 9 clock cycles after accessing the L3 cache [3]. The approximate cost of the mem_I/O EO, based on the previous assumptions, is the following:

28 + [5|9] × ⌈min(32 KB, L1_n) / 64 B⌉ for data in the L1 cache

28 + [5|9] × ⌈min(256 KB, max(0, L1_n − 32 KB) + L2_n) / 64 B⌉ for data in the L2 cache

where the multiplier is 5 when reading and 9 when writing data, and L1_n/L2_n is the number of bytes to read/store in the L1/L2 data cache.
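The two expressions can be folded into a single helper; the sketch below reflects our reading of the formula (L1_n bytes destined for L1, L2_n for L2; the read/write multiplier is selected by a flag):

    import math

    def mem_io_cycles(l1_bytes, l2_bytes, write=False):
        per_line = 9 if write else 5                   # cycles per 64 B line
        l1 = math.ceil(min(32 * 1024, l1_bytes) / 64)
        l2 = math.ceil(min(256 * 1024,
                           max(0, l1_bytes - 32 * 1024) + l2_bytes) / 64)
        return 28 + per_line * (l1 + l2)               # 28 cycles to reach L3

    # Example: mem_I/O(6,0) as used in the NAT model costs
    # mem_io_cycles(6, 0) == 33 to read and mem_io_cycles(6, 0, write=True) == 37
    # to write back, matching the per-packet numbers used below.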

Note that even though the performance of one core is represented, scaling the estimates to multiple cores on this architecture is straightforward. The L3 cache and main memory are shared resources, but, for example, the main memory on one channel can serve around 70 Mtps, as presented in [50]. Since there are four memory channels for 6 cores per socket, it is safe to say that when the cache memories can be utilized well, the memory will not be a bottleneck for the network application's performance. On architectures with higher core counts, or stronger constraints on shared resources, considering memory transactions is essential when estimating performance.

Checksum

After implementing the incremental checksum calculation and looking at the generated code, we can see that updating the checksum when a 16-bit value changes (such as a TCP port) requires around 15 clock cycles. When a 32-bit value is changed, the update requires 20 clock cycles.
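As a reference for what such an update looks like, the following is a minimal sketch of the RFC 1624 incremental checksum update for a single changed 16-bit field; it illustrates the technique and is not the exact code that was measured.

    #include <stdint.h>

    /* Incremental Internet checksum update for one changed 16-bit field,
     * following RFC 1624: HC' = ~(~HC + ~m + m'). A 32-bit changed value
     * is handled by applying the update to its two 16-bit halves. */
    static uint16_t csum_update16(uint16_t old_csum, uint16_t old_val, uint16_t new_val)
    {
        uint32_t sum = (uint16_t)~old_csum + (uint16_t)~old_val + new_val;
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }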

Hash lookup

Since the Intel CPU used supports the SSE 4.2 instructions, a simple CRC32 hash can be computed with the help of the hardware. According to the datasheets, the crc32 instruction requires 3 clock cycles [2].


Applications usually calculate CRC32 on 64-bit values, and using 64-bit registers as input parameters for crc32 is also possible. In most cases the output of CRC is converted to a suitable array index as described earlier, but since these are simple operations, the index should be calculated in around 9-10 clock cycles.
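A minimal sketch of this hash-plus-index step, using the SSE 4.2 intrinsic; the table size and the CRC seed are illustrative assumptions, not values from the model.

    #include <stdint.h>
    #include <nmmintrin.h>          /* SSE 4.2: _mm_crc32_u64 */

    #define TABLE_BITS 20           /* assumed power-of-two table size */

    /* Hash a 64-bit flow key with the hardware crc32 instruction and mask
     * the result down to an array index. Compile with -msse4.2. */
    static inline uint32_t flow_index(uint64_t key)
    {
        uint64_t h = _mm_crc32_u64(0xffffffffULL, key);
        return (uint32_t)h & ((1u << TABLE_BITS) - 1);
    }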

LPM lookup

For Intel Sandy Bridge, the clock cycles and memory accesses required for the LPM lookup can be seen in Table 5, based on the implementation presented earlier. When the flows are spread, they cannot be cached efficiently, therefore L3 or DRAM accesses are needed. As described earlier, branch prediction can also affect the performance. On Intel Sandy Bridge, according to measurements, a branch misprediction costs around 17 clock cycles; this happens when a flow matches a prefix longer than 24 bits [3].

Prefix   Spread flows                           Concentrated flows
≤ 24     13 cycles + 2 L3/DRAM accesses         21 cycles (L1) or 37 cycles (L2)
> 24     21(+17) cycles + 3 L3/DRAM accesses    33(+17) cycles (L1) or 57(+17) cycles (L2)

Table 5: LPM execution cycles and memory accesses on Intel Xeon
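The lookup structure assumed by these numbers is a DIR-24-8-style two-level table: one array indexed by the top 24 address bits, and a second table for the rare longer prefixes. A sketch of the lookup follows; the table layout and entry encoding are assumptions for illustration, not the exact structures used in the model.

    #include <stdint.h>

    extern uint16_t tbl24[1 << 24];  /* indexed by the top 24 bits of the address */
    extern uint16_t tbl8[];          /* chunks of 256 entries for >24-bit prefixes */

    /* One array access resolves prefixes up to /24; the high bit of the
     * entry marks an escape into the second table, which costs the extra
     * access and the likely branch misprediction discussed above. */
    static inline uint16_t lpm_lookup(uint32_t dst_ip)
    {
        uint16_t e = tbl24[dst_ip >> 8];
        if (e & 0x8000) {
            uint32_t idx = ((uint32_t)(e & 0x7fff) << 8) | (dst_ip & 0xff);
            e = tbl8[idx];
        }
        return e;   /* next-hop identifier */
    }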

NAT performance estimation

Based on the previously presented model and the cost of the EOs, we can determine the required cycle count for both uplink and downlink traffic. The NAT uplink function should cost around 180 clock cycles based on the model. The receive part of I/O takes 33 cycles (reading one cache line), and reading the source IP and port takes two L1 cache accesses. Hashing these values and performing an array lookup costs 39 cycles; writing the new IP and port is again two L1 accesses. The checksum computation costs 55 cycles overall, and finally copying the data back from L1 to L3 takes 37 clock cycles. The required cycles for the downlink traffic can be calculated in the same manner. The only real difference is that there is no hashing, which reduces the cycle count by 10.

Note that these values are independent of the packet size. This is expected, since the operations work only on the header values, and thanks to the incremental checksum update the Layer 4 checksum does not need to be calculated over the whole packet either. Dividing the effective core clock (around 2.8 GHz, as implied by the figures) by the per-packet cycle counts of 180 and 170 cycles, one core of the Xeon CPU should be able to handle 15.55 Mpps of uplink and 16.47 Mpps of downlink traffic.

BNG performance estimation

The performance for the Broadband Network Gateway use-case can be estimated the same way. The NAT function in the BNG use-case runs in a virtual machine. Although host and guest machines can have shared memory areas, the I/O part must be taken into account again when running the NAT function, since the relevant header fields have to be fetched into the L1 cache of the core which runs the NAT VNF. On the host machine, transferring data between the caches takes in total around 70 clock cycles. In the uplink case, reading the Ethertype takes one L1 cache access. The cycles required by the NAT function were already calculated above. Extracting the IP address for the routing lookup takes again one L1 access. Assuming the flows are concentrated and only prefixes of at most 24 bits are used, performing an LPM lookup takes on average 30 cycles, as described earlier. Finally, the TTL is decreased, which is a minimal overhead, and the source and destination MACs are changed; this takes around 21 clock cycles overall when the destination MAC


is present in L1. In total, the BNG uplink processing costs 309 clock cycles. With similar calculations we can determine that in the downlink case, processing requires around 295 cycles. Using these values, the estimated performance on the CPU used is 9.06 Mpps in the BNG uplink case and 9.5 Mpps in the downlink case.

IPSec ESP performance estimation

As described earlier, the cryptographic primitives used by the IPSec ESP model are AES-256 in CBC mode and SHA-256 [30, 46]. The Intel CPU used implements the AES-NI instruction extension, which accelerates encryption and decryption with the AES algorithm. During the performance estimation of this primitive, it is assumed that encryption is implemented with these instructions instead of a software implementation. Intel has upcoming instructions for accelerating SHA performance on its processors, but unfortunately the examined CPU does not have this feature yet. Therefore, a software implementation is used for estimating the performance of the hash computation.

Cost of AES-256 CBC

As described, the encryption algorithm providing confidentiality in the IPSec model is AES-256 in CBC mode, which conforms to the IPSec specifications. The encryption cost of an n-byte data block in clock cycles on an Intel Sandy Bridge CPU is:

C_AES256-CBC(n) = C_Fix + C_Var = 10 + ⌈n/16⌉ · (7 + 14 · 7)

Note that memory access times were not considered in the cost calculation. This is because it is assumed that the data resides in the L1 cache (which is large enough to contain even a 1500-byte packet in full), and data transfer operations are executed on different ports than standard ALU operations. Therefore, transfers can overlap with the AES encryption, allowing a higher throughput.
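The formula is easy to evaluate mechanically. The sketch below computes the modeled cycle count and converts it to throughput; the core frequency is an assumption here (the Mb/s figures below imply an effective clock in the 2.5-2.6 GHz range), so the printed numbers should be read as model output, not as a measurement.

    #include <stdio.h>

    /* AES-256-CBC cost model: C_Fix plus, per 16-byte block, 14 rounds of
     * 7 cycles each and 7 cycles of per-block overhead. */
    static double aes256_cbc_cycles(unsigned n)
    {
        unsigned blocks = (n + 15) / 16;
        return 10.0 + blocks * (7.0 + 14.0 * 7.0);
    }

    int main(void)
    {
        const double core_hz = 2.55e9;   /* assumed effective core clock */
        for (unsigned n = 16; n <= 1024; n *= 2)
            printf("%5u B: %7.1f Mb/s\n", n,
                   n * 8.0 / aes256_cbc_cycles(n) * core_hz / 1e6);
        return 0;
    }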

OpenSSL AES-256 CBC performance

The modeled AES function can be validated by measuring the OpenSSL encryption speed and comparing it to the estimated values of the model. For measuring OpenSSL, the OpenSSL speed application was used, which is a built-in benchmarking tool in OpenSSL. The measurements were done on smaller data sizes, matching the payload sizes of standard network packets. The estimated throughput values and measurement results can be seen in Figure 25. The maximum error of the model is 2.85%, at the 1024-byte data size. This is a sufficiently low error for modeling more complex use-cases with AES encryption as a component.

Measurements were also done with Callgrind, a tool in Valgrind for gathering instruction execution data and for cache simulation [4]. The tool confirmed the assumption of a low L1 cache-miss rate. When running the benchmark for 3 seconds on 1 KB data size, 17.5M data reads occur, and the number of L1 cache misses according to Callgrind is 17, which is extremely low. When encrypting 16 KB elements, the number of data reads is the same (since the throughput of the encryption is identical), and the number of L1 cache misses is 257, which is still only 0.001% of the reads.


Size     Estimation    OpenSSL
16 B     2848.3 Mb/s   2835.0 Mb/s
64 B     3038.2 Mb/s   2974.8 Mb/s
128 B    3072.3 Mb/s   2997.8 Mb/s
256 B    3089.7 Mb/s   3007.4 Mb/s
512 B    3098.4 Mb/s   3023.9 Mb/s
1024 B   3102.8 Mb/s   3017.0 Mb/s

Figure 25: AES-256 performance estimation on Intel Xeon

Cost of SHA-256

For the authentication part of IPSec, HMAC with SHA-256 was chosen. Unfortunately there is no hardware implementation of SHA-256 in Intel Sandy Bridge processors, therefore a software implementation is used. The base code used for modeling is the SHA-256 implementation of OpenSSL. The cost of an n-byte data block in clock cycles on an Intel Sandy Bridge CPU is:

C_SHA-256(n) = C_Fix + C_Var = 28 + ⌈n/64⌉ · 1156

OpenSSL SHA-256 performance

The model can again be verified using OpenSSL's benchmarking tool. Since the model is based on the OpenSSL implementation, it is expected that the estimated and measured performance are close to each other. The results can be seen in Figure 26. Note that at larger data sizes (512 and 1024 bytes), the model actually slightly underestimates the real performance. The reason is that at small data sizes the preprocessing work, due to the padding of SHA-256, can take a significant share of the time, and the pipelining factor is probably even higher than 2.5. Still, the error at 1024 bytes is only around 2.8%, which is sufficiently small for our purposes.

Cache behavior was also simulated using Callgrind. On 16-byte data size, out of 14.6 million read accesses there were only 4 cache misses. When computing SHA-256 on 16 KB blocks, out of 17 million reads only 261 missed the L1 cache.

Data size   Estimation    OpenSSL
16 B        288.7 Mb/s    265.1 Mb/s
64 B        584.3 Mb/s    546.2 Mb/s
128 B       782.1 Mb/s    773.9 Mb/s
256 B       941.6 Mb/s    949.7 Mb/s
512 B       1048.4 Mb/s   1070.3 Mb/s
1024 B      1111.5 Mb/s   1143.9 Mb/s

Figure 26: SHA-256 performance estimation on Intel Xeon


Performance estimation of IPSec ESP

Based on the estimated cryptographic primitives, a performance estimation can be given for the full IPSec ESP protocol. The required cycles for the mem_I/O EO can be computed as presented earlier. Table 6 contains the estimation results for the IPSec ESP protocol.

Data size   mem_I/O   AES-256   HMAC1   HMAC2   Sum     Kpps
64 B        70        430       3496    2340    6336    441.919
128 B       84        850       4652    2340    7926    353.267
256 B       112       1690      6964    2340    11106   252.116
512 B       168       3370      11588   2340    17466   160.311
1024 B      280       6730      20836   2340    30186   92.758
1500 B      392       9880      30084   2340    42696   65.578

Table 6: IPSec component clock cycles and throughput estimation on Intel Xeon
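The per-packet totals in Table 6 can be reproduced by combining the EO costs derived above. The sketch below does so; the HMAC block-count arithmetic (two extra inner SHA-256 blocks for the ipad and padding, plus a fixed two-block outer hash) and the 2.8 GHz effective clock are inferred from the table values and should be read as assumptions of this reconstruction, not as the definitive model.

    #include <stdio.h>

    static unsigned cdiv(unsigned a, unsigned b) { return (a + b - 1) / b; }

    /* Per-packet IPSec ESP cycle model: full-packet read and write-back
     * (mem_I/O), AES-256-CBC encryption, inner and outer HMAC-SHA-256. */
    static double ipsec_esp_cycles(unsigned n)
    {
        double mem_io = (28 + 5.0 * cdiv(n, 64))         /* read packet to L1 */
                      + (28 + 9.0 * cdiv(n, 64));        /* write packet back */
        double aes    = 10 + cdiv(n, 16) * (7 + 14 * 7.0);
        double hmac1  = 28 + (cdiv(n, 64) + 2) * 1156.0; /* inner hash        */
        double hmac2  = 28 + 2 * 1156.0;                 /* outer hash        */
        return mem_io + aes + hmac1 + hmac2;
    }

    int main(void)
    {
        const double core_hz = 2.8e9;    /* clock implied by the Kpps column */
        const unsigned sizes[] = { 64, 128, 256, 512, 1024, 1500 };
        for (int i = 0; i < 6; i++)      /* reproduces the Sum and Kpps columns */
            printf("%5u B: %6.0f cycles, %8.3f Kpps\n", sizes[i],
                   ipsec_esp_cycles(sizes[i]),
                   core_hz / ipsec_esp_cycles(sizes[i]) / 1e3);
        return 0;
    }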

Mapping to Intel Atom

Intel Atom has quite powerful features, especially for integer-focused computation, despite its low power consumption. However, Sandy Bridge has many advantages that make it superior to Atom. Sandy Bridge has six execution ports: three of them can be used for accessing ALUs or SIMD/floating point execution units, two can be used for load instructions, and one for store instructions. This way, Sandy Bridge can service two loads and one store simultaneously, per clock cycle. Atom only has two general ports, and there are several instructions which can be executed on only one of them. It is also not possible to execute memory read and write operations simultaneously on Atom, which is a huge drawback [5, 2].

Another great advantage of Sandy Bridge is out-of-order execution. The in-order front end decodes the instructions into micro-operations. These micro-ops are sent to the execution ports by an out-of-order, super-scalar engine. The out-of-order engine is able to detect dependency chains and sends the instructions to the ports in an optimal way while maintaining correct data flow. Up to six micro-ops can be executed per clock cycle. This means around 4 full instructions per clock cycle (or even 5 when a special operation called macro-fusion is used), which can easily be achieved if the instructions have sufficient variability, i.e. the micro-ops can be sent to different execution ports. Atom, on the other hand, only has an in-order execution unit, which means the CPU stalls when waiting for a resource (e.g. after a cache miss), or when executing long-latency instructions. Since there are only two execution ports, the maximum instruction-per-cycle count is two, and even this is hard to achieve, because the compiled code must be optimized for the Atom architecture so that the instructions can be executed in pairs [5].

There are also differences in the memory structure between Atom and Xeon. One of the largest disadvantages of Atom is that it does not have an L3 cache. There is only a fast (3-clock access time) but smaller (24 KB) L1 cache, and a unified 1 MB L2 cache. This means that in many cases where the L3 cache would have been accessible, Atom has to use the slow main memory, which also cannot be accessed for reading and writing at the same time, as described earlier.

Since the instruction extensions used on Xeon are also available on Atom, the executed code on the two platforms is very similar. For this reason, we decided to convert the performance model from Xeon to Atom. The main difference in the raw numbers is the lower L1 cache latency, but this is only a


minor change. Conversely, the fact that Atom has to use the main memory more often slows down the performance quite heavily.

As described earlier, there are also major differences in instruction execution between the two architectures. Looking at the instruction-per-cycle count and the lack of out-of-order execution, we can assume that the performance degradation from these factors is at least 50-55%. Note that this is probably still a conservative estimate, because of Sandy Bridge's complex and optimized instruction pipeline.

Changed EOs

The checksum and hash lookup EOs have the same clock cycle cost as on Xeon. The data transfer, however, changes, since no L3 cache is available on Atom. The new transfer costs are the following (the sketch given earlier applies with a base latency of 59 instead of 28):

59 + [5|9] · ⌈min(32 KB, L1_n) / 64 B⌉ for the L1 cache

59 + [5|9] · ⌈min(256 KB, max(0, L1_n − 32 KB) + L2_n) / 64 B⌉ for the L2 cache

The memory access times are based on benchmarks done with memory modules of the same speed [6]. The read and write multipliers are the same, since Atom is also able to transfer 16-byte elements between the L1/L2 cache and main memory.

The LPM lookup values also change because of the missing L3 cache, the different L1/L2 cache latencies, and the different branch misprediction costs. The estimated values can be seen in Table 7.

Prefix   Spread flows                        Concentrated flows
≤ 24     13 cycles + 2 DRAM accesses         19 cycles (L1) or 43 cycles (L2)
> 24     21(+14) cycles + 3 DRAM accesses    30(+14) cycles (L1) or 66(+14) cycles (L2)

Table 7: LPM execution cycles and memory accesses on Intel Atom

NAT performance estimation

The clock cycle estimation for the NAT function on Atom is performed in a similar way as in the Xeon case. The main differences between the CPU families are the memory access times and an overall lower performance on Atom due to architectural constraints. Reading a cache line from main memory into the L1 cache costs around 64 cycles on Atom. An L1 cache access on Atom is only 3 cycles, which means reading the source IP and port takes around 6 cycles. Hashing is the same as on Xeon; the array access, however, is slower, 59 cycles, since the lookup arrays only fit into main memory. Writing the new IP and port is again two L1 accesses, and the checksum computation is 55 cycles, as on the Xeon CPU. Finally, copying back from the L1 cache to main memory takes around 68 cycles. In total, the NAT uplink processing requires 268 clock cycles, and the downlink needs 258 clock cycles. As described earlier, Xeon and Atom have major architectural differences, so a further performance degradation is expected on Atom compared to Sandy Bridge. These effects have to be taken into consideration, along with the lower CPU frequency, when estimating performance. Based on the assumptions presented in Section 4.4 (which amount to an effective clock of roughly 1.2 GHz after the architectural degradation factor), the performance estimation on Atom is 4.47 Mpps for the NAT uplink and 4.65 Mpps for the NAT downlink.


BNG performance estimation

The expected performance in the BNG use-case can be calculated in a similar way. The I/O parts are the same as described in the NAT performance estimation. Again assuming concentrated flows, the LPM lookup takes on average 43 clock cycles (in general the lookup table cannot fit into the L1 cache). Decreasing the TTL and changing the MAC addresses takes around 16 cycles when the required destination MAC is present in L1. Overall, on Atom the BNG use-case should process 2.58 Mpps in the uplink and 2.65 Mpps in the downlink case.

IPSec ESP performance estimation

In this section we map the IPSec ESP use-case to Atom, deriving the estimated performance by considering the costs of encryption and hashing separately.

AES-256 CBC performance estimation

The cost of AES-256 CBC encryption can be estimated with the following formula:

C_AES256-CBC(n) = C_Fix + C_Var = 10 + ⌈n/16⌉ · (7 + 14 · 8)

Verification results can be seen in Figure 27. At 16-byte data size there is an overestimation, but at the other data sizes the error decreases and becomes basically negligible.

Data size   Estimation    OpenSSL
16 B        1021.9 Mb/s   813.6 Mb/s
64 B        1085.0 Mb/s   990.9 Mb/s
128 B       1096.3 Mb/s   1059.4 Mb/s
256 B       1102.1 Mb/s   1097.2 Mb/s
512 B       1104.9 Mb/s   1117.2 Mb/s
1024 B      1106.4 Mb/s   1127.5 Mb/s

Figure 27: AES-256 performance estimation on Intel Atom

SHA-256 performance estimation

The estimated cost of SHA-256 is the following:

C_SHA-256(n) = C_Fix + C_Var = 28 + ⌈n/64⌉ · 1605

The measured and estimated values can be seen in Figure 28. Here the estimation works well at basically all data sizes.


Data size   Estimation   OpenSSL
16 B        80.7 Mb/s    69.0 Mb/s
64 B        162.8 Mb/s   150.3 Mb/s
128 B       217.7 Mb/s   205.9 Mb/s
256 B       261.9 Mb/s   252.5 Mb/s
512 B       291.5 Mb/s   284.5 Mb/s
1024 B      308.9 Mb/s   304.2 Mb/s

Figure 28: SHA-256 performance estimation on Intel Atom

IPSec ESP

The estimated IPSec ESP performance for different data sizes can be seen in Table 8.

Data size   Kpps
64 B        104.8
128 B       87.1
256 B       65.1
512 B       43.3
1024 B      25.9
1500 B      18.5

Table 8: IPSec ESP estimation on Intel Atom

Mapping to Cavium ThunderX CN88XX

In this section the mapping of the model to an ARM-based device, the Cavium ThunderX, is presented [23]. First a brief introduction to the architecture is given, then the model EOs are mapped to ThunderX, and finally the estimated performance numbers are calculated.

Hardware architecture and specifications

The Cavium ThunderX CN88XX (later referred to as ThunderX) is a System-on-a-Chip solution primarily for data-center and cloud systems. ThunderX contains 48 custom-designed, ARMv8-compliant 64-bit processors with up to 2.5 GHz core clock frequency. There are several coprocessors in ThunderX for various tasks: controlling network interfaces, random number generation, pattern matching, a RAID unit and so on. ThunderX supports Ethernet interfaces up to a total of 80 Gbit/s, including 40 GbE network interface cards.

The CPU cores in ThunderX comply with the ARMv8 standard. Beside standard integer, floating point and memory operations, the cores also support SIMD instructions as well as several cryptographic primitives. The cores have a 78 KB instruction cache and exclusive 32 KB data caches, both with 128-byte cache-line size. Each core has two ports for instruction execution. Port 0 can only perform standard integer and load/store instructions, while Port 1 is able to execute integer, floating point, branch, SIMD and cryptographic instructions. To improve pipeline performance, each port also has a "parking lot" area which is used for out-of-order execution. ThunderX is also able to merge conditional and branch instructions into


a single operation, increasing the throughput of the pipeline even further. However, since certain instructions can be executed on only one port, it is safe to say that the executed instructions per cycle do not exceed 2. Thus, the whole system can execute up to 240G instructions/second, which is also confirmed by the datasheet.

ThunderX also provides a fully cache-coherent 16 MB L2 cache, which is shared among all CPUs of ThunderX. The line size of the L2 cache is also 128 bytes. An important part of ThunderX's inner structure is the Coherent Memory Interconnect (CMI), which provides a crossbar connection between the CPUs, memory channels and I/O bridges, and runs at core-clock frequency. The CMI is responsible for memory coherence support. The L2 cache initiates cache invalidations to the core data caches through the CMI. The L1 caches are write-through; however, they contain a write-buffer to minimize write operations on the CMI. Note that this is a weakly-consistent memory model, but it provides high performance. The L2 cache is partitioned into eight 2 MB data units called Tag-and-Data units (TAD). The CPU cores are also divided into eight groups. The inner structure, with the connections between the cores and TADs realized by the CMI, can be seen in Figure 29.

Figure 29: Cavium ThunderX Coherent Memory Interconnect

Unfortunately there is no public information about the cache latencies. Since cache latencies are important for mapping the model elements to the specific hardware, we decided to measure the cache access times. For this purpose, a tool called lmbench was used [7]. The application can measure the latency and bandwidth of both the cache and the main memory. ThunderX runs a special version of Linux (kernel version 3.18.0-jerin), and since the tool is able to measure these values even on specialized architectures like ThunderX, it can be used for determining the memory characteristics. Unfortunately, other tools such as Intel's Memory Latency Checker cannot be used for this purpose, since it is exclusive to the Intel architecture. Multiple lmbench tests were run in order to minimize the outliers in the benchmarks. According to the measurements, the L1 cache has 1.577 ns and the L2 cache 4.17 ns average latency. These can be converted to clock cycles using the 2.5 GHz core clock speed of ThunderX: expressed in clock cycles, L1 has 4 and the L2 cache has 11 cycles of latency. Note that these are the latencies of 16-byte transactions; the benchmarking tool measures the latencies at several different transaction sizes as well. Since the L2 cache is a shared resource, it is important


to know, from the modeling aspect, how many transactions per second the L2 cache can handle. Based on the latency values, one TAD of the L2 cache can serve 1/(4.17 · 10⁻⁹) = 239.8 Mtps (or similarly, when calculating with clock cycles, 2.5 GHz / 11 cycles = 227.3 Mtps). Note that the cache latency depends heavily on the cache size; other ARM CPUs with a similar L1 cache size also have 4-clock L1 latency [8], so it is no surprise that the measurement resulted in this value.

Thanks to the unified nature of the CMI, every TAD can be reached with the same latency from each CPU. Therefore, the L2 cache can support up to eight parallel operations, where each TAD has the previously calculated 239.8 Mtps. Consequently, the maximum theoretical throughput of the L2 cache is 8 · 239.8 = 1918.4 Mtps. Note that this is the theoretical maximum, where each TAD is utilized optimally; in a networking application where a large number of packets has to be processed, the cache memories are usually under constant stress. If every core executes applications with similar cache access patterns, the L2 cache can serve on average 39.96 Mtps per CPU.

Finally, the main memory structure and throughput have to be examined. ThunderX has four memory controllers, each with four memory channels, so in total there are 16 channels for all CPUs. The device can be equipped with DDR3 (800-2133 MHz data rate) as well as DDR4 (1600-2400 MHz) memory chips. When using the same DIMMs as in the Intel-based servers described earlier (DDR3 1333 MHz), the maximum transaction rate of one channel is around 70 Mtps. Note that this value does not change much even when using higher data rates or DDR4 DIMMs, since higher frequencies usually come with more CAS latency. Theoretically, with evenly distributed accesses from the CPUs, every controller and every channel can be utilized, and 1120 Mtps can be reached on the whole device. If the cores execute applications with similar memory access patterns, reaching different (physical) memory areas, the main memory can support on average 23.3 Mtps per CPU. Note that this is the best possible case.

The memory benchmarking tool also measures main memory bandwidth and latency values. These tests were run on a maximum of 8 GB of allocated memory. The ThunderX used for the measurement had 8x8 GB RAM, and the benchmarking tool allocated arrays of up to 8 GB. The latency for one 16 B transaction was 13.28 ns, which converts to 75.301 Mtps. This is consistent with the previous calculations and matches our expectations, since the arrays were probably allocated in the same memory module, and the array elements were read sequentially.

Elementary Operations on ThunderX

Since the basic architecture of ThunderX is similar to that of a standard Intel-based server, the EOs will be much alike the operations presented earlier for the x86 architecture. However, due to the large number of cores, the memory throughput has to be considered, since it can be a bottleneck for a memory-intensive application.

mem_I/O

On ThunderX, a DCA-like transfer between the I/O modules and the L2 cache is available. This is very similar to the packet transfer from the NICs to the L3 cache on the examined Intel Xeon architecture. However, the L2 cache is shared between 48 cores on ThunderX. On the Intel CPUs the throughput of the cache was enough to serve the required transactions using the DCA mechanism; in a highly parallel environment such as ThunderX, however, memory and cache transactions can become a bottleneck in the application. Therefore, the transaction count of the L2 cache has to be determined both for the packet I/O and for the packet processing


part. Using 16 B transactions, the transaction count is simply given by ⌈n/16⌉, where n is the packet size in bytes.

Since the L1 cache is the fastest, and it is not shared among the cores, it is almost always beneficial to fetch the packet headers into the first-level cache. The L2 cache is a shared resource between the cores; therefore, the overall cost of the data transfer to the L1 cache depends heavily on the scenario. If the application is dominated by ALU instructions and L1-cache operations, the L2 cache won't be a bottleneck. However, in a simple scenario such as port forwarding, the L2 cache might become a bottleneck, depending on the transactions required from and to this memory. For each modeled scenario, the transaction counts for the L2 cache have to be determined, and based on this information the bottlenecks of the application can be found.

LPM lookup

Although ThunderX has several different hardware accelerator engines, the LPM lookup cannot be performed by them. The considerations regarding the LPM lookup described earlier also hold on ThunderX. Note that older ARM CPUs with short pipelines did not have branch predictors (or only very simple ones), but since ARMv8 CPUs are fairly complex with longer pipelines, they also require this functionality. Unfortunately no information regarding branch misprediction was found for ThunderX, but this value should be similar to the 17-20 cycle costs seen on Intel processors. Calculating with a 20-cycle branch misprediction cost, the clock cycles required by the LPM lookup can be seen in Table 9.

Prefix   Spread flows                          Concentrated flows
≤ 24     13 cycles + 2 L2/DRAM accesses        21 cycles (L1) or 35 cycles (L2)
> 24     21(+20) cycles + 3 L2/DRAM accesses   33(+20) cycles (L1) or 54(+20) cycles (L2)

Table 9: LPM execution cycles and memory accesses on Cavium ThunderX

Hash lookup

Since the custom ARM CPUs comply with the ARMv8 standard, they can perform CRC32 instructions, which is also confirmed by the hardware reference manual [23]. According to the ARMv8 instruction set architecture, the CPUs implement CRC32 with two different polynomials (one of them being CRC32C), and the instructions can be used even on 64-bit values [9]. Therefore the hash procedure should be identical to the Intel case: calculate the CRC32 hash of a given value, and transform it into a suitable index into an array. Unfortunately no latency information is available for CRC32 on ThunderX. Looking at the Software Optimization Guide of the ARM Cortex-A57 CPU, which implements the same instruction set architecture, we can see that the CRC32 instructions have a latency of 3 clock cycles, which is identical to the latency on Intel CPUs [10]. It is safe to assume that the instruction latency of CRC32 on ThunderX is also around 3 clock cycles, and that overall the hashing and index creation does not take more than 9-10 clock cycles.

Array access

Larger arrays on ThunderX typically fit into the L2 cache or main memory. The access cost of these arrays depends heavily on the actual scenario, since both memory areas are shared resources, and their maximum transaction throughput has to be split among the cores using them. If an application is really L2-cache heavy, it is almost always more beneficial to use the main memory for array storage. There are four memory


controllers on ThunderX, each of them with four memory channels. The actual bottleneck, which in most cases will be either the L2 cache or the main memory, depends heavily on the memory access patterns of the modeled application.

Note that the proper allocation of memory structures is important in order to achieve the maximum performance: the accesses from the CPU cores should be spread evenly among both the memory controllers and the memory channels. The placement of data in the L2 cache is mostly out of our hands (except via prefetch instructions), but this is somewhat less important thanks to the crossbar structure of the CMI. With this solution basically every CPU can reach every TAD at the same speed, thus the transaction throughput of the L2 cache is the same in every case.

Checksum

The C code which served as the basis for the x86 assembly instructions can also easily be compiled for the ARM architecture. For compilation, the standard arm-linux-gnueabihf-gcc v4.8.2 cross-compiler was used, with -O3 optimization and the -march=armv8-a+crc architecture flag. Using the instruction latencies provided by the hardware reference manual of the Cavium ThunderX, the required clock cycle count for these functions can be determined [23]. Following this method, the cycle count for computing the incremental checksum with a 16-bit changed value is 14 (note that ThunderX can merge the compare and branch instructions into a single instruction). For 32-bit changed values, following the same logic, 24 clock cycles are needed.

NAT performance estimation

In this section the previously described NAT model is mapped to ThunderX. The maximum throughput per CPU for the NAT uplink case can be seen in Table 10, and for the downlink case in Table 11.

Data size   L2 cache     DDR3 memory   ALU
64 B        3.996 Mpps   23.3 Mpps     62.5 Mpps
128 B       2.220 Mpps   23.3 Mpps     62.5 Mpps
256 B       1.175 Mpps   23.3 Mpps     62.5 Mpps
512 B       0.605 Mpps   23.3 Mpps     62.5 Mpps
1024 B      0.307 Mpps   23.3 Mpps     62.5 Mpps
1500 B      0.210 Mpps   23.3 Mpps     62.5 Mpps

Table 10: Cavium ThunderX NAT uplink system resource bottlenecks (the DDR3 and ALU bounds are independent of the packet size)

Data size   L2 cache     DDR3 memory   ALU
64 B        3.996 Mpps   23.3 Mpps     71.4 Mpps
128 B       2.220 Mpps   23.3 Mpps     71.4 Mpps
256 B       1.175 Mpps   23.3 Mpps     71.4 Mpps
512 B       0.605 Mpps   23.3 Mpps     71.4 Mpps
1024 B      0.307 Mpps   23.3 Mpps     71.4 Mpps
1500 B      0.210 Mpps   23.3 Mpps     71.4 Mpps

Table 11: Cavium ThunderX NAT downlink system resource bottlenecks (the DDR3 and ALU bounds are independent of the packet size)
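The L2-cache column can be reproduced from the per-core transaction budget derived above. The sketch below assumes ⌈n/16⌉ transactions to read the packet in, the same number to write it back, plus two further transactions for the lookup array; these constants are inferred from the table values, so the sketch should be read as a reconstruction of the model rather than its definitive form. (The BNG tables below correspond to eight additional transactions per packet for moving the headers into the VM.)

    #include <stdio.h>

    /* Per-core L2-cache bound for the ThunderX NAT pipeline:
     * per-packet L2 transactions = packet in + packet out + lookup-array
     * accesses, divided into the per-core share of the L2 throughput. */
    static double nat_l2_bound_mpps(unsigned pkt_bytes)
    {
        const double l2_budget_mtps = 39.96;            /* 1918.4 Mtps / 48 cores */
        unsigned txn = 2 * ((pkt_bytes + 15) / 16) + 2; /* assumed transaction count */
        return l2_budget_mtps / txn;
    }

    int main(void)
    {
        const unsigned sizes[] = { 64, 128, 256, 512, 1024, 1500 };
        for (int i = 0; i < 6; i++)                     /* reproduces the L2 column */
            printf("%5u B: %.3f Mpps\n", sizes[i], nat_l2_bound_mpps(sizes[i]));
        return 0;
    }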


BNG performance estimation

The BNG performance estimation is based on the previously presented model; again, the resource bottlenecks are determined and an estimate of the maximum performance is given. The maximum throughput per CPU for each system resource can be seen in Table 12 for the BNG uplink case and in Table 13 for the BNG downlink case.

Data size   L2 cache     DDR3 memory   ALU
64 B        2.220 Mpps   23.3 Mpps     34.965 Mpps
128 B       1.536 Mpps   23.3 Mpps     34.965 Mpps
256 B       0.951 Mpps   23.3 Mpps     34.965 Mpps
512 B       0.540 Mpps   23.3 Mpps     34.965 Mpps
1024 B      0.289 Mpps   23.3 Mpps     34.965 Mpps
1500 B      0.201 Mpps   23.3 Mpps     34.965 Mpps

Table 12: Cavium ThunderX BNG uplink system resource bottlenecks (the DDR3 and ALU bounds are independent of the packet size)

Data size   L2 cache     DDR3 memory   ALU
64 B        2.220 Mpps   23.3 Mpps     38.759 Mpps
128 B       1.536 Mpps   23.3 Mpps     38.759 Mpps
256 B       0.951 Mpps   23.3 Mpps     38.759 Mpps
512 B       0.540 Mpps   23.3 Mpps     38.759 Mpps
1024 B      0.289 Mpps   23.3 Mpps     38.759 Mpps
1500 B      0.201 Mpps   23.3 Mpps     38.759 Mpps

Table 13: Cavium ThunderX BNG downlink system resource bottlenecks (the DDR3 and ALU bounds are independent of the packet size)

IPSec ESP performance estimation

Similar to AES-NI, the ARMv8 standard contains AES-specific instructions for hardware acceleration [9]. Since ThunderX implements ARMv8, these instructions can be used for high-performance encryption and decryption. Moreover, ARMv8 also has several instructions for enhancing SHA performance and helping hash computation; both SHA-1 and SHA-256 are supported. The cryptographic instructions generally use 128-bit vector registers. This results in high performance, and it is especially effective for AES, since its block size is 128 bits for every key size.

Here we only summarize the results of the calculations:

AES-256 CBC performance estimation

The cost of AES-256 CBC encryption can be estimated with the following formula:

C_AES256-CBC(n) = C_Fix + C_Var = 55 + ⌈n/16⌉ · (42 + 14 · 6)

The performance estimation values compared to the Intel Xeon results can be seen in Table 14. Note that despite the lack of pipelining, ThunderX with its lower clock frequency performs well compared to the Intel Xeon at larger data sizes. The performance estimations were made under the assumption that all data required for the computation resides in the L1 cache. Although the raw performance of ThunderX is good, 48


cores moving data between the L1 caches and the L2 cache/main memory can add significant overhead, which has to be examined during the full IPSec estimation.

Data size   Cavium ThunderX   Intel Xeon
16 B        1686.1 Mb/s       2848.3 Mb/s
64 B        2183.7 Mb/s       3038.2 Mb/s
128 B       2296.7 Mb/s       3072.3 Mb/s
256 B       2357.7 Mb/s       3089.7 Mb/s
512 B       2389.4 Mb/s       3098.4 Mb/s
1024 B      2405.6 Mb/s       3102.8 Mb/s

Table 14: Comparison of AES-256 CBC throughput on Cavium ThunderX and Intel Xeon

SHA-256 performance estimation

The cost of SHA-256 is the following:

C_SHA-256(n) = C_Fix + C_Var = 17 + ⌈n/64⌉ · 549

Based on the cost function, the performance estimation of SHA-256 on the Cavium ThunderX compared to the Intel Xeon at different data sizes can be seen in Table 15. We can conclude that the Cavium ThunderX, despite its lower core clock frequency and narrower pipeline, is almost twice as fast as the examined Intel Xeon CPU. Note that this is raw computing performance, where the memory overheads are basically negligible; in a real scenario, data has to be transferred from and to the L1 cache, and on this architecture, due to the large number of cores, this can also be relevant.

Data size   Cavium ThunderX   Intel Xeon
16 B        539.2 Mb/s        288.7 Mb/s
64 B        1094.8 Mb/s       584.3 Mb/s
128 B       1467.2 Mb/s       782.1 Mb/s
256 B       1767.9 Mb/s       941.6 Mb/s
512 B       1969.7 Mb/s       1048.5 Mb/s
1024 B      2088.9 Mb/s       1111.5 Mb/s

Table 15: Comparison of SHA-256 throughput on Cavium ThunderX and Intel Xeon

IPSec ESP

Using the estimated values for AES-256 and SHA-256, an upper bound can be given for the performance of the IPSec ESP pipeline. Table 16 shows the estimated performance both for the CPU and for the memory at different data sizes.


Data size   AES-256   HMAC1   HMAC2   Sum     CPU Kpps   L2 Kpps
64 B        559       1664    1115    3338    749.0      2497.5
128 B       1063      2213    1115    4391    569.3      1248.8
256 B       2071      3311    1115    6497    384.8      624.4
512 B       4087      5507    1115    10709   233.4      312.2
1024 B      8119      9899    1115    19133   130.7      156.1
1500 B      11899     13742   1115    26756   93.4       106.3

Table 16: IPSec component clock cycles and Kpps estimation on Cavium ThunderX

It is interesting that, despite the large memory overhead, the CPU is the bottleneck in the computation even for 1500-byte packets, due to the large number of cycles spent on encryption and hash computation. Note that the Cavium performs better than the Xeon thanks to the hardware implementation of the SHA-256 function, which significantly decreases the required clock cycles.

Performance Evaluation

Broadband Network Gateway

This section contains measurement results from both BNG setups. While the results for the simplified setup can be compared to the earlier results shown in deliverable 5.4 [34] section 4.5, the results with the pipeline containing NAT can be compared to the modelling results.

QinQ to GRE pipeline

The measurement setup can be seen in Figure 30. Measurements were done both for the bare metal case and for three virtualized cases, where the interfaces towards the BNG pipeline were (1) SR-IOV (PCI pass-through), (2) vhost-user and (3) VM virtio.


Figure 30: Measurement setup for the simplified BNG setup

The results can be seen in Table 17.

              Bare Metal   SR-IOV       vhost-user   VM virtio
Max. pps      21.95 Mpps   18.98 Mpps   4.2 Mpps     0.67 Mpps
Loss free     65%          none         100%         none
Max. energy   125.2 W      138.9 W      N/A          145.6 W

Table 17: Measurement results for the simplified BNG setup

NAT implementation and validation

In order to test the performance of a NAT function, a DPDK application was implemented. The program can be run in standalone mode, or as a Virtual Network Function in a virtualized environment. The application realizes a full-cone NAT, and its structure is very simple. For the uplink traffic, after reading the IP address and L4 destination port, DPDK's default hash is used to find the corresponding entry. The default DPDK hash is CRC32, which fits our model; however, instead of creating a perfect hash, DPDK implements a bucket-hash structure. The lookup procedure in the buckets somewhat decreases the performance compared to the model presented earlier. After getting the hash result, a simple array lookup is performed (the index is provided by DPDK), the source IP and port are changed, and the incremental checksum is calculated. The calculation is implemented based on the corresponding RFCs, and was verified using packet analyzer tools such as Wireshark. Packet receiving and sending is completely done by DPDK, which uses Direct Cache Access for performance reasons, as described earlier. As we can see, the implemented NAT function fits the previously described model well.
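For reference, the uplink lookup path described above maps onto DPDK's rte_hash API roughly as follows. This is a sketch with current DPDK type names; the key layout and the nat_hash/nat_table names are illustrative, not taken from the measured application.

    #include <rte_hash.h>
    #include <rte_ip.h>
    #include <rte_tcp.h>

    /* Assumed flow key: inner IP address and L4 port. */
    struct flow_key {
        uint32_t ip;
        uint16_t port;
    } __attribute__((packed));

    struct nat_entry { uint32_t ext_ip; uint16_t ext_port; };

    /* nat_hash is an rte_hash created with CRC32 as its hash function;
     * nat_table holds translations at the positions the hash returns. */
    static void nat_uplink(struct rte_hash *nat_hash, struct nat_entry *nat_table,
                           struct rte_ipv4_hdr *ip, struct rte_tcp_hdr *tcp)
    {
        struct flow_key key = { .ip = ip->src_addr, .port = tcp->src_port };
        int32_t pos = rte_hash_lookup(nat_hash, &key);   /* bucket-hash lookup */
        if (pos >= 0) {
            ip->src_addr  = nat_table[pos].ext_ip;       /* rewrite source */
            tcp->src_port = nat_table[pos].ext_port;
            /* incremental checksum update of ip->hdr_checksum and
             * tcp->cksum follows, as modeled earlier */
        }
    }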

The Xeon NAT performance estimation is shown in Figures 31 and 32. The figures contain the estimated values, the real measured values on a 10 Gbps link, and the maximum theoretical throughput of the network link. We can see that there is a large deviation between the real and estimated values for 64-byte packets.


This can be explained by the complexity of DPDK's hash algorithm. This hash is not a perfect hash, therefore collisions are allowed. Collisions are managed with bucket lists. The bucket management and the calculation of the proper array index (which is the actual output of the hash) can cause a large overhead. Looking at the downlink case, the difference between the measured and estimated values is much lower; the error is around 10%. In both cases, at larger data sizes the interface limits the bandwidth, therefore the error is naturally larger.

Figure 31: Xeon NAT UpLink estimation

Figure 32: Xeon NAT DownLink estimation

The measurements were also done on an Intel Atom server, with the same NAT application. The measured and estimated values can be seen in Figures 33 and 34. There is quite a difference between the estimated and real values in the uplink case, which is somewhat expected after seeing the results on Xeon. Unfortunately the downlink also has a rather large error; the estimate is around 1.5 times larger than the measured value. The reason could be that modeling the architectural differences between Atom and Sandy Bridge is a challenging task with many variables.


Figure 33: Atom NAT UpLink estimation

Figure 34: Atom NAT DownLink estimation

BNG validation

The BNG use-case was implemented as a Ryu application, which can be used to program OpenFlow switches [43, 11]. Ryu is a Python-based OpenFlow controller framework which supports the protocol up to OpenFlow 1.5. The performance tests were performed on the Ericsson Research Flow Switch (ERFS) developed by Ericsson Hungary. The NAT function used is the previously presented NAT DPDK application running in a virtual machine, using Inter-VM Shared Memory between the host and the guest.

The Intel Xeon measurement results and estimated values are shown in Figures 35 and 36. Again, there is a rather large error in the estimation for the uplink case; this can be attributed to the non-ideal NAT implementation. In the downlink case, however, the error between the measurements and the estimation at 64-byte packet size is only 0.2%. At higher packet sizes the interface limits the performance of the software switch, therefore the error increases, but 64-byte values are usually the most important when it comes to performance.


Figure 35: Xeon BNG UpLink estimation

Figure 36: Xeon BNG DownLink estimation

Estimation and measurement values for the BNG use-case on Intel Atom can be seen in Figures 37 and 38. Just like with Xeon, the uplink estimation is a little worse than the downlink, which is no surprise, since the same NAT application is used on Atom as well. In the downlink case the error of the estimation is around 16% up until 512 bytes, where the 10 Gbps link cap is reached.


Figure 37: Atom BNG UpLink estimation

Figure 38: Atom BNG DownLink estimation


Conclusions

As we conclude the work on the Universal Node in UNIFY, we argue that we have clearly shown that it is a new and efficient solution for Network Function Virtualization (NFV).

First, we demonstrated that UNs can be deployed as traditional compute nodes in a classic data center, so from an operations personnel perspective this is similar to current IT practices. Each UN can run several VNFs on multiple different architectures at high performance. Given its ability to handle networking much more efficiently, the Universal Node is superior to the current generation of compute nodes used in typical industrial NFV PoCs, both in terms of functionality and performance.

Second, we have shown that the UN can take advantage of a wide range of hardware and software combinations, including low-cost equipment, covering the spectrum from subscriber premises to carrier-grade data centers and the entire range of network location deployment options in between. In particular, the Universal Node functionality has two distinguishing new features when compared to a classic compute blade. As we have seen, UNs can run on multiple hardware platforms, such as ARM, PPC and x86, and there are no real limitations that prevent the Universal Node from running on MIPS or even on more specialized hardware platforms. Moreover, a UN can run functions in multiple execution environments, ranging from really lightweight "switch-based" or "native" execution via containers to fully-fledged virtual machines, with the full set of virtual ports and links to fine-tune performance if needed. Using the technologies researched and developed in UNIFY, all of these can be taken into consideration during orchestration, and the Universal Node can flexibly (and demonstrably) provide the support needed for these decisions.

Third, on a system level and along the lines of the UNIFY vision of a unified production environment, the performance of the Universal Node is superior to that of other current NFV solutions, in part because of the possibility to seamlessly employ, for instance, a high-performance underlying software switch (such as ERFS) as real-world deployment needs dictate, while at the same time retaining all the benefits of domain-oriented orchestration. This allows us to deploy each piece of functionality to the "best" place and chain it into the system with the "best" available link. Within a UN domain, small, frequently-used functions can be implemented as virtual switch actions or plugins, while large, seldom-used or experimental functions, or third-party network applications, can be implemented as software running in virtual machines with all the OS facilities and libraries that are necessary.

This document reported system-level performance results from real-life measurements. For example, a Universal Node acting as a Broadband Network Gateway (BNG), without product-line quality optimization of the implemented NAT VNF, could reach the 6-10 million packets per second (Mpps) range on a single Intel Xeon core. In terms of energy efficiency this translates to a ratio of around 1 Mpps per Watt. We consider this an excellent result for a research prototype. We expect that on other platforms and with more lightweight applications this could be even higher, but this is beyond the scope of UNIFY.

As a closing statement, we conclude that the research work on the Universal Node has successfully bridged the gap between compute and networking, thus meeting one of the key project goals.


References

[1] http://ark.intel.com/products/64593/Intel-Xeon-Processor-E5-2630-15M-Cache-2_30-GHz-7_20-GTs-Intel-QPI. Accessed at 2015.12.09.

[2] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html. Accessed at 2015.12.09.

[3] http://www.7-cpu.com/cpu/SandyBridge.html. Accessed at 2015.12.09.

[4] http://valgrind.org/. Accessed at 2015.12.09.

[5] http://www.agner.org/optimize/microarchitecture.pdf.

[6] http://www.7-cpu.com/cpu/Atom.html. Accessed at 2015.12.09.

[7] http://processors.wiki.ti.com/index.php/Lmbench. Accessed at 2015.12.09.

[8] http://www.7-cpu.com/cpu/Cortex-A15.html. Accessed at 2015.12.09.

[9] http://115.28.165.193/down/arm/arch/ARM_v8_Instruction_Set_Architecture_%28Overview%29.pdf. Accessed at 2015.12.09.

[10] http://infocenter.arm.com/help/topic/com.arm.doc.uan0015a/cortex_a57_software_optimisation_guide_external.pdf. Accessed at 2015.12.09.

[11] http://osrg.github.io/ryu/. Accessed at 2015.12.09.

[12] DPDK.

[13] Google cAdvisor. https://github.com/google/cadvisor.

[14] Open Baton. http://openbaton.github.io/.

[15] OpenMANO. https://github.com/nfvlabs/openmano.

[16] OpenStack. http://www.openstack.org/.

[17] OpenStack Tacker. https://wiki.openstack.org/wiki/Tacker.

[18] TOSCA Simple Profile for Network Functions Virtualization (NFV) Version 1.0. http://docs.oasis-open.org/tosca/tosca-nfv/v1.0/tosca-nfv-v1.0.html.

[19] xdpd. http://www.xdpd.org.

[20] Xen.

[21] Balázs Sonkoly et al. Deliverable 3.4: Prototype deliverable. Technical Report D3.4, UNIFY Project, 2015.

[22] M. F. Bari, S. R. Chowdhury, R. Ahmed, and R. Boutaba. nf.io: A file system abstraction for NFV orchestration. In Network Function Virtualization and Software Defined Network (NFV-SDN), 2015 IEEE Conference on, pages 135–141, Nov 2015.


[23] Cavium. Cavium ThunderX CN88XX Hardware Reference Manual.

[24] I. Cerrato, A. Palesandro, F. Risso, M. Suñé, V. Vercellone, and H. Woesner. Toward dynamic virtualized network services in telecom operator networks. Computer Networks, 92:380–395, 2015.

[25] A. Csoma, B. Sonkoly, L. Csikor, F. Németh, A. Gulyás, W. Tavernier, and S. Sahhaf. Escape: Extensible service chain prototyping environment using mininet, click, netconf and pox. Demonstration. In Proceedings of the 2014 ACM conference on SIGCOMM, pages 125–126. ACM, 2014.

[26] V. A. Cunha, I. D. Cardoso, J. P. Barraca, and R. L. Aguiar. Policy-driven vCPE through dynamic network service function chaining. In 2016 IEEE NetSoft Conference and Workshops (NetSoft), pages 156–160. IEEE, 2016.

[27] S. Dräxler, M. Peuster, H. Karl, M. Bredel, J. Lessmann, T. Soenen, W. Tavernier, S. Mendel-Brin, and G. Xilouris. Sonata: Service programming and orchestration for virtualized software networks. arXiv preprint arXiv:1605.05850, 2016.

[28] European Telecommunication Standards Institute (ETSI). Network Functions Virtualisation.

[29] Felicián Németh, Wolfgang John et al. Deliverable 4.4: Public DevOpsPro code base. Technical Report D4.4, UNIFY Project, 2016.

[30] FIPS PUB 197, Advanced Encryption Standard (AES). National Institute of Standards and Technology, US Department of Commerce, November 2001. http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.

[31] FIPS PUB 198-1, The Keyed-Hash Message Authentication Code (HMAC). National Institute of Standards and Technology, 2007.

[32] S. Frankel, R. Glenn, and S. Kelly. RFC 3602: The AES-CBC Cipher Algorithm and Its Use with IPsec. IETF, September 2003.

[33] P. Gupta. Algorithms for routing lookups and packet classification. PhD thesis, Stanford University, 2000.

[34] Hagen Woesner et al. Deliverable 5.6: Final benchmarking documentation. Technical Report D5.6, UNIFY Project, 2016.

[35] J. Hwang, K. K. Ramakrishnan, and T. Wood. NetVM: High performance and flexible networking using virtualization on commodity platforms. IEEE Transactions on Network and Service Management, 12(1):34–47, March 2015.

[36] S. Intel. Programming reference. Intel's software network, sofwareprojects.intel.com/avx, 2(7), 2007.

[37] Jokin Garay et al. Deliverable 3.3: Revised framework with functions and semantics. Technical Report D3.3, UNIFY Project, 2015.

[38] S. Kelly and S. Frankel. RFC 4868: Using HMAC-SHA-256, HMAC-SHA-384, and HMAC-SHA-512 with IPsec. Technical report, 2007.


[39] S. Kent and K. Seo. RFC 4301: Security Architecture for the Internet Protocol, 2005.

[40] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Trans. Comput. Syst., 18(3):263–297, Aug. 2000.

[41] F. Lucrezia, G. Marchetto, F. Risso, and V. Vercellone. Introducing network-aware scheduling capabilities in OpenStack. In Proceedings of the First IEEE Conference on Network Softwarization (NetSoft 2015), Apr 2015.

[42] J. Martins, M. Ahmed, C. Raiciu, V. Olteanu, M. Honda, R. Bifulco, and F. Huici. ClickOS and the art of network function virtualization. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 459–473, Seattle, WA, 2014. USENIX Association.

[43] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.

[44] L. Molnár, G. Pongrácz, G. Enyedi, K. Zoltán, L. Csikor, F. Juhász, A. Kőrösi, and G. Rétvári. Dataplane specialization for high-performance OpenFlow software switching. In Proceedings of the 2016 ACM conference on SIGCOMM, August 2016.

[45] B. Pfaff, J. Pettit, T. Koponen, K. Amidon, M. Casado, and S. Shenker. Extending networking into the virtualization layer. In Proceedings of the 8th ACM Workshop on Hot Topics in Networks (HotNets-VIII), October 2009.

[46] NIST FIPS PUB 180-2, Secure Hash Standard. National Institute of Standards and Technology, US Department of Commerce, DRAFT, 2004.

[47] L. Rizzo and G. Lettieri. VALE, a switched ethernet for virtual machines. In Proceedings of the 8th international conference on Emerging networking experiments and technologies, CoNEXT '12, pages 61–72, New York, NY, USA, 2012. ACM.

[48] S. V. Rossem, W. Tavernier, B. Sonkoly, D. Colle, J. Czentye, M. Pickavet, and P. Demeester. Deploying elastic routing capability in an SDN/NFV-enabled environment. In Network Function Virtualization and Software Defined Network (NFV-SDN), 2015 IEEE Conference on, pages 22–24, Nov 2015.

[49] F. Sánchez and D. Brazewell. Tethered Linux CPE for IP service delivery. In Network Softwarization (NetSoft), 2015 1st IEEE Conference on, pages 1–9, April 2015.

[50] A. Sapio, M. Baldi, and G. Pongracz. Cross-Platform Estimation of Network Function Performance. In Software Defined Networks (EWSDN), 2015 Fourth European Workshop on, pages 73–78. IEEE, 2015.

[51] W. Shen, M. Yoshida, T. Kawabata, K. Minato, and W. Imajuku. vConductor: An NFV management solution for realizing end-to-end virtual network services. In Network Operations and Management Symposium (APNOMS), 2014 16th Asia-Pacific, pages 1–6, Sept 2014.


[52] J. Soares, M. Dias, J. Carapinha, B. Parreira, and S. Sargento. Cloud4NFV: A platform for virtual network functions. In Cloud Networking (CloudNet), 2014 IEEE 3rd International Conference on, pages 288–293, Oct 2014.

[53] J. Soares, C. Gonçalves, B. Parreira, P. Tavares, J. Carapinha, J. P. Barraca, R. L. Aguiar, and S. Sargento. Toward a telco cloud environment for service functions. IEEE Communications Magazine, 53(2):98–106, Feb 2015.

[54] B. Sonkoly, J. Czentye, R. Szabo, D. Jocha, J. Elek, S. Sahhaf, W. Tavernier, and F. Risso. Multi-domain service orchestration over networks and clouds: a unified approach. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pages 377–378. ACM, 2015.

[55] T. Su, A. Lioy, and N. Barresi. Trusted computing technology and proposals for resolving cloud computing security problems. In Cloud Computing Security: Foundations and Challenges, in press.

[56] W. Zhang, G. Liu, W. Zhang, N. Shah, P. Lopreiato, G. Todeschi, K. K. Ramakrishnan, and T. Wood. OpenNetVM: Flexible, high performance NFV (demo). In 2016 IEEE NetSoft Conference and Workshops (NetSoft), pages 359–360, June 2016.
