Real-time monitoring of microservices-based software systems

Scuola Politecnica e delle Scienze di Base
Master's Degree in Computer Engineering

Master's Thesis in Real-Time Systems

Academic Year 2016/2017

Supervisor: Prof. Marcello Cinque
Co-supervisor: Ing. Raffaele Della Corte
Candidate: Raffaele Iorio, student ID M63/000492


Index

Introduction
Chapter 1: The world of microservices
  1.1 Before the microservices
  1.2 Microservices
    1.2.1 Communication between microservices
Chapter 2: How to monitor microservices?
  2.1 Real-time log analysis
    2.1.1 ELK Stack
  2.2 Consideration on the state of the art
Chapter 3: A new approach
  3.1 Rule-based logging
  3.2 Sniffing or proxy?
  3.3 What information do I need in the log?
Chapter 4: MetroFunnel
  4.1 Software Architecture
  4.2 Software workflow
  4.3 Docker version
    4.3.1 Image and Dockerfile
  4.4 Elastic stack configuration
  4.5 User and operating manual
Chapter 5: Comparison
  5.1 Clearwater IMS
    5.1.1 Clearwater-live-test
    5.1.2 Basic operation
  5.2 Testbed
  5.3 Performance Analysis
    5.3.1 Log Size
    5.3.2 Data incoming
    5.3.3 Execution time
    5.3.4 Bandwidth
  5.4 Failure Analysis
    5.4.1 Failure 503
    5.4.2 Failure 181
    5.4.3 Failure 502 – Homestead-prov (forced kill)
    5.4.4 Failure 502 – Homer (forced kill)
    5.4.5 Failure 504 – Overload
    5.4.6 Further considerations on failure analysis
Conclusions
Future developments
Bibliography


Introduction

Monitoring is one of the most widespread practices for evaluating the operating status of a
system. However, progress in development techniques is not always matched by comparable
progress in monitoring techniques.

This is the case for microservices and their monitoring.

Microservices, a recent evolution of service-oriented architecture (SOA), first appeared in
some studies around 2007/08 and became widespread in 2014/15.

Although a few years have passed since they became established, there is still no tool designed
specifically to monitor microservices; all that is available are various tools and techniques
that can be adapted to them.

This adaptation means that it is not always possible to obtain the right information from
monitoring; furthermore, considerable configuration is necessary before monitoring can
actually start.

The idea pursued in this work is to make available to developers and the community a tool
specific to microservices that is easy to install, transparent and non-intrusive.

This was made possible by taking the best of the current solutions without inheriting their
faults; moreover, the principles of rule-based logging, an effective technique for log
generation, have been applied.

The result is MetroFunnel, a monitoring tool designed specifically for microservices.



Subsequently, to validate the project, a test application, Clearwater IMS, was chosen, and
MetroFunnel was compared on it with the various techniques and tools available.

We first assessed the performance impact of MetroFunnel compared to the other solutions, and
then the number of failures detected by each solution.

As will be seen, MetroFunnel achieves excellent results from many points of view, at a
minimal performance cost that in some respects is negligible.

Therefore, the proposed solution is ready to be used as a valid tool for monitoring
microservices and, at the same time, can serve as a starting point for future developments.

The first chapter starts with a little history and explains why microservices were born.

Chapter 2 illustrates the techniques available today for monitoring microservices, showing
their strengths and weaknesses and, above all, what is currently missing.

Chapter 3 presents the idea behind MetroFunnel, which is to generate logs through packet
sniffing, exploiting the principles of rule-based logging.

Chapter 4 describes the architecture of MetroFunnel and its behavior, with various examples of
logs and of the tool in operation.

Finally, Chapter 5 presents a comparison of the different solutions.



Chapter 1: The world of microservices

This chapter is an introduction to the world of microservices; we start with a bit of history,
showing what the architectural models of applications were and how they evolved up to the
birth of microservices.

Then we go into more detail, explaining the operating principles of microservices and the
various development techniques, and showing their advantages and disadvantages.

1.1 Before the microservices

In the beginning, there was the "monolith": applications developed and distributed as a
single entity. These monolithic applications are easy to implement because they have only
one code base, typically gathered in a single project that is distributed within a single
package. Another possible definition is: a software application whose modules cannot be
executed independently. This definition explains a critical weakness of monoliths: they are
very difficult to use in distributed systems without an appropriate framework or an ad-hoc
solution [4], [15].

This type of architecture lends itself well to small applications, or in any case to
applications not subject to change, but it is problematic when developing complex and rapidly
evolving applications.

The disadvantages of monolithic applications quickly became clear to developers, and
therefore a new architectural model was needed. It started with the principle of
decomposition, and in particular with logical decomposition, which allowed more efficient
scalability.



This architecture, called multi-tier, is generally made up of a data layer, a business logic
layer, and a presentation layer, dividing the program according to the tasks it has to manage.
The data layer takes care of storage, the presentation layer takes care of the interaction
with the user, and the business logic layer contains all the information processing.

The next step was to break down applications based on business functionality (for example,
to manage an inventory: insert items, check availability, etc.) rather than by a stack-level
division as in multi-tier. An application is seen as a collection of services, or
functionalities, and thus the first applications based on service-oriented architecture (SOA)
were developed. The advantages of an SOA are [1], [4]:

• Dynamism - New instances of the same service can be launched to split the load on the system;

• Modularity and reuse - Complex services are composed of simpler ones. The same services can be used by different systems;

• Distributed development - By agreeing on the interfaces of the distributed system, distinct development teams can develop partitions of it in parallel;

• Integration of heterogeneous and legacy systems - Services merely have to implement standard protocols to communicate.

Precisely this last point was one of the determining factors in the birth of microservices.

As noted above, connecting different and heterogeneous systems required ad-hoc
communication mechanisms; usually, a proprietary Enterprise Service Bus (ESB) product is
used for this task. This proprietary product hides within itself the complexity of
coordinating the various components. Over time, changing the configuration of the ESB becomes
increasingly difficult, and developers tend to stop using it and to make the components call
each other directly, reintroducing the problems that the ESB was meant to solve [5].

Therefore, it was necessary to have independent, easily modifiable and replaceable services,

which communicate in a simple way.



1.2 Microservices

The development of the microservices architecture is closely linked to the rise of the
DevOps and Agile methodologies.

DevOps, a blend of "development" and "operations", is a software development methodology
that focuses on communication, collaboration and integration between developers and IT
operations staff.

One proposed definition is [6]:

DevOps is a set of practices intended to reduce the time between committing a change to a

system and the change being placed into normal production, while ensuring high quality.

Among the practices promoted by Agile methods are the formation of small, cross-functional
and self-organized development teams, iterative and incremental development, adaptive
planning, and the direct and continuous involvement of the client in the development process.

SOAs, as they were constituted, were not sufficient, because there was high coupling
between the various services (coupling indicates the degree to which each component of a
software system depends on the other components), which prevented the independent
development of services and the rapid release of new software versions required by the
two methodologies.

The services were then broken down not only on the basis of their business functionality,
but also according to the Single Responsibility Principle (SRP). According to the SRP,
elements that change for the same reason should be gathered together, and elements that
change for different reasons should be kept separate.

Microservices follow the same principle: they group operations that do the same things
and make them independent of the others. A microservice must be releasable completely
independently of the others and, above all, in a way that is completely transparent to
its consumers.



The main features of a microservice are:

• Small Size

• Independence

• Flexibility

• Modularity

The size of a microservice is generally smaller than that of a service, but the word "micro"
must not mislead; a microservice must be not only small but, above all, simple, in the spirit
of the UNIX philosophy [7]:

Write programs that do one thing and do it well. Write programs to work together. Write

programs to handle text streams, because that is a universal interface.

An application must be able to keep up with the ever-changing business environment and to
support all the modifications necessary for an organisation to stay competitive on the
market; therefore, microservices must be flexible and independent of each other.

Moreover, an application must be modular, composed of isolated components, where each
component contributes to the overall system behaviour.

It was then necessary to find a simple and effective method of communication for
microservices that overcame the limits and problems faced by SOAs.

As often happens, it is easier to adopt a methodology that already works than to develop a
new one; this explains why HTTP and REST are so closely linked to microservice architectures.

1.2.1 Communication between microservices

Various studies discuss techniques for communication between microservices; however, all
agree that they must be simple, light and efficient [5], [15]. For example, it is possible to
use shared storage (a database), message exchange, or specific ad-hoc protocols. The most
important thing to focus on is maintaining the independence of each microservice from the
others.

Database sharing is the simplest communication method: it is easy and, above all, quick to
implement, but it has a huge flaw: all microservices need to know the database
implementation. Moreover, a change to the database made to implement a new microservice must
not affect the other microservices, which makes the microservices dependent on each other.

As with SOAs, implementing an ad-hoc protocol or bus leads, as the number of microservices
grows, to a considerable increase in the difficulty of implementing and maintaining the bus,
thus hindering the development of new microservices.

The message exchange model is based on two simple messages, request and response: a
client sends a request and waits for a response from the server.

There are two possible ways to implement the request/response model: RPC (Remote
Procedure Call) and REST (REpresentational State Transfer).

In the RPC model, a call to what looks like a local function is executed on a remote
service whose location is not necessarily known.

There are several types of RPC-based technologies: some of them expose a separate interface
definition that makes communication between clients and servers easier, even if they are
built with different technologies.

Some examples are SOAP (Simple Object Access Protocol), which is a protocol specification
for exchanging structured information, and WSDL (Web Services Description Language),
which is a formal XML-based language used to create "documents" that describe Web Services.

Using WSDL, the public interface of a Web Service can be described, indicating how to
interact with a specific service.

The disadvantage is the high coupling between client and server, due precisely to the WSDL
model. Each change to the server requires regenerating the WSDL model, which the client must
then read again in order to use the new implementation.

REST is an architectural style inspired by the web, which exploits what the web itself

already exposes; the main difference, compared to RPC, is the concept of resource.



RPC exposes services, while REST exposes "resources". In the context of microservices, a
resource is something that a microservice knows well, such as the "Customer" or the
"Article". However, the application must know the format of the returned information (the
representation), typically an HTML, XML or JSON document (JSON, JavaScript Object Notation,
is a format suited to data interchange between client-server applications), but it could
also be an image or any other content.

Another principle introduced by REST, useful for the development of totally decoupled
services, is HATEOAS (Hypermedia As The Engine Of Application State). This principle
states that in a REST application, the client needs to know very little about the application
in order to use it; ideally, the only thing it needs to know is the entry-point Uniform
Resource Identifier (URI).

A URI is a sequence of characters that uniquely identifies a generic resource. Examples of
what a URI can identify are: a web address (URL), a document, an image, a file, a service,
an e-mail address, etc.
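
The HATEOAS principle described above can be illustrated with a minimal sketch. The dictionary `FAKE_SERVER`, the `links` field and the function `follow` are hypothetical names introduced only for this example (a real client would retrieve each representation over HTTP); the point is that the client hard-codes only the entry URI and discovers every other URI from the links contained in the representations.

```python
# Hypothetical representations keyed by URI; a real client would fetch them over HTTP.
FAKE_SERVER = {
    "/api": {"links": {"customers": "/api/customers"}},
    "/api/customers": {
        "items": ["/api/customers/1"],
        "links": {"first": "/api/customers/1"},
    },
    "/api/customers/1": {"name": "Ada", "links": {"collection": "/api/customers"}},
}


def follow(entry_uri, relations):
    """Start from the entry URI and follow the named links one after another."""
    document = FAKE_SERVER[entry_uri]
    for relation in relations:
        # The next URI comes from the current representation, not from the client code.
        document = FAKE_SERVER[document["links"][relation]]
    return document


customer = follow("/api", ["customers", "first"])
```

If the server later moved the customers under a different path, only the links in the representations would change, and a client written this way would keep working.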

The server creates different representations of the resource; however, the way the resource
is exposed externally is completely separate from the way it is stored internally. The
protocol most used to implement REST is HTTP, but it is not necessarily the only one
supported.

In any case, the methods exposed by the HTTP protocol match the REST style well, because all
the CRUD operations (Create, Read, Update and Delete) can be performed on resources by means
of different types of HTTP request, such as:

• GET: used to read the status of a resource

• POST: used to create a resource

• PUT: used to modify a resource

• DELETE: used to delete a resource.

Let us now see, through some examples, how HTTP methods relate to resources.

Suppose we have a "test" application that manages the "users" resource, allowing consumers
to create, read, modify and delete users; some operations on the resource are:



1. GET /test/users/1

2. GET /test/users

3. POST /test/users/2

4. DELETE /test/users/3

In the first example, the URI "/test/users/1" indicates the resource, which in this case is
the user with identifier 1 of the "test" application.

In the second example, we instead request a list of all the resources contained in the
collection identified by the URI "/test/users/".

In example 3, we want to create the resource identified by the URI "/test/users/2".

In example 4, we want to delete the resource identified by the URI "/test/users/3".
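
The mapping between HTTP methods and operations on the "users" resource can be sketched with a small in-memory handler. This is only an illustration: the class `UserStore`, its `handle` method and the status codes returned are assumptions made for the example and do not come from any application described in this work.

```python
from typing import Dict, Optional, Tuple


class UserStore:
    """In-memory stand-in for the "users" resource of the "test" application."""

    def __init__(self) -> None:
        self.users: Dict[str, dict] = {}

    def handle(self, method: str, path: str, body: Optional[dict] = None) -> Tuple[int, object]:
        parts = [p for p in path.split("/") if p]
        if parts[:2] != ["test", "users"] or len(parts) > 3:
            return 404, None
        user_id = parts[2] if len(parts) == 3 else None

        if method == "GET" and user_id is None:
            return 200, sorted(self.users)            # list the collection
        if method == "GET":
            if user_id in self.users:
                return 200, self.users[user_id]       # read one resource
            return 404, None
        if method == "POST" and user_id is not None:
            if user_id in self.users:
                return 409, None                      # already exists
            self.users[user_id] = body or {}          # create
            return 201, self.users[user_id]
        if method == "PUT" and user_id in self.users:
            self.users[user_id] = body or {}          # modify
            return 200, self.users[user_id]
        if method == "DELETE" and user_id in self.users:
            del self.users[user_id]                   # delete
            return 204, None
        # Anything else is treated as "not found" in this sketch.
        return 404, None
```

For instance, `UserStore().handle("GET", "/test/users")` returns the status code 200 together with an empty list, since no user has been created yet.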

Below, a typical architecture based on microservices is shown.

Figure 1 - Microservices architecture



Chapter 2: How to monitor microservices?

Having seen how microservices work, we can now analyse the techniques and tools that can
monitor them. Currently, the monitoring of microservices can be divided according to various
aspects and application domains.

For example, tools are available to monitor microservices during development, in the
validation and design phase (examples of API validation tools are Postman [22], Apiwatcher
[23], etc.); they verify that the response to a microservice request is the expected one by
breaking down the message and its payload and comparing them with the expected ones.
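
The kind of comparison performed by such tools can be sketched as follows. The function `validate_response` and the dictionary layout (`status`, `body`) are hypothetical and are not meant to reflect how Postman or Apiwatcher work internally; the sketch only shows the idea of checking a response against an expected one, field by field.

```python
def validate_response(actual: dict, expected: dict) -> list:
    """Return a list of human-readable mismatches between two responses."""
    problems = []
    if actual.get("status") != expected.get("status"):
        problems.append(
            f"status: expected {expected.get('status')}, got {actual.get('status')}"
        )
    actual_body = actual.get("body", {})
    # Only the fields listed in the expected body are checked; extra fields are ignored.
    for key, value in expected.get("body", {}).items():
        if key not in actual_body:
            problems.append(f"body.{key}: missing")
        elif actual_body[key] != value:
            problems.append(f"body.{key}: expected {value!r}, got {actual_body[key]!r}")
    return problems
```

An empty list means that the response matches the expectation.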

Then there are all those tools that allow the monitoring of computers and network resources;
among these, good examples are Nagios [13] and Amazon CloudWatch [20].

Alternatively, there are tools like Hystrix [21] (by Netflix) that, through manually added
code, wrap all calls to external systems (including calls to microservices) in order to
improve control over latency and failures and to stop cascading failures; they also allow the
monitoring of resources.

Finally, there are tools for analysing logs in real time, which extract information from the
logs generated by the application.

As we are interested in monitoring during the normal operation of the application, and not
in the development phase, the API validation tools cannot be used.

Tools like Hystrix allow the monitoring of resources, but only by adding code to the
application; moreover, monitoring must be seen as an additional functionality with respect
to the main purpose of Hystrix, i.e., improving application reliability by preventing
cascading failures from leading to the failure of the entire application.
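
The idea of wrapping calls to external systems can be illustrated with a minimal circuit breaker. This sketch is not the Hystrix API (Hystrix is a Java library); `CircuitBreaker`, `max_failures` and `fallback` are hypothetical names chosen for the example, and a real implementation would also close the circuit again after a recovery timeout.

```python
class CircuitBreaker:
    """Wraps a call to an external system and stops calling it after repeated failures."""

    def __init__(self, call, fallback, max_failures=3):
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures observed so far

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def __call__(self, *args, **kwargs):
        if self.is_open:
            # Circuit open: the failing dependency is not contacted at all.
            return self.fallback(*args, **kwargs)
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            return self.fallback(*args, **kwargs)
        self.failures = 0  # a success resets the counter
        return result
```

Note that the wrapper has to be added around each call in the application code, which is exactly the kind of intrusiveness discussed above; the failure counter it maintains is what makes monitoring available as a side effect.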

Let us move on to tools like Amazon CloudWatch and Nagios.

The first is a commercial tool for collecting and monitoring metrics and log files, setting
alarms and automatically reacting to changes in AWS (Amazon Web Services) resources.

It can be used to achieve system-wide visibility into resource utilization, application
performance, and operational health. The information obtained in this way can be used to
correct the operation of the applications and keep their performance optimal.

However, CloudWatch can only be used for applications running on AWS, and only the
commercial version exists.

Nagios is a valid suite of monitoring tools. In particular, by exploiting the URL monitoring
provided by the Nagios XI program, it is possible to obtain some information on the status of
the microservices (a URL also identifies a resource, and therefore a microservice); it is
also possible to monitor an entire website, obtaining general information on its operating
status.

In the first case, Nagios requires a configuration for each URL to be monitored; since the
number of resources exposed by microservices is considerable, this approach can take a lot of
time in the configuration phase when applied to systems based on microservices. In addition,
every modification requires a new configuration.

In the second case, it is possible to monitor an entire website, but the granularity of the
information at the resource level, and therefore at the microservice level, is lost.

It is important to note that microservices use REST and HTTP to work, but they do not have to
be served through a web server: they can also be served by a different application listening
on certain TCP ports. In these cases, Nagios may not work properly, or may not work at all.

Moreover, Nagios is a very demanding tool in terms of recommended hardware requirements: to
monitor up to 250 services, 40 GB of disk space and 1 to 4 GB of RAM are recommended [13].

Based on these considerations, let us study the main advantages and disadvantages of
real-time log analysis.

2.1 Real-time log analysis

Currently, the best method for monitoring microservices is the real-time analysis of logs.
This is possible thanks to tools that read the log generated by the application, extract the
relevant information and display it on screen. Many applications are available for this
purpose; here we consider those distributed by Elastic [12], and in particular the ELK stack.

2.1.1 ELK Stack

The ELK stack consists mainly of four components, as can be seen in the following figure.

Figure 2 - ELK Stack

Beats is a whole family of programs for data collection; in particular, Filebeat is a
lightweight shipper that sends logs to the next component of the chain. Filebeat is
especially useful in distributed settings, where several physical nodes generate log files.

Logstash is the component of the chain that receives the data, filters them according to a
configuration file, applies any changes to them, and standardizes them according to a
user-defined template.

Elasticsearch is the component that takes care of data storage; it also indexes the data
received, allowing quick access and comparison.

Kibana is the graphical interface for data visualization; through Kibana it is possible to
create graphs from the data received, to show only the data of interest through filtering, to
visualize the trend of some parameters over time, etc.

To date, the ELK stack is one of the best tools for real-time log analysis; however, it is
not always easy to set up, in particular because of the Logstash configuration for data
filtering. Below are two log excerpts from two different applications.

1. [pid: 2177|app: 0|req: 6/8] 127.0.0.1 () {42 vars in 675 bytes} [Fri Mar 3 05:20:00 2017] GET /api/users/3/ => generated 669 bytes in 16 msecs (HTTP/1.1 200) 4 headers in 134 bytes (2 switches on core 0)

2. 10-01-2018 10:33:01.794 UTC INFO homestead.py:272: Sending HTTP PUT request to http://homestead-prov:8889/private/6505550742%40example.com

As can be seen, even though the information they contain is similar (timestamp, request URI,
etc.), filtering them is totally different; to the human eye it may seem easy, but for
automatic filtering through an application like Logstash, the two excerpts require two
different configuration files. For example, an extract of the Logstash configuration file for
filtering the data of the first application is the following [19]:

filter {
  grok {
    match => { "message" => "\[pid: %{NUMBER:pid}\|app: %{NUMBER:id}\|req: %{NUMBER:currentReq}/%{NUMBER:totalReq}\] %{IP:remoteAddr} \(%{WORD:remoteUser}?\) \{%{NUMBER:CGIVar} vars in %{NUMBER:CGISize} bytes\} %{SYSLOG5424SD:timestamp} %{WORD:method} %{URIPATHPARAM:uri} \=\> generated %{NUMBER:resSize} bytes in %{NUMBER:resTime} msecs \(HTTP/%{NUMBER:httpVer} %{NUMBER:status}\) %{NUMBER:headers} headers in %{NUMBER:headersSize} bytes %{GREEDYDATA:coreInfo}" }
  }
}
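
To make the point concrete outside Logstash, the sketch below extracts the HTTP method and the URI from the two excerpts above using plain Python regular expressions. The two patterns and the field names (`method`, `uri`, `status`) are assumptions written for these two sample lines only; they are not taken from the Logstash configuration, and they show that each log format needs its own dedicated pattern.

```python
import re

FIRST_LINE = (
    "[pid: 2177|app: 0|req: 6/8] 127.0.0.1 () {42 vars in 675 bytes} "
    "[Fri Mar 3 05:20:00 2017] GET /api/users/3/ => generated 669 bytes "
    "in 16 msecs (HTTP/1.1 200) 4 headers in 134 bytes (2 switches on core 0)"
)
SECOND_LINE = (
    "10-01-2018 10:33:01.794 UTC INFO homestead.py:272: Sending HTTP PUT "
    "request to http://homestead-prov:8889/private/6505550742%40example.com"
)

# One pattern per log format: neither pattern matches the other application's lines.
FIRST_PATTERN = re.compile(
    r"\] (?P<method>[A-Z]+) (?P<uri>\S+) => .*\(HTTP/[\d.]+ (?P<status>\d{3})\)"
)
SECOND_PATTERN = re.compile(r"Sending HTTP (?P<method>[A-Z]+) request to (?P<uri>\S+)")


def parse(line):
    """Try each known format in turn; return the extracted fields or None."""
    for pattern in (FIRST_PATTERN, SECOND_PATTERN):
        match = pattern.search(line)
        if match:
            return match.groupdict()
    return None
```

Every new application with a different log format would require adding one more pattern to this list, which mirrors the per-application configuration effort described above.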

Furthermore, it is important to consider that in this case both logs have a well-defined
data format; in all those cases where there is no standard for writing the log, or the
application does not generate a log file at all, monitoring microservices through real-time
log analysis is very difficult, if not impossible.

2.2 Consideration on the state of the art

As this chapter has shown, tools for monitoring microservices exist; however, they differ in
ease of implementation and in the information that can be obtained. With tools like Nagios,
it is simple to obtain information on how a website works, but not with a granularity at the
level of the single microservice; vice versa, it is possible to obtain information on the
single microservice, but this requires considerable configuration.

With real-time log analysis, much more information is available, but it requires a great
deal of configuration that is strictly dependent on the application to be monitored.

What is missing is a monitoring tool for all microservices-based applications that provides
a good amount of information and, at the same time, is easy to implement.

Among the existing methods and tools, real-time log analysis is the one that promises the
best results. However, we must find a monitoring approach that generates logs that are easy
to interpret and that is transparent to the application, without changing its behaviour.

To remain transparent to the application, we cannot rely on frameworks or libraries based on
code instrumentation, which add, during the development phase of the application, annotations
for generating logs that are easy to interpret.

Furthermore, there is no single implementation of microservices; we have seen that the main
method is through the use of REST, but various frameworks can be used to develop a REST API:
for example Spring or JAX-RS in Java, or PHP code with an Apache web server, and so on.

However, all these applications have in common the use of HTTP for the exchange of
messages; therefore, we must find a method to generate logs starting from the messages
exchanged through HTTP, so as to obtain useful information for microservices monitoring.



Chapter 3: A new approach

As explained in the previous chapter, we want to provide a monitoring approach that is
completely transparent to the application and easy to use.

To do this, we chose to generate logs starting from the messages exchanged. When
applications receive requests and send replies, they all rely on the HTTP protocol and, from
this point of view, are all the same; it is therefore possible to generate a generic log for
any type of application that uses microservices.
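
The reason a generic log is possible can be shown with a short sketch: in HTTP/1.x every request starts with a request line and every response with a status line, so a single parser can turn any captured message into a uniform record. The function `parse_start_line` and the record layout are assumptions introduced for illustration; they are not the log format used by MetroFunnel.

```python
def parse_start_line(raw: bytes) -> dict:
    """Parse the first line of an HTTP/1.x message into a generic record."""
    start_line = raw.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    first, _, rest = start_line.partition(" ")
    if first.startswith("HTTP/"):
        # Status line: HTTP-version SP status-code SP reason-phrase
        status, _, reason = rest.partition(" ")
        return {"type": "response", "version": first, "status": int(status), "reason": reason}
    # Request line: method SP request-target SP HTTP-version
    target, _, version = rest.rpartition(" ")
    return {"type": "request", "method": first, "target": target, "version": version}
```

The same function handles a message produced by any framework or language, because it depends only on the protocol and not on the application that generated the message.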

After explaining the criterion by which the exchange of messages can be associated with the
functioning of the microservices, we show how the exchanged messages can be captured, either
by packet sniffing or by using a proxy.

After that, the issues concerning the information needed in a log are addressed.

3.1 Rule-based logging

We have to relate the transit of network packets to whether or not the microservices are
functioning; by capturing the exchanged messages, we obtain information about the HTTP
requests and responses of the applications, but these must be properly analysed in order to
understand whether the monitored microservices are working properly.

The proposed approach is inspired by rule-based logging (described in the study [3]), which
makes logs effective for the analysis of software failures.

This study shows the main patterns used within the code for the generation of error logs,
such as the construct if (condition) then log_error(), which is the most used pattern among
the various applications under examination.



These patterns are, in most cases, inserted at the end of the development cycle, often without knowledge of the program structure. As a result, failures occurring in the system are often not reported in the logs. In fact, the study shows that "around 60% of failures caused by software faults go unreported by current logging mechanisms".

The approach proposed in the study is based on rules that define what to write and where to write it.

Rather than giving the application the task of both detecting the occurrence of a failure and reporting it in the log, the approach aims to detect failures at a later time, through the analysis of the information written in the logs.

One of the proposed rules, for example, verifies the correct execution of a service by logging the start of the service as its first line of code and the end of the service as its last line of code. These two rules are identified in the study as LR-1, Service STart (SST), and LR-2, Service ENd (SEN).

In doing so, if analysis of the log reveals a different number of SST and SEN entries, you can identify the cases in which the service terminated before completing correctly, whether due to an error, an unexpected exception, termination of the whole application or a timeout.

With the current logging mechanisms, instead, a failure may go unrecorded, because the application may terminate before the information has actually been written to the log.
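The SST/SEN comparison can be illustrated with a minimal Java sketch. It is not taken from the study [3]: the log line format ("SST <service>" and "SEN <service>"), the class name and the method name are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checker; the "SST <service>" / "SEN <service>" line format is assumed. */
class SstSenChecker {

    /** Returns, for each service, the difference between SST and SEN entries (only if non-zero). */
    static Map<String, Integer> unmatchedStarts(List<String> logLines) {
        Map<String, Integer> open = new HashMap<>();
        for (String line : logLines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // not an SST/SEN entry
            }
            if (parts[0].equals("SST")) {
                open.merge(parts[1], 1, Integer::sum);
            } else if (parts[0].equals("SEN")) {
                open.merge(parts[1], -1, Integer::sum);
            }
        }
        // Keep only the services whose starts and ends do not balance.
        open.values().removeIf(count -> count == 0);
        return open;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("SST getUser", "SEN getUser", "SST saveUser");
        // saveUser was started but never reached its end.
        System.out.println(unmatchedStarts(log));
    }
}
```

A non-empty result points to services that started but did not terminate correctly; it says nothing about the cause of the failure, which is the limitation of the rule-based approach.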

As can be read at the end of the study, the rule-based approach detects 94% of failures against 34% for the best practices in use, but it makes the cause of a failure harder to identify, because the log lacks details such as “file could not be opened, a service was invoked with bad parameters, etc.”.

This approach can be useful for our purpose, but how can we use it without modifying the application by inserting additional code?

Based on the design and behaviour of microservices, the following assumption can be made: the arrival of an HTTP request packet can be treated as the start of a service, while the reply sent can be treated as the end of the service.

In this way we can adopt the same working principle as rule-based logging, but without any modification to the application. The approach proposed in this thesis leverages this assumption to create an ad-hoc log containing all the information needed to monitor microservices.

3.2 Sniffing or proxy?

Two approaches are possible to capture the packets exchanged on the network: using a proxy or sniffing. By proxy we mean an application that acts as an intermediary between the requests of the clients and the resources on the server; in our case the message must not be modified in any way, and only the information it carries is read.

Sniffing is the passive interception of data passing through the network; neither the server nor the client is aware of the presence of a sniffer.

Both solutions are valid, but both have weaknesses. With packet sniffing there is a risk of packet loss (being passive, the sniffer cannot always read all the information in a packet in time, or capture all the packets in transit), which creates false positives: if the response is not intercepted, then under the rule-based logging scheme of the previous section no response can be associated with the request, and the request is labelled as an incorrect execution of the microservice. The proxy, on the other hand, requires a minimum of application-dependent configuration, as it needs to know which TCP ports to forward the traffic to. However, this application-dependent configuration is not the only reason that led us to discard the proxy. An in-depth analysis highlighted other shortcomings, which are detailed below.

First, the use of a proxy creates a single point of failure: every effort to make the application robust and scalable is in vain, because everything depends on a single node, the proxy itself. Second, as the load increases, performance becomes strictly dependent on the proxy, with no possibility of tuning it or using workarounds. Moreover, even simply switching the monitoring on and off is no longer trivial, since in both cases the nodes need to be reconfigured appropriately.

Creating a highly reliable and scalable proxy would have been very expensive, both in working hours and in final cost; an excellent but expensive product reduces the slice of the market that can be targeted, which could lead the project to failure.

With packet sniffing we have the potential problem of packet loss (although, as we will see later, with a minimal additional CPU effort there are no losses and therefore no false positives), but we gain a simpler use of the monitoring tool and, above all, a definitely lower cost.

At this point we analyse what information can be drawn from the packets in transit on the network and can be useful for monitoring the microservices.

3.3 What information do I need in the log?

In summary, we have found a method to monitor a microservices-based application transparently, without having to modify it or even know its implementation details and its operation.

Now we need to understand what information can be useful for monitoring.

First of all, we must be able to identify the microservice; this is possible because this information corresponds to the method and URL fields of the HTTP request header.

Next, we need to know the result of the request, so we pick up the Response Code field in the HTTP response header.

We may also want to distinguish between source and destination nodes, so we need the TCP and IP headers. In particular, from the TCP header we pick up the Source Port and Destination Port fields; from the IP header we take the Source Address and Destination Address fields.

This is the minimum information we need. We could also capture the additional data sent via JSON, XML and so on, for a better understanding of a possible failure, but at this stage of development and in this version of the application this is not provided.
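As an illustration of where these fields sit in an HTTP message, the following Java sketch extracts the method and URL from a request line and the response code from a status line. It is a simplified example and not the MetroFunnel code: the class and method names are hypothetical, and the IP and TCP fields, which come from the lower-layer headers, are not covered.

```java
/** Hypothetical helper: reads the monitoring fields from the first line of an HTTP message. */
class HttpStartLine {

    /** Request line "GET /test/users/1 HTTP/1.1" gives {"GET", "/test/users/1"}. */
    static String[] methodAndUrl(String requestLine) {
        String[] parts = requestLine.trim().split(" ");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not an HTTP request line: " + requestLine);
        }
        return new String[] { parts[0], parts[1] };
    }

    /** Status line "HTTP/1.1 200 OK" gives 200. */
    static int responseCode(String statusLine) {
        String[] parts = statusLine.trim().split(" ");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not an HTTP status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }

    public static void main(String[] args) {
        String[] request = methodAndUrl("GET /test/users/1 HTTP/1.1");
        System.out.println(request[0] + " " + request[1]);
        System.out.println(responseCode("HTTP/1.1 200 OK"));
    }
}
```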

One aspect of monitoring is the performance evaluation of microservices, so we need to know the execution time of the microservice. To do this we compare the arrival time of the request packet with the instant at which the response packet is ready to be sent; the difference between these times can be taken, with a small approximation, as the execution time of the microservice. Note that, since both the request and the response are captured on the server, all delays due to transmission, retransmissions and network traffic are already excluded.

To evaluate performance it can also be useful to know how many other requests the microservice or the application is handling at the same time. We have therefore added two fields to the log, which represent, respectively, the number of requests pending at the instant the request arrives and the number of requests pending at the instant the reply is sent.

For quick filtering of the information it is useful to have an alert level in the log, so this is the last field of the log.

At this point all that remains is to create an application that tracks the requests received and the replies sent, and that generates a log with the fields identified above.

For quick access to the information, and to import it easily into third-party programs such as Logstash, Elasticsearch and Kibana for viewing, or into programs for statistical analysis such as JMP, we chose a simple data format based on CSV.

Since we will certainly have simultaneous connections to the same microservice, we want both the request and the response on a single line, so that unanswered requests stand out immediately, without having to search the log to work out which response corresponds to which request. This also avoids repeating information, since IP addresses and TCP ports are simply swapped between request and response. This is, however, an implementation detail that will be analysed in the next chapter.

So our log line, representing the beginning and end of a microservice execution, has the following fields:

Method, Url, IP source, TCP port source, IP Destination, TCP port destination, Response code, Duration, Pending request at the beginning, Pending request at the end, Info.

For example:

GET, /test/users/1, 127.0.0.1, 46594, 127.0.0.1, 8080, 200, 2.312776, 1, 0, Request – Response
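The eleven-field line above can be produced and read back with a few lines of Java. The sketch below is only an illustration of the format: the class name, the method names and the parameter types are assumptions, and the unit of the Duration field is left as in the example because the text does not state it.

```java
import java.util.Locale;

/** Hypothetical formatter for the eleven-field log line described in the text. */
class LogLine {

    static String format(String method, String url,
                         String srcIp, int srcPort, String dstIp, int dstPort,
                         int responseCode, double duration,
                         int pendingAtStart, int pendingAtEnd, String info) {
        // %s on the duration keeps the plain decimal form, for example 2.312776.
        return String.format(Locale.ROOT, "%s, %s, %s, %d, %s, %d, %d, %s, %d, %d, %s",
                method, url, srcIp, srcPort, dstIp, dstPort,
                responseCode, duration, pendingAtStart, pendingAtEnd, info);
    }

    /** Splits a log line back into its eleven fields (assumes no comma inside the URL). */
    static String[] parse(String line) {
        return line.split(",\\s*", 11);
    }

    public static void main(String[] args) {
        String line = format("GET", "/test/users/1", "127.0.0.1", 46594,
                "127.0.0.1", 8080, 200, 2.312776, 1, 0, "Request \u2013 Response");
        System.out.println(line);
        System.out.println(parse(line)[6]); // the Response code field
    }
}
```

Keeping the request and the response on one line means that a consumer such as Logstash only has to split on the separator and convert the numeric fields.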



Chapter 4: MetroFunnel

Having seen in the previous chapter, at a high level, the working principle of this new approach, we now go into more detail and introduce MetroFunnel. This application, developed in Java, analyses the packets passing on the network, filters those with HTTP headers and extracts the information described in the previous chapter to generate a log. Using this log, built on rule-based logging, it is possible to perform a real-time analysis with the ELK stack presented in chapter 2, in order to monitor the microservices and detect any failures.

Before developing the application, we searched for an existing product that was similar, or that could be adapted, to extract the information described in the previous chapter from HTTP packets. The result was negative: none allowed us to do what we wanted, and the necessary changes would have been difficult to implement. For this reason, we chose to create a project from scratch.

The architecture of MetroFunnel is presented first; we then describe how it works, with the help of some examples. After that we describe how the Docker version was made and how the ELK stack is configured, and we close with a short operating manual describing the various phases of operation.

4.1 Software Architecture

MetroFunnel is a multithreaded Java application that can analyse the traffic exchanged on several interfaces simultaneously; a thread is instantiated for each open interface. During start-up it is possible to customise the capture of packets and the detection of requests without a response, by entering the TCP ports to be analysed and setting the timeout. Note that the filter on TCP ports acts passively, checking the source and destination TCP port numbers of the packets; it does not interfere in any way with the open sockets.

The project is divided into two parts: PacketSniffer and LogManagement.

PacketSniffer is the package responsible for capturing packets, filtering them on the TCP port if a filter is set, and saving the fields related to requests and responses.

It consists of a single file, Sniffer.java, and is the core of the application; the sniffer operates in promiscuous mode, so it analyses all the packets passing on the network and not just those addressed to the open interface.

We looked for frameworks and libraries that could help in developing the sniffer, and found the following libraries useful for our purpose:

• Pcap4J

• jpcap

• jNetPcap

All are Java wrappers of the libpcap/WinPcap library, which provide objects with methods and attributes for rapid project development. The choice among them was therefore based on functionality, available versions, documentation, etc.

In the end we chose jNetPcap, as it is the most complete with respect to the aspects listed above.

In addition, it uses JNI (Java Native Interface), while its competitors use JNA (Java Native Access), which allows jNetPcap to perform better than the others.

Moreover, jNetPcap is available both in an open-source version, which is the one we use, and in a commercial version that promises better performance and technical support, which could be used in the future for an improved, commercial version of MetroFunnel. We use version 1.4r1425 for 64-bit Linux, together with the Java SE JDK 8u151. Because it uses the libpcap library through its wrapper, MetroFunnel needs root privileges to work.



LogManagement is the package responsible for writing the log; it consists of two files:

1. EventLog.java

2. EventLogManager.java

EventLog, which corresponds to a single event to be recorded in the log, includes the data described in chapter 3, relating to both the request and the response.

EventLogManager, on the other hand, takes care of physically writing to file, and also provides the methods for matching requests with responses and for checking the timeout.

Since MetroFunnel is developed as a multithreaded application, the only method running continuously is the run method of the Sniffer class; the other methods of the class relate to the configuration phase during start-up.

The EventLogManager class has methods to create an event, search for an event, insert a response, check the timeout and print to file.

The EventLog class, in addition to the constructor, has methods for inserting the response and for the request/response association.

The architecture is as follows:



Figure 3 - MetroFunnel architecture



4.2 Software workflow

The following activity diagram shows the operation of MetroFunnel.

Figure 4 - MetroFunnel Activity



For every packet that arrives, MetroFunnel checks whether it has an HTTP header; if not, the packet is discarded. If it does, MetroFunnel saves the information from the headers below the HTTP level: from the IP header, the source and destination IP addresses; from the TCP header, the source and destination TCP ports.

At this point it checks whether filtering is enabled. If capture was configured for packets to and from all ports, it goes directly to the next step. If instead one or more port numbers were set, it checks whether the source or destination TCP port is among those listed; if at least one of the two matches, processing continues, otherwise the packet is discarded.
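The port check described above amounts to a set membership test on the two port numbers. The following Java sketch shows one possible form; it is not the MetroFunnel source, and the class name, the constructor and the convention that an empty set means "capture on any port" are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical passive filter on TCP ports; an empty set means no filtering. */
class PortFilter {

    private final Set<Integer> ports;

    PortFilter(Integer... ports) {
        this.ports = new HashSet<>(Arrays.asList(ports));
    }

    /** A packet is kept if no filter is set or if either port is in the configured list. */
    boolean accepts(int sourcePort, int destinationPort) {
        return ports.isEmpty()
                || ports.contains(sourcePort)
                || ports.contains(destinationPort);
    }

    public static void main(String[] args) {
        PortFilter filter = new PortFilter(8080);
        System.out.println(filter.accepts(46594, 8080)); // request towards the service
        System.out.println(filter.accepts(8080, 46594)); // response from the service
        System.out.println(filter.accepts(46594, 9090)); // unrelated traffic
    }
}
```

Checking both the source and the destination port is what lets the same filter keep the request and its response, since the two port numbers are swapped between them.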

It then analyses the packet's HTTP header to determine whether it is a request or a response. Here, because of a known bug in the jNetPcap library, due to an incorrect use of the pointer to the reassembled message, it was not possible to use the method provided by the library, so a manual check was implemented. To determine whether the message is a request or a response, MetroFunnel inspects the first 4 bytes of the header: if they are “HTTP” the message is a response, otherwise it is a request.
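The manual check on the first 4 bytes can be sketched as follows. This is an illustration under stated assumptions, not the actual MetroFunnel method: the class and method names are hypothetical, the input is assumed to be the HTTP payload of a packet that has already passed the HTTP header check, and treating a payload shorter than 4 bytes as "not a response" is a choice made here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical classifier: a message starting with "HTTP" is a response. */
class HttpMessageKind {

    static boolean isResponse(byte[] httpPayload) {
        if (httpPayload.length < 4) {
            return false; // too short to carry the "HTTP" prefix of a status line
        }
        return httpPayload[0] == 'H'
                && httpPayload[1] == 'T'
                && httpPayload[2] == 'T'
                && httpPayload[3] == 'P';
    }

    public static void main(String[] args) {
        byte[] response = "HTTP/1.1 200 OK\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] request = "GET /test/users/1 HTTP/1.1\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isResponse(response));
        System.out.println(isResponse(request));
    }
}
```

The check works because a status line always begins with the protocol version (for example HTTP/1.1), while a request line begins with the method name.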

Based on this check, the method picks up the necessary fields and, in the case of a request, creates a new event object and inserts it in the list of pending events.

If the message is a response, the method searches the event list for the first event whose IP address and TCP port pairs match, with source and destination swapped, calculates the execution time and writes the event to the log. As a consequence, the events in the log are ordered by response time and not by the arrival time of the requests.

This association between response and request is possible because of how the HTTP protocol works. The HTTP 1.0 standard requires that every request receive its response before a new request is sent, so it is not possible to have different pending requests with the same source IP address and TCP port pair. The HTTP 1.1 standard introduces pipelining of requests, but the responses must be sent in the same order as the requests; this explains why it is enough to take the first compatible event.
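The matching rule can be sketched in Java as a search over a list kept in arrival order. The code below is a simplified illustration and not the EventLogManager implementation: the class, field and method names are hypothetical, and the event carries only the fields needed for the association.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical list of pending requests, kept in arrival order. */
class PendingRequests {

    static final class Event {
        final String url;
        final String srcIp;
        final int srcPort;
        final String dstIp;
        final int dstPort;

        Event(String url, String srcIp, int srcPort, String dstIp, int dstPort) {
            this.url = url;
            this.srcIp = srcIp;
            this.srcPort = srcPort;
            this.dstIp = dstIp;
            this.dstPort = dstPort;
        }
    }

    private final List<Event> pending = new LinkedList<>();

    void addRequest(Event event) {
        pending.add(event); // appended at the end, so the list stays ordered by arrival
    }

    int size() {
        return pending.size();
    }

    /**
     * Removes and returns the oldest request whose endpoints are the response's
     * endpoints swapped, or null if no pending request is compatible.
     */
    Event matchResponse(String respSrcIp, int respSrcPort, String respDstIp, int respDstPort) {
        Iterator<Event> it = pending.iterator();
        while (it.hasNext()) {
            Event e = it.next();
            if (e.srcIp.equals(respDstIp) && e.srcPort == respDstPort
                    && e.dstIp.equals(respSrcIp) && e.dstPort == respSrcPort) {
                it.remove();
                return e;
            }
        }
        return null;
    }
}
```

With two pipelined requests from the same client address and port, the first response is associated with the first request in the list, which is the behaviour that the ordering guarantee of HTTP 1.1 pipelining makes correct.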



If no event is compatible with the response, it means that the timer was set too short and the request has already timed out. In that case the log will contain two rows relating to the same event: one as an unanswered request, the other as a response without a request. When analysing the log, if these pairs of events cancel each other out, no packets were lost; otherwise at least one packet was lost.

MetroFunnel then checks whether there are other pending requests whose running time exceeds the configured timeout; if so, it writes each of them to the log with “999” as the response code and deletes them from the list.

Since the requests are ordered by arrival time, as soon as a request is found whose running time is less than the timeout, the search stops and MetroFunnel waits for another packet.
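The early stop made possible by the arrival-time ordering can be sketched as follows. This is an illustrative Java example and not the MetroFunnel code: the class and method names are hypothetical, times are expressed in milliseconds by assumption, and the returned rows contain only the URL and the 999 code rather than the full log line.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical timeout check over requests ordered by arrival time. */
class TimeoutSweep {

    static final int TIMEOUT_CODE = 999;

    static final class Pending {
        final String url;
        final long arrivalMillis;

        Pending(String url, long arrivalMillis) {
            this.url = url;
            this.arrivalMillis = arrivalMillis;
        }
    }

    /** Removes expired requests from the head of the queue and returns one row per request. */
    static List<String> sweep(Deque<Pending> queue, long nowMillis, long timeoutMillis) {
        List<String> rows = new ArrayList<>();
        // The oldest request is at the head: the first non-expired one ends the search.
        while (!queue.isEmpty()
                && nowMillis - queue.peekFirst().arrivalMillis > timeoutMillis) {
            Pending expired = queue.pollFirst();
            rows.add(expired.url + ", " + TIMEOUT_CODE);
        }
        return rows;
    }

    public static void main(String[] args) {
        Deque<Pending> queue = new ArrayDeque<>();
        queue.add(new Pending("/a", 0));
        queue.add(new Pending("/b", 500));
        queue.add(new Pending("/c", 900));
        System.out.println(sweep(queue, 1000, 400)); // /a and /b have expired
        System.out.println(queue.size());            // /c is still pending
    }
}
```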

A series of examples is now shown to clarify the cases described above.

This is the basic case: a request and a correct response.

1. GET, /test/users/1, 127.0.0.1, 46594, 127.0.0.1, 8080, 200, 60010.567, 1, 0, Request – Response

Here is an example of a false positive: a request that timed out, and a response not associated with any request but with a positive response code.


– TIMEOUT

2. NULL, NULL, 127.0.0.1, 46594, 127.0.0.1, 8080, 200, 7564.334, 1, 0, NO REQUEST – Response

This is a request that timed out, with no response received later; it can be either a false positive, if MetroFunnel failed to capture the response, or a real failure, if the response was never sent.


– TIMEOUT



Here, instead, is a request that received error code 502; in this case, the timeout was long enough for the response to be recorded.


– Response

Finally, an example of a request that expired by timeout, after which the response with error code 502 was received; in this case it is not a false positive because the response contains an error code, so both lines refer to the same failure.


– TIMEOUT

2. NULL, NULL, 127.0.0.1, 46594, 127.0.0.1, 8080, 502, 7564.334, 1, 0, NO REQUEST – Response

4.3 Docker version

We decided to develop a Docker [11] version of MetroFunnel, in order to increase its portability and facilitate its use. According to a Linux.com article [14],

"Docker is a tool that can package an application and its dependencies in a virtual container that can run on any Linux server. This helps enable flexibility and portability on where the application can run, whether on premises, public cloud, private cloud, bare metal, etc."

Docker provides an additional layer of abstraction and automation of operating-system-level virtualization on Windows and Linux. Operating-system-level virtualization, also called containerization, is a kernel feature of an operating system that allows multiple isolated user-space instances to exist.

Docker uses the resource isolation features of the Linux kernel to allow independent

containers to run within a single Linux instance. In particular, it uses:


• cgroups, to provide resource limiting, including the CPU, memory, block I/O, and network

• kernel namespaces, to isolate an application's view of the operating environment, including process trees, network, user IDs and mounted file systems

• a union-capable file system, such as OverlayFS, to combine multiple directories into one that appears to contain their combined contents

Docker can use different interfaces to access the virtualization features of the Linux kernel. It includes the libcontainer library, to use directly the virtualization facilities provided by the Linux kernel, or it can use libvirt, LXC (Linux Containers) and systemd-nspawn, to use abstracted virtualization interfaces.

Thanks to the features described above, a Docker container, unlike a virtual machine, does not require its own operating system. Instead, it relies on the kernel's functionality to isolate the application's view of the operating system.

In Docker we have images and containers: a Docker image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it (code, runtime, system tools, system libraries, settings); a Docker container is a running instance of that image.

Figure 5 - Kernel functionality used by Docker

By default, each container’s access to the host machine’s CPU cycles is unlimited. It is possible to set various constraints to limit a given container’s access to the host machine’s CPU cycles. The scheduler used is CFS (Completely Fair Scheduler), a process scheduler introduced in release 2.6.23 (October 2007) of the Linux kernel, where it is the default scheduler. It handles CPU resource allocation for executing processes, and aims to maximize overall CPU utilization while also maximizing interactive performance. Through this scheduler, CPU usage is divided evenly among all the running containers that request it.

4.3.1 Image and Dockerfile

To create a Docker image, we must first create a folder containing a Dockerfile and the additional files needed for the build. In our case these are:

• MetroFunnel.jar - the Jar file to execute MetroFunnel

• libjnetpcap.so

• libjnetpcap-pcap100.so

• Dockerfile

The two libjnetpcap files, which belong to the libpcap wrapper, are required by MetroFunnel in order to work.

The Dockerfile for generating the Docker image is the following:

Figure 6 - MetroFunnel Dockerfile

Line 1 specifies the base image; lines 2 to 10 are the operations to be performed on the base image, namely: repository update, system update, addition of the repository for installing Java, installation of Java, installation of the libpcap library, and clean-up. Lines 11-13 add the files listed above to the image. Line 14 sets the command to execute when the container starts, that is, the command that runs MetroFunnel.

The command to run the Docker container is the following:

docker run --net=host --privileged -v MetroFunnelData:/MetroFunnelData -it --rm --name=MetroFunnel metrofunnelimage

As you can see, apart from the usual parameters, the -v parameter has been added to attach a volume. The volume is needed so that files survive the termination of the container and so that one container can read the files of another; in this case, it allows Filebeat to read the MetroFunnel log files. (When using the Docker version of MetroFunnel, you need to add the -v parameter to the Filebeat run command as well, to attach it to the same volume.) Furthermore, the --privileged parameter has been added to give the container access to all the devices of the physical machine.

4.4 Elastic stack configuration

Once we have a log, the Elastic stack can be configured to process it and to display the data in real time.

The stack used is the one shown in chapter 2, where a typical monitoring solution was presented: Filebeat for reading the log, Logstash for parsing, Elasticsearch for indexing and storage, and finally Kibana for viewing. The Docker version of each component was used, creating a customised image with the configuration files for each of them.

Unlike the example shown in chapter 2, we no longer have the problem of different log formats: our log is always the same, regardless of the application being monitored.

The configuration of Filebeat concerns only the path from which to read the log and the IP address where Logstash resides. With the standard version of MetroFunnel, the log is produced in the program's execution folder; the Docker version instead writes the files to a Docker volume, so the log path is the volume, as explained in the previous section.

The configuration of Logstash, which includes the parsing of the log, is shown in full:


Figure 7 – Logstash configuration file

As you can see, the configuration of Logstash also becomes very simple: in addition to the input and output configuration, the filter part consists only of identifying the fields and converting the numeric parameters to numbers.

Finally, the configurations of Elasticsearch and Kibana also concern only the IP address, so they are not shown.

At this point we show the complete architecture of MetroFunnel, highlighting its different versions; the figure also shows the capability, offered by the ELK stack, of separating the data collection node from the data visualization node.

Figure 8 - Complete architecture of MetroFunnel

4.5 User and operating manual

An example of MetroFunnel operation is now shown. When launched, it displays all the available interfaces, and then asks for the number of interfaces (which can be physical or virtual) that you want to monitor.

Figure 9 - List of available interfaces



Next, the list of interfaces is shown again and MetroFunnel asks for the ID of the interface to be monitored; you must enter the ID of the network interface on which the microservices' packets pass.

After the interface ID is entered, MetroFunnel confirms the creation of the log file, showing its name, and then asks whether you want to enter TCP port numbers to filter the data. This check is applied to each packet in transit, and only the requests and responses carried by packets that have one of those TCP port numbers, either as source or as destination, are recorded.

You can enter multiple values separated by a space, or enter any to capture everything without filtering.

Figure 10 - Insert the interface number ID

Figure 11 - Insert TCP port number

Figure 12 - Capture on any TCP port

Figure 13 - List of TCP port number



Finally, it asks for the maximum time to wait before a request is considered timed out; this value is the same for all requests, regardless of the method and URL.

If you chose to monitor multiple interfaces simultaneously, all the previous options are requested for each interface.

Once everything has been entered, the program starts monitoring the packets in transit on the network, displaying each request on screen when it receives the corresponding response or expires by timeout.

You can then start the Elastic stack, with the configuration described above, to see the operation of the microservices in real time.

The first image shows the Kibana interface: at the top is the number of events occurring over time, and in the middle are the log rows received. By setting a filter on the parameters of the log, you can view only the events you are interested in; for example, you can filter on the response code, a particular method or URL, etc.

The second image shows how easy it is to create graphs; in this case, the average execution times are shown for the various methods.

Figure 14 - Example of capture



Figure 16 - Kibana

Figure 15 - Kibana histogram visualization



Chapter 5: Comparison

This chapter highlights the functional and performance differences between the classic approach and MetroFunnel. To do this we chose an application, Clearwater IMS, whose features and functionality are introduced in this chapter.

This application was chosen because it is widespread at company level, well known, and repeatedly used as a test application. Moreover, it is a substantial project, composed of various nodes that each perform their own function; it can be distributed, since the application's nodes can reside on different physical nodes; and, most importantly for our purpose, a version developed as microservices is available.

5.1 Clearwater IMS

Project Clearwater is an open-source IMS core, developed by Metaswitch Networks and released under the GNU GPLv3. IMS (the IP Multimedia Subsystem) is the standards-based architecture that has been adopted by large telcos as the basis of their IP-based voice, video and messaging services, replacing legacy circuit-switched systems and previous-generation VoIP systems based on softswitching.

Clearwater provides SIP-based call control for voice and video communications and for

SIP-based messaging applications. You can use Clearwater as a standalone solution for

mass-market VoIP services, relying on its built-in set of basic calling features and

standalone subscriber database, or you can deploy Clearwater as an IMS core in conjunction

with other elements such as Telephony Application Servers and a Home Subscriber Server.

We chose the Docker version of Clearwater, which is implemented as microservices deployed as 11 nodes, distributed in as many Docker containers:

1. Etcd

2. Astaire

3. Bono

4. Cassandra

5. Chronos

6. Ellis

7. Homer

8. Homestead

9. Homestead-prov

10. Ralf

11. Sprout

Etcd is a distributed reliable key-value store for the most critical data of a distributed

system, with a focus on being:

• Simple: well-defined, user-facing API (gRPC)

• Secure: automatic TLS with optional client cert authentication

• Fast: benchmarked 10,000 writes/sec

• Reliable: properly distributed using Raft

Figure 17 - Clearwater architecture



Astaire pro-actively resynchronises data across a cluster of Memcached nodes, allowing for faster scale-up/scale-down. Memcached is a distributed, in-memory object caching system, used to improve the speed and decrease the loading times of dynamic database-driven websites by caching the required data and reducing the load on the database servers. Astaire works with the Project Clearwater MemcachedStore to create a dynamically scalable, geographically redundant, highly consistent transient data store.

Bono is Clearwater's edge proxy. It provides limited P-CSCF function and some of Clearwater's S-CSCF function.

P-CSCF (Proxy – Call Session Control Function) is a SIP proxy and is the first node reached by an IMS terminal; all signalling messages pass through it, and it can inspect every message.

S-CSCF (Serving – Call Session Control Function) is the central node of the signalling plane.

It generally acts as a stateful SIP proxy, receiving SIP messages from users, checking their

authenticity and forwarding them to other bono instances or one of the sprout instances.

Chronos is a distributed, redundant, reliable timer service. It is designed to be generic to

allow it to be used as part of any service infrastructure. It is designed to scale out horizontally

to handle large loads on the system and also supports elastic, lossless scaling up and down

of the cluster to handle extra load on the service.

Ellis contains the user database and the pool of numbers that can be allocated. It does not

contain per-line configuration - it stores all this directly in Homestead and Homer, accessing

them over their defined HTTP APIs.

Ellis is mainly written in Python. It uses Tornado for HTTP and MySQL as the underlying

database. Virtualenv is used to manage dependencies.

It provides a web GUI and underlying HTTP API for user and line creation, number

allocation, and configuration of iFCs and call services.



Homer is the XDMS (XML Document Management Server) component in Clearwater. It

provides storage, management and subscription to documents.

Homestead is a RESTful CRUD server built using C++ on top of Cassandra. It is designed

to be easily extensible and makes some assumptions about how you'll want to store your

data in Cassandra.

Ralf is a component of the Metaswitch Clearwater project, designed to act as the CTF

(Charging Trigger Function) for Clearwater nodes in an IMS compliant deployment. It

converts JSON bodies in HTTP requests from IMS components into Diameter Rf ACRs. It

uses memcached to store Rf session information for the duration of a session, and it uses

Chronos to send regular INTERIM ACRs to keep the session alive.

Sprout is Clearwater's SIP router. It provides most of Clearwater's S-CSCF function. It

generally acts as a stateful SIP proxy. It provides registrar function, storing registration

information in a memcached store distributed across all sprout instances. It also provides

application server function, retrieving Initial Filter Criteria documents from Homestead and

acting on them. As well as supporting external application servers, sprout has built-in

support for MMTEL services.

Cassandra is a non-relational, distributed database management system released under an open source license and optimized for managing large amounts of data. It persists the information generated by the other nodes, in particular Ellis, Homer and Homestead.

Features:

• Decentralized: the nodes in the cluster are identical. There is no single point of failure.

• Fault tolerance: data is automatically replicated on multiple nodes. Replication across different data centers is supported, and nodes can be replaced without downtime.

• Tunable consistency: the level of consistency (for both writes and reads) can be tuned, for example from "writes never fail" to "block until all replicas are readable".

• Elasticity: read and write throughput increases linearly as new machines (nodes) are added, without downtime and without interrupting any application.

For any further detail and specification of Clearwater, please refer to the documentation and

official website of the Clearwater project.

5.1.1 Clearwater-live-test

A test framework is available for the application: a suite composed of 80 tests.

Of these 80, we chose the 50 that work by default, without further configuration of Clearwater. Tests in the framework are essentially short Ruby programs. These programs use the Quaff library to talk over SIP to Clearwater nodes for calls, and the rest-client library to communicate with Ellis for provisioning.

5.1.2 Basic operation

Through log analysis, packet analysis with Wireshark and the logs generated by MetroFunnel, we observe the following (simplified) pattern of operation of the test suite.

It is shown to clarify how Clearwater operates and to give the reader an idea of what we actually monitor; it will also be used in the analysis of the failures presented in the following paragraphs.

First of all, the client container sends a session opening request to Ellis on TCP port 80, using the POST method and the URI “/session”.

Subsequently we have the following operations:

1. Client request to register a telephone number to Ellis

[POST, /accounts/[email protected]/numbers/]

a. This execution involves in sequence the successive requests (8 requests)

from node Ellis to the Homer and Homestead-prov nodes



b. Only after receiving a reply, the client’s request is answered

2. (optional) Registration of the further telephone numbers required for the test: each test requires between 1 and 4 telephone numbers, depending on the test being performed. Each execution of the method registers only one number at a time, repeating steps a and b of the previous point

3. Execution of the test over the SIP protocol, together with microservice requests whose number and type vary depending on the test

4. Client request for cancellation of the telephone number to Ellis

[DELETE, /accounts/[email protected]/numbers/sip%3AXXXXXX]

a. This execution involves in sequence the successive requests (6 requests)

from node Ellis to the Homer and Homestead-prov nodes

b. Only after receiving a reply, the client’s request is answered

5. (optional) Cancellation of additional telephone numbers. Each execution involves

deleting only one number at a time, always performing steps a and b of the previous

point

These steps are all performed for each repetition and for each individual test of the entire

suite.

5.2 Testbed

The following test system was configured to compare the Classic and MetroFunnel solutions.

Server:

• CPU: Intel Core i3-2100, 3.1 GHz

• RAM: 6 GB DDR3

• LAN: Realtek Gigabit

• OS: Ubuntu 16.04.3 LTS



On the client side, a virtual machine hosted on an Apple physical machine was used.

Client:

• Host: Apple MacBook Pro (13.3-inch, Early 2015), 8 GB DDR3 RAM, Thunderbolt Gigabit Ethernet adapter, OS: macOS High Sierra 10.13.1, VMware Fusion 8.1.0

• VM:

o CPU: 1 dedicated core

o RAM: 4 GB

o OS: Ubuntu 16.04.3

For the connection, a 1-meter Category 6 (Gigabit) RJ45 crossover cable was used. The bandwidth available between the two machines was verified with the iperf tool. 20 tests were carried out, as shown in the table:

Table 1 - iperf tests (Mbps)

Test  Server machine  Simultaneous TCP connections  Detected on i3 machine (Mbps)  Detected on Apple machine (Mbps)
1     i3              1                             879                            881
2     i3              1                             926                            928
3     i3              5                             880                            891
4     i3              5                             866                            867
5     i3              10                            874                            886
6     i3              10                            820                            821
7     i3              15                            702                            773
8     i3              15                            723                            802
9     i3              20                            650                            715
10    i3              20                            705                            759
11    Apple           1                             939                            938
12    Apple           1                             930                            929
13    Apple           5                             932                            932
14    Apple           5                             940                            939
15    Apple           10                            938                            938
16    Apple           10                            935                            934
17    Apple           15                            942                            936
18    Apple           15                            939                            852
19    Apple           20                            915                            829
20    Apple           20                            942                            853
min                                                 650                            715



In 10 tests, iperf ran as server on the i3 machine and as client on the other machine, with a varying number of simultaneous TCP connections. In the other 10, the reverse test was performed, with iperf running as server on the Apple machine. For each test, the values measured by both machines were recorded.

Of these values, the minimum was chosen in order to assume the worst-case scenario, giving an available bandwidth of 650 Mbps.
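The worst-case choice can be reproduced with a few lines of Python. This is only a sketch: the readings are copied from Table 1, while the function and variable names are ours.

```python
# Throughput readings from Table 1 (Mbps), one pair per test:
# (value detected on the i3 machine, value detected on the Apple machine).
IPERF_MBPS = [
    (879, 881), (926, 928), (880, 891), (866, 867), (874, 886),
    (820, 821), (702, 773), (723, 802), (650, 715), (705, 759),
    (939, 938), (930, 929), (932, 932), (940, 939), (938, 938),
    (935, 934), (942, 936), (939, 852), (915, 829), (942, 853),
]


def worst_case_bandwidth(readings):
    """Return the smallest reading over all tests and both machines."""
    return min(min(pair) for pair in readings)


print(worst_case_bandwidth(IPERF_MBPS), "Mbps")
```

Taking the minimum over both columns is a conservative choice: every later measurement is compared against the lowest throughput ever observed on the link.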

5.3 Performance Analysis

In this phase the performance of the different solutions is compared.

Four test cases with variable load were performed:

1. Monitoring off (indicated in the table with Off)

2. Monitoring through the analysis of Clearwater internal logs (shown in the table with

Classic)

3. Monitoring through the analysis of the MetroFunnel Standard log

4. Monitoring through the analysis of the MetroFunnelDocker log

In this phase the log is not actually analysed; rather, we study the effect of the various solutions on performance. In particular, we take into consideration:

• Size of the logs

• The amount of data exchanged on the network

• The execution time of the test suite

• The bandwidth used

With the Classic solution, the 10 Clearwater containers that generate logs were modified by running a Filebeat instance on each of them to ship the logs.

By default, each component logs to /var/log/<service>/ at log level 2 (which only includes errors and very high-level events). More detailed logs can be obtained by enabling debug logging.

No log management parameter was modified.

With the MetroFunnel solutions, we analyse how much performance degradation occurs on the server in the two cases. For the Standard version, there is a Java process that writes the log and a Filebeat process that ships it. For the Docker version, there are two additional containers (MetroFunnel to generate the log and Filebeat to ship it) on top of the eleven Clearwater containers.

To test the application, we use the live tests selected in the previous paragraph, with a modification to the Rakefile and to the Ruby start script so that a single test is executed several times. (The suite has a REPEAT parameter, but it repeats the entire suite, whereas we wanted to repeat the same test several times before executing the next one.)

This change was made so that multiple clients simultaneously perform the same test for a longer period of time.

With these changes, we built the Docker image that is used to run the test containers.

To vary the workload, an ad hoc bash script was created, which takes two parameters: the number of containers and the number of repetitions of each single test.

After a couple of trials, we fixed the repetitions at 5 per test, while the number of containers is used as the load index of the system workload.
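The structure of such a launcher can be sketched as follows. The original script is written in bash and is not reproduced here; this Python version only illustrates the two parameters. The image name `live-test-repeat` and the environment variable `REPEAT_EACH` are hypothetical placeholders, not names taken from the thesis code.

```python
import shlex


def build_launch_commands(n_containers, repetitions,
                          image="live-test-repeat"):  # hypothetical image name
    """Build one `docker run` command per client container.

    REPEAT_EACH is a hypothetical environment variable standing in for
    whatever mechanism the modified Rakefile uses to repeat a single test.
    """
    if n_containers < 1 or repetitions < 1:
        raise ValueError("both parameters must be positive")
    commands = []
    for i in range(1, n_containers + 1):
        commands.append([
            "docker", "run", "--rm",
            "--name", f"live-test-{i}",
            "-e", f"REPEAT_EACH={repetitions}",
            image,
        ])
    return commands


# Print the commands instead of executing them; `time` is prepended because
# the duration of each client container is measured separately.
for cmd in build_launch_commands(n_containers=2, repetitions=5):
    print("time " + shlex.join(cmd))
```

With this layout the load index is simply the first argument, so a sweep from 1 to 12 containers only changes `n_containers` while the repetitions stay fixed at 5.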

With a load of 15 containers, the Ellis node crashes during the execution of the test suites (as we will see later in the failure analysis), because it fails to handle all the simultaneous connections for registering and deleting telephone numbers. Therefore, we chose to stay slightly below the operating limit: the workload ranges from 1 container up to a maximum of 12 containers.

The purpose of the test is not to stress-test Clearwater, but to analyse the performance and compare the two monitoring solutions. These load values were chosen so that all the tests pass without errors because, at each error, the client waits 30 seconds before closing the affected test with a timeout, which would distort the results.

To validate the results while limiting the total number of tests to be performed, 5 repeated tests were carried out on 3 different load values, namely 20% (2 containers), 50% (6 containers) and 100% (12 containers).



To reduce the complexity of the test system, only Clearwater, plus any monitoring, is instantiated on the Server machine.

The live-test containers are instantiated on the Client machine; from the second scenario onwards, the two containers Logstash and ElasticSearch are added, since they are necessary in all the other scenarios.

The amount of data exchanged on the network and the duration of the tests are measured on the Client machine. The data exchanged is calculated as the difference between the total incoming data on the device at the end of the test and the total incoming data at the beginning of the test. These two values are read with the Linux nload tool, started before launching the relevant test and stopped immediately after the test finishes.
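The before/after difference can be illustrated with a small sketch. Note the assumptions: the measurements in this work were read from nload, whereas the code below parses the kernel counters in the `/proc/net/dev` format, and it takes 1 KB as 1024 bytes. The interface name `eth0` and the sample counter values are invented for illustration.

```python
def rx_bytes(proc_net_dev_text, iface):
    """Extract the received-bytes counter of `iface` from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, counters = line.partition(":")
        if sep and name.strip() == iface:
            # The first counter after the colon is the number of bytes received.
            return int(counters.split()[0])
    raise KeyError(iface)


def incoming_kb(text_before, text_after, iface):
    """Incoming data during the test in KB (assuming 1 KB = 1024 bytes)."""
    delta = rx_bytes(text_after, iface) - rx_bytes(text_before, iface)
    return delta / 1024


HEADER = (
    "Inter-|   Receive                            |  Transmit\n"
    " face |bytes    packets errs drop fifo frame compressed multicast"
    "|bytes    packets errs drop fifo colls carrier compressed\n"
)
# Invented snapshots taken before and after a test run.
BEFORE = HEADER + "  eth0: 1000000 900 0 0 0 0 0 0 500000 800 0 0 0 0 0 0\n"
AFTER = HEADER + "  eth0: 3097152 2900 0 0 0 0 0 0 900000 1800 0 0 0 0 0 0\n"

print(incoming_kb(BEFORE, AFTER, "eth0"), "KB")
```

On a real machine the two snapshots would be obtained by reading `/proc/net/dev` right before and right after the test, which mirrors starting and stopping nload around it.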

The duration of the tests, on the other hand, is measured with the "time" command, placed before the launch of each test container; the durations of the whole suite, one per client container, are then written in the table, where the average is calculated.

The log sizes, instead, are taken from the Server machine.

5.3.1 Log Size

In the Clearwater column, the sum of all logs in all containers is shown.

Clearwater is restarted for each test, so the size of the logs depends on the startup plus the

log due to the live-test.

• In Clearwater the startup (10 min) produces a log of 250-300KB

• With the MetroFunnel the startup produces a log of 100KB

• After 30 min without live-test the Clearwater log is 310KB

• After 30 min without live-test the MetroFunnel log is 250KB

This difference is due to the Heartbeat packets, which are not reported in the Clearwater logs. The values in the table are expressed in MB.



Table 2 - Log size (MB)

Load  Clearwater*  MetroFunnel  MetroFunnelDocker  Difference
1     8.1          3.3          3.3                -59.26%
2     15.9         6.5          6.5                -59.12%
3     23.6         9.7          9.7                -58.90%
4     31.4         12.9         12.9               -58.92%
5     39.1         16.1         16.1               -58.82%
6     46.9         19.3         19.3               -58.85%
7     54.7         22.5         22.5               -58.87%
8     62.4         25.8         25.8               -58.65%
9     70.2         29.0         29.0               -58.69%
10    78.0         32.3         32.3               -58.59%
11    85.7         35.5         35.5               -58.58%
12    93.5         38.7         38.7               -58.61%

As can be seen, regardless of the load, there is a saving of almost 60% on the size of the logs.

Clearwater logs include information on microservices as well as information regarding the SIP protocol, the latter accounting for about 15% of the total log size. This share can be calculated manually; the log lines concerning the SIP protocol cannot be separated automatically except by filtering through Logstash, and therefore only after the logs have been sent.

Therefore, if only the logs carrying the same information are considered, MetroFunnel gives a saving of almost 50%. In the measurements comparing incoming data, the logs are, as they should be, sent in full, leaving Logstash the burden of filtering the information.

Below is the table containing the log sizes of the repeated tests for the 3 selected load values.

Table 3 - Log size test repeated (MB)

Source log         Load  Rep 1  Rep 2  Rep 3  Rep 4  Rep 5
Clearwater         2     15.9   15.8   15.8   15.8   15.8
Clearwater         6     46.9   46.9   46.9   46.9   46.9
Clearwater         12    93.5   93.6   93.5   93.5   93.5
MetroFunnel        2     6.5    6.6    6.5    6.5    6.5
MetroFunnel        6     19.3   19.3   19.4   19.3   19.3
MetroFunnel        12    38.7   38.7   38.8   38.7   38.7
MetroFunnelDocker  2     6.5    6.5    6.5    6.5    6.5
MetroFunnelDocker  6     19.3   19.3   19.4   19.3   19.3
MetroFunnelDocker  12    38.7   38.7   38.7   38.8   38.7



As can be seen, the log sizes are almost identical across all repetitions, except for a sporadic 100 KB difference, equal to a 1.5% error in the worst case and present in only 6 results out of the total of 45 tests carried out.

So, without further analysis, we can assume the trend is confirmed.
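The Difference column of Table 2 is the relative change of the MetroFunnel log size with respect to the Clearwater one. A minimal check on three rows of that table follows; the function name is ours.

```python
def relative_difference(clearwater_mb, metrofunnel_mb):
    """Percentage change of the MetroFunnel log size w.r.t. Clearwater."""
    return (metrofunnel_mb - clearwater_mb) / clearwater_mb * 100


# (load, Clearwater MB, MetroFunnel MB) taken from Table 2
ROWS = [(1, 8.1, 3.3), (6, 46.9, 19.3), (12, 93.5, 38.7)]

for load, clearwater, metrofunnel in ROWS:
    diff = relative_difference(clearwater, metrofunnel)
    print(f"load {load:2d}: {diff:.2f}%")
```

The same function also gives the worst-case repetition error quoted above: a 0.1 MB deviation on a 6.5 MB log is about 1.5%.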

5.3.2 Data incoming

After launching Logstash and ElasticSearch on the Client machine, we compare the amount of incoming data on that machine across the different solutions, that is, how much the shipping of logs from the Server machine to the Client machine weighs on it.

The values in the table are expressed in KB.

Table 4 - Data incoming (KB)

Load  Off        Classic    MetroFunnel  MetroFunnelDocker
1     4275.160   6369.280   5836.80      5856.010
2     7966.720   12134.400  10649.60     10567.680
3     11683.840  18073.600  15360.00     15349.760
4     15472.640  24074.240  20121.60     20193.280
5     19200.000  30023.680  25384.96     25210.880
6     22999.040  34211.840  29767.68     29931.520
7     26736.640  41748.480  34703.36     34846.720
8     30515.200  47656.960  39802.88     39864.320
9     34406.400  53565.440  44728.32     44789.760
10    38195.200  59494.400  49838.08     49776.640
11    41963.520  65413.120  54937.60     54917.120
12    45772.800  69314.560  59668.48     59760.640



As can be seen, and as expected, the data for the two versions of MetroFunnel almost completely overlap. The following table shows the percentage increase of incoming data for the different solutions.

Table 5 - Ratio data incoming

Load  Classic/Off  MetroFunnel/Off  MetroFunnelDocker/Off  MetroFunnel/Classic
1     48.98%       36.53%           36.98%                 -8.36%
2     52.31%       33.68%           32.65%                 -12.24%
3     54.69%       31.46%           31.38%                 -15.01%
4     55.59%       30.05%           30.51%                 -16.42%
5     56.37%       32.21%           31.31%                 -15.45%
6     48.75%       29.43%           30.14%                 -12.99%
7     56.15%       29.80%           30.33%                 -16.88%
8     56.17%       30.44%           30.64%                 -16.48%
9     55.68%       30.00%           30.18%                 -16.50%
10    55.76%       30.48%           30.32%                 -16.23%
11    55.88%       30.92%           30.87%                 -16.01%
12    51.43%       30.36%           30.56%                 -13.92%

As for the log size, 5 repeated tests are performed on the 3 load values chosen to confirm

the trend. The measures expressed in the table are in KB.

Figure 18 - Data incoming (KB/s)



Table 6 - Data incoming repeated (KB)

Monitoring         Load  Rep 1     Rep 2     Rep 3     Rep 4     Rep 5     Average
Off                2     7966.72   8007.68   8007.68   8007.68   8017.92   8001.54
Off                6     22999.04  23152.64  23695.36  23173.12  23173.12  23238.66
Off                12    45772.80  45998.08  46254.08  46059.52  46039.04  46024.70
Classic            2     12134.40  11970.56  12042.24  12154.88  12216.32  12103.68
Classic            6     34211.84  34017.28  33904.64  34273.28  34058.24  34093.06
Classic            12    69314.56  68751.36  68823.04  69294.08  68730.88  68982.78
MetroFunnel        2     10649.60  10465.28  10076.16  10414.08  10414.08  10403.84
MetroFunnel        6     29767.68  29173.76  29112.32  29952.00  29798.40  29560.83
MetroFunnel        12    59668.48  59648.00  59586.56  59330.56  59781.12  59602.94
MetroFunnelDocker  2     10567.68  10700.80  10741.76  10608.64  10690.56  10661.89
MetroFunnelDocker  6     29931.52  29788.16  29358.08  29829.12  29808.64  29743.10
MetroFunnelDocker  12    59760.64  59648.00  59688.96  59330.56  59781.12  59641.86

The differences between the repetitions are due to some packet retransmissions.

They are negligible with respect to the total quantity of data exchanged, so the values are averaged. The ratios between the various averages were then computed; the results are shown in the following graph.

The Classic approach involves an increase in incoming data of about 50%, compared to a 30% increase with the MetroFunnel solution. This amounts to a saving of about 15% in exchanged data.
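The averaging and the ratios can be checked with a short script. The sketch below uses only the load-2 rows of Table 6; the function and variable names are ours and are not part of the tooling used for the measurements.

```python
from statistics import mean

# Incoming data (KB) over the 5 repetitions at load 2, from Table 6.
OFF = [7966.72, 8007.68, 8007.68, 8007.68, 8017.92]
CLASSIC = [12134.40, 11970.56, 12042.24, 12154.88, 12216.32]
METROFUNNEL = [10649.60, 10465.28, 10076.16, 10414.08, 10414.08]


def percent_change(value, baseline):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value / baseline - 1) * 100


off_avg, classic_avg, mf_avg = mean(OFF), mean(CLASSIC), mean(METROFUNNEL)

print(f"Classic vs Off:         {percent_change(classic_avg, off_avg):+.2f}%")
print(f"MetroFunnel vs Off:     {percent_change(mf_avg, off_avg):+.2f}%")
print(f"MetroFunnel vs Classic: {percent_change(mf_avg, classic_avg):+.2f}%")
```

For this load value the script gives roughly +51%, +30% and -14%, in line with the rounded figures quoted in the text.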

Figure 19 - Ratio data incoming



5.3.3 Execution time

At this point we consider the impact on the test execution times.

The execution time of each single test suite is measured; therefore with a load of 2 (2 simultaneous containers) there are 2 measurements, with a load of 3 there are 3 measurements, and so on.

The measurements of the different containers are then averaged for each load and reported in the table as a single value. The values in the table are expressed in seconds.

Table 7 - Execution time (s)

Load  Off       Classic   MetroFunnel (MF)  MetroFunnelDocker (MFD)
1     637.709   644.754   641.278           647.060
2     664.087   673.417   660.916           673.788
3     679.662   685.640   684.978           691.898
4     697.546   713.891   721.792           734.914
5     753.103   769.174   781.832           799.084
6     852.233   864.870   885.647           913.038
7     966.282   981.108   1001.491          1034.346
8     1068.321  1084.562  1114.020          1144.939
9     1188.074  1222.121  1238.558          1275.878
10    1307.912  1345.457  1379.418          1413.562
11    1448.595  1467.049  1514.946          1560.206

Figure 20 - Execution time (s)



For the first 3 load levels, the various solutions show minimal differences in execution times; as soon as 3 client containers are exceeded, the differences become more pronounced.

The Classic solution has the smallest impact on execution time, while the MetroFunnel solution has a greater impact.

It can also be noted that the Docker version of MetroFunnel (MetroFunnelDocker) has an even worse effect on execution times than the Standard version.

Below is the ratio of the execution times of the 3 monitoring solutions to the execution times in the absence of monitoring.

Since the results show no precise trend, and indeed a certain randomness, repeated tests were carried out and analysed statistically in depth, to verify whether these results were due to measurement errors or whether the type of monitoring is a significant factor in the execution times.

The table below shows the average execution times measured with the different types of monitoring and the different load values.

Figure 21 - Rate execution time



As anticipated, even under the same conditions (Load and Monitoring), there is a certain randomness in the results across repetitions.

We therefore chose to perform an ANOVA test, to check whether the difference between the monitoring methodologies was due to random and uncontrollable phenomena or to the monitoring itself. First we verified whether the monitoring factor was significant, and then we quantified the importance of this factor.

The analysis was divided according to the load. Note that we did not carry out a two-factor analysis (Load and Monitoring) to see which of the two factors is more influential; instead, we carried out 3 separate analyses of the single Monitoring factor, one per load value, to determine how important the Monitoring factor is with a load of 20% (2 containers), 50% (6 containers) and 100% (12 containers).

Using the JMP software, the normality and homoscedasticity of the residuals were verified first, and the appropriate ANOVA test was chosen based on the results.

Table 8 - Execution time test repeated



Although the visual test is passed, the Shapiro-Wilk test is not; therefore, for the monitoring results with load 2, the residuals are not normal.

Figure 22 - Residual normality test with load 2



In this case, both the visual test and the Shapiro-Wilk test are passed, so the residuals with load 6 are normal.




Also in this case, both the visual test and the Shapiro-Wilk test are passed, so the residuals with load 12 are normal.

After analysing the normality of the residuals, the homoscedasticity tests of the residuals are shown, divided according to the load.




As can be seen, the Levene test is passed, so the residuals with load 2 are homoscedastic.

Figure 25 - Homoscedasticity test of the residuals with load 2



The Levene test is not passed, so the residuals with load 6 are not homoscedastic.




The Levene test is passed, so the residuals with load 12 are homoscedastic.

In summary, we have the following characteristics of the residuals.

Table 9 - Residual test summary

Load  Normality  Homoscedasticity  Test
2     NO         YES               Kruskal-Wallis
6     YES        NO                Welch's test
12    YES        YES               ANOVA - F test

The Test column of the table reports the ANOVA test that is executed in each case.
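The selection rule summarised in Table 9 can be written as a small decision function. This is a sketch of the rule as we read it from the table; the case in which both assumptions fail does not occur in our data, and falling back to Kruskal-Wallis there is an assumption of this sketch.

```python
def choose_test(residuals_normal, residuals_homoscedastic):
    """Pick the significance test from the two residual checks."""
    if not residuals_normal:
        # Non-parametric alternative when normality does not hold.
        return "Kruskal-Wallis"
    if not residuals_homoscedastic:
        # Normal residuals with unequal variances.
        return "Welch's test"
    # Both assumptions hold: classic one-way ANOVA.
    return "ANOVA - F test"


# Outcome of the Shapiro-Wilk and Levene checks per load, from Table 9.
CHECKS = {2: (False, True), 6: (True, False), 12: (True, True)}

for load, (normal, homoscedastic) in CHECKS.items():
    print(load, choose_test(normal, homoscedastic))
```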




The null hypothesis is rejected; the monitoring factor is significant with a load of 2.

Figure 28 - Anova test with load 2








All three analyses are passed, so the monitoring factor is significant for all three load values. We then calculated the importance of this factor in the results.

The calculated values of SST, SSA and SSE are shown in the table.
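For reference, the sketch below shows how SST, SSA and SSE are obtained in a one-way layout and how the importance of the factor follows as SSA/SST. The values in this work were computed with JMP; the execution times used here are invented purely for illustration and are not our measurements.

```python
from statistics import mean


def variation_explained(groups):
    """Allocate the variation of a one-factor experiment.

    `groups` maps each level of the factor (here, the monitoring type)
    to the list of observations collected at that level.
    Returns (SST, SSA, SSE, percentage of variation explained by the factor).
    """
    all_values = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_values)
    sst = sum((y - grand_mean) ** 2 for y in all_values)
    ssa = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
    sse = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    return sst, ssa, sse, 100 * ssa / sst


# Invented execution times (seconds), NOT the measured ones.
EXAMPLE = {
    "Off": [660.0, 664.0, 662.0],
    "Classic": [672.0, 675.0, 670.0],
    "MetroFunnel": [668.0, 661.0, 666.0],
}

sst, ssa, sse, importance = variation_explained(EXAMPLE)
print(f"SST={sst:.2f} SSA={ssa:.2f} SSE={sse:.2f} importance={importance:.2f}%")
```

Since SST = SSA + SSE, the percentage reported as the importance of the factor is the share of the total variation that is not attributed to experimental error.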




As can be seen from the table, the Monitoring factor accounts for 62.70% of the variation in test execution time with a load of 20% (2 containers), for 93.18% with a load of 6 containers and for 86.23% with the maximum load.

Given the importance of the factor, the graph shows the ratios between the execution times with the various monitoring solutions and the execution times in the absence of monitoring for the repeated tests, averaged over the repetitions.

We can therefore state that Classic monitoring has an incidence of 2% on execution times, which does not vary significantly with the load.

By contrast, MetroFunnel monitoring, in both the Standard and the Docker version, has a load-dependent effect on execution times, equal to 3% and 6% respectively with a load of 6, and to 6% and 7.5% with a load of 12.

Table 10 - Calculation of the importance of the monitoring factor

Figure 31 - Rate execution time test repeated

The performance difference between the Standard and the Docker version of MetroFunnel is due to how the CPU is managed by the Linux kernel and by Docker.

Both use CFS (Completely Fair Scheduler), through cgroups and CPU shares. cgroups is a feature of the Linux kernel for allocating and limiting the resources (in this case, CPU) of groups of processes; every running process belongs to a cgroup, which is scheduled according to its CPU share.

CPU shares assign a time slot to each cgroup, during which every task in the cgroup has the opportunity to perform its operations.

The Standard version and the Docker version of MetroFunnel differ precisely in the cgroup they belong to.

In the first case, the Standard version can use all the CPU share assigned to its own cgroup, while in the second case the quota is divided among the various containers, because all the containers belong to the same cgroup.

Thus, the Docker version uses part of the CPU cycles that, with the Standard version, would be used by the other containers, slowing down the execution of the processes in those containers.
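The effect can be made concrete with a deliberately simplified model. The sketch below assumes a fully contended CPU, one top-level group for host processes and one Docker parent group with equal weights, and equal weights for all containers; these are assumptions of the example, and real schedulers enforce shares only while the CPU is saturated.

```python
def per_container_fraction(containers_in_docker_group, host_groups=1):
    """CPU fraction of one container under full contention (toy model).

    The CPU is first split equally between `host_groups` top-level groups
    and the Docker parent group; the Docker slice is then split equally
    among the containers it holds.
    """
    docker_slice = 1 / (host_groups + 1)
    return docker_slice / containers_in_docker_group


# Standard version: MetroFunnel and Filebeat run as host processes,
# so the Docker group holds only the eleven Clearwater containers.
standard = per_container_fraction(11)

# Docker version: two more containers share the same Docker slice.
dockerised = per_container_fraction(13)

print(f"standard: {standard:.4f}  docker: {dockerised:.4f}")
```

In this model each Clearwater container loses a small part of its CPU fraction when the two monitoring containers join the same group, which is the qualitative behaviour described above; the absolute numbers have no experimental meaning.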

In fact, computing statistics on the execution times of the microservices from the logs produced by MetroFunnel (Standard and Docker version) during the test at maximum load, we see that the times of all microservices are stretched by 1-2 milliseconds, depending on the microservice.

Multiplying this delay by the number of requests (about 250 thousand) divided by the number of simultaneous test suites (12), a difference of about 40 seconds is obtained, which is roughly the difference in average execution time between the two tests.
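The estimate can be reproduced as a worked calculation. The figures are the approximate ones quoted in the text; the function name is ours.

```python
TOTAL_REQUESTS = 250_000   # approximate number of requests at maximum load
SIMULTANEOUS_SUITES = 12   # client containers running in parallel


def extra_seconds_per_suite(delay_ms):
    """Extra execution time of one suite for a given per-request delay."""
    requests_per_suite = TOTAL_REQUESTS / SIMULTANEOUS_SUITES
    return requests_per_suite * delay_ms / 1000


for delay_ms in (1, 2):
    print(f"{delay_ms} ms per request -> {extra_seconds_per_suite(delay_ms):.1f} s")
```

Each suite issues about 20,800 requests, so a delay between 1 and 2 ms per request corresponds to roughly 21 to 42 seconds; the upper end matches the figure of about 40 seconds.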

The execution times are calculated as the difference between the timestamps of the incoming request and of the outgoing response; since these timestamps have nanosecond resolution and the difference is in the order of milliseconds, it can be excluded that it depends only on measurement errors.



5.3.4 Bandwidth

After considering the incoming data and the test execution time as a function of the load, we now analyse the incoming bandwidth used during the execution of the tests.

The bandwidth used is calculated from the previous results, namely the incoming data and the execution times as a function of the load.

The results in the table are average values expressed in KB/s.

Table 11 - Bandwidth incoming (KB/s)

Load  Off    Classic  MetroFunnel  MetroFunnelDocker
1     6.70   9.88     9.10         9.05
2     11.97  18.00    16.10        15.68
3     17.11  26.61    22.55        22.14
4     22.11  33.65    27.80        27.45
5     25.41  38.12    32.44        31.46
6     26.57  39.48    33.48        32.53
7     27.58  42.88    34.92        33.62
8     28.43  43.75    35.66        34.67
9     28.89  43.70    35.96        35.03
10    29.14  44.05    35.61        35.09
11    28.84  44.51    36.42        34.86
12    28.90  43.35    36.60        34.67
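As an illustration, the load-1 row of Table 11 can be recomputed by dividing the incoming data of Table 4 by the execution times of Table 7. The sketch uses only that row: with a single client container there is one measurement per solution, while for higher loads the published values are averages, so a direct division of the tabulated averages can differ in the second decimal.

```python
def bandwidth_kb_s(data_kb, seconds):
    """Average incoming bandwidth over a test run, in KB/s."""
    return data_kb / seconds


# Load 1: (incoming data in KB from Table 4, execution time in s from Table 7)
LOAD_1 = {
    "Off": (4275.160, 637.709),
    "Classic": (6369.280, 644.754),
    "MetroFunnel": (5836.80, 641.278),
    "MetroFunnelDocker": (5856.010, 647.060),
}

for solution, (data_kb, seconds) in LOAD_1.items():
    print(f"{solution:18s} {bandwidth_kb_s(data_kb, seconds):.2f} KB/s")
```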

Figure 32 - Bandwidth incoming (KB/s)



As can be seen from the table, the bandwidth values used are well below the measured bandwidth shown at the beginning of the chapter. This indicates that the results shown were not altered by random network traffic phenomena. Moreover, from the graph it is possible to notice that:

• As expected, bandwidth consumption is lowest in the absence of monitoring.

• The Classic solution is the one with the highest bandwidth consumption.

• The Standard version of MetroFunnel has a slightly higher bandwidth consumption than the Docker version. This is because, for the same incoming data, it has a shorter execution time.

• Given the above considerations, in the ratio between incoming data and execution time the dominant factor is the incoming data, because the increase in the numerator (incoming data) is greater than the increase in the denominator (execution time).

Also in this case, in order to validate the results, the bandwidth values were calculated from the results of the 5 repeated tests on incoming data and execution times for the 3 load values. The first table shows the average values.

The second table shows the rate of bandwidth increase and the bandwidth saved between the Classic and MetroFunnel solutions.

Table 12 - Bandwidth incoming test repeated (KB/s)

Load  Off    Classic  MetroFunnel  MetroFunnelDocker
2     12.03  17.87    15.65        15.83
6     27.24  39.24    33.54        32.78
12    29.49  43.20    36.16        35.60

Table 13 - Ratio bandwidth incoming

Load  Classic/Off  MetroFunnel/Off  MetroFunnelDocker/Off  MetroFunnel/Classic
2     48.57%       30.14%           31.59%                 -12.41%
6     44.06%       23.13%           20.35%                 -14.53%
12    46.48%       22.61%           20.69%                 -16.29%



As can be seen, with the MetroFunnel solution there is a bandwidth increase of about 20%, while with the Classic solution the increase is about 45%. This means a bandwidth saving of about 16%.

Figure 33 - Bandwidth incoming (KB/s)

5.4 Failure Analysis

At this point we move on to the functional comparison between the Classic and MetroFunnel methodologies. In this phase we verify that the information in the MetroFunnel log is actually useful for monitoring purposes and that it is at least comparable to the information in the default Clearwater log.

Five failure cases are shown. The first three are spontaneous failures of the application; the other two are forced by terminating the processes inside the containers.

The spontaneous failures that occurred during the analysis with monitoring Off or Classic were not taken into account, because no MetroFunnel log is available for comparison. Instead, in the analysis phase with monitoring via MetroFunnel, both in the Standard version and in the Docker version, it was possible to collect the Clearwater internal logs manually (and therefore not via Filebeat) and save them in a local folder on the Server machine, in order to perform the analysis and the comparisons later.

This clarification is added so that the reader is not led to think that the failures occurred only with monitoring through MetroFunnel, or that MetroFunnel was somehow one of their triggers.

5.4.1 Failure 503

The test was in progress using the Standard version of MetroFunnel; Clearwater had been restarted and the 10 minutes required for startup had elapsed. At this point the test with 5 simultaneous containers was launched on the client side, and the shells running the test containers all showed the following error message:

RuntimeError thrown:

Account creation failed with HTTP code 503, body

{"status": 503, "message": "Service Unavailable", "reason":

"No available numbers", "detail": {}, "error": true}

The MetroFunnel log contains entries such as the following:

1. [POST,/accounts/[email protected]/numbers/,

192.168.1.2,46594,172.18.0.12,80,503,2.312776,1,0, Request - Response]

With the help of the test execution pattern seen in section 5.1.2, we can see that the registration requests arrive at the Ellis node (172.18.0.12 - TCP 80), which responds with code 503. In addition, the log does not contain the consequent requests to the Homer and Homestead-prov nodes (steps 1a and 1b of the pattern). Therefore, the Ellis node does not send those requests; otherwise they would be present in the log, either as a normal line or as a request reported by MetroFunnel as timed out.
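This kind of reasoning on the MetroFunnel log is easy to automate. The sketch below parses a log line like the one shown above into named fields. The field names are our reading of the line and are assumptions of the example; the trailing fields are kept uninterpreted, and the account part of the URI is reproduced exactly as it appears in the excerpt.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    method: str
    uri: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    status: int
    elapsed: float   # response time as written in the log
    extra: list      # remaining fields, left uninterpreted


def parse_line(line):
    """Split one comma-separated MetroFunnel log line into a LogRecord."""
    fields = [f.strip() for f in line.strip().strip("[]").split(",")]
    if len(fields) < 8:
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    return LogRecord(
        method=fields[0], uri=fields[1],
        src_ip=fields[2], src_port=int(fields[3]),
        dst_ip=fields[4], dst_port=int(fields[5]),
        status=int(fields[6]), elapsed=float(fields[7]),
        extra=fields[8:],
    )


SAMPLE = ("[POST,/accounts/[email protected]/numbers/, "
          "192.168.1.2,46594,172.18.0.12,80,503,2.312776,1,0, "
          "Request - Response]")

record = parse_line(SAMPLE)
if record.status >= 500:
    print(f"{record.method} {record.uri} -> {record.dst_ip}:{record.dst_port} "
          f"failed with {record.status}")
```

Filtering on the destination address and on the status code is enough to isolate the failing Ellis requests and to notice that no follow-up request towards Homer or Homestead-prov was logged.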

The following is an extract from the Ellis log:



1. 05-01-2018 17:24:18.650 UTC ERROR homestead.py:41: Failed to ping

Homestead at http://homestead-prov:8889/ping. Have you configured your

HOMESTEAD_URL?

2. …

3. 05-01-2018 17:34:27.160 UTC WARNING numbers.py:128: No available numbers

4. 05-01-2018 17:34:27.160 UTC WARNING _base.py:138: 503 POST

/accounts/[email protected]/numbers/ (0.0.0.0): No available numbers

5. 05-01-2018 17:34:27.160 UTC ERROR web.py:1447: 503 POST

/accounts/[email protected]/numbers/ (0.0.0.0) 3.48ms

In the other Clearwater nodes, there is no information in the logs that can help in interpreting the failure.

The information regarding the microservices is the same; in addition, the Clearwater log contains an error concerning the failed ping to the Homestead-prov node. Probably, because this problem occurred during the startup phase, the Ellis node is already aware that the Homestead-prov node is not available, and therefore does not forward the number registration requests but directly responds with code 503.

In this case the Clearwater log adds information, but only because the problem occurred in the startup phase; as we will see later, for problems during normal operation the information in the MetroFunnel log is comparable to the information in the Clearwater log.

5.4.2 Failure 181

During the tests, occasionally and with different load values, a single test within the entire suite, in only one client container, failed with the following message:

Endpoint threw exception:

- Expected 100, got 181 (call ID 070443892a6498e2825f51731f5bcaff)

From the error message it is easy to understand that this is not a failure at the microservice level, since the HTTP/1.1 standard defines no status code 181. To demonstrate this, we first show two extracts from the MetroFunnel log and two extracts from the Clearwater log, then an extract from the log that the live-test framework produces when a test fails.

In the MetroFunnel log, all the microservice calls end with code 200/201 and there are no timed-out requests; in particular, there are two requests with the same call-id, both ending with code 200:

1. [POST, /call-id/070443892a6498e2825f51731f5bcaff,

172.18.0.11,57338,172.18.0.9,10888, 200, 0.467959]

2. [POST, /call-id/070443892a6498e2825f51731f5bcaff,

172.18.0.11,41848,172.18.0.9,10888, 200,0.980225]

These two lines are extracted from the Ralf node log:

1. 30-12-2017 11:17:27.073 UTC 200 POST /call-id/

070443892a6498e2825f51731f5bcaff 0.000247 seconds

2. 30-12-2017 11:17:28.986 UTC 200 POST /call-id/

070443892a6498e2825f51731f5bcaff 0.000147 seconds

No other Clearwater log contains entries that have the same call-id or that can be
associated with the error message.
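The cross-check above, from the call ID in the error message to the matching log rows, can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `extract_call_id` and `rows_for_call_id` are hypothetical, and the position of the status code in a row is inferred from the two MetroFunnel extracts shown above.

```python
import re

# Assumption: the call ID is the 32-hex-digit token quoted after "call ID".
CALL_ID_RE = re.compile(r"call ID ([0-9a-f]{32})")


def extract_call_id(error_message):
    """Return the call ID quoted in a live-test error message, or None."""
    match = CALL_ID_RE.search(error_message)
    return match.group(1) if match else None


def rows_for_call_id(call_id, log_lines):
    """Keep only the log rows whose text mentions the given call ID."""
    return [line for line in log_lines if call_id in line]


error = "- Expected 100, got 181 (call ID 070443892a6498e2825f51731f5bcaff)"
metrofunnel_rows = [
    "[POST, /call-id/070443892a6498e2825f51731f5bcaff, "
    "172.18.0.11,57338,172.18.0.9,10888, 200, 0.467959]",
    "[POST, /call-id/070443892a6498e2825f51731f5bcaff, "
    "172.18.0.11,41848,172.18.0.9,10888, 200,0.980225]",
]

call_id = extract_call_id(error)
matches = rows_for_call_id(call_id, metrofunnel_rows)
# Assumption: the status code is the second-to-last comma-separated field.
statuses = [row.rstrip("]").split(",")[-2].strip() for row in matches]
print(call_id, statuses)
```

The same filter can be applied to each Clearwater log file; in the case discussed here only the Ralf log would return rows.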

The live-test log is instead the following:

Endpoint on 46839 received:

SIP/2.0 181 Call Is Being Forwarded

Content-Length: 0

Via: SIP/2.0/TCP

We can therefore conclude that the failure is not at the microservice level but is due to the
SIP protocol, which is not monitored.



5.4.3 Failure 502 – Homestead-prov (forced kill)

This failure is caused by the forced termination of the homestead-prov process inside the
Homestead-prov container during the execution of a test suite.

To do this, the following command was used:

docker exec homestead-prov kill -9 216

The following error message is displayed on the client:

Account creation failed with HTTP code 502, body {"status": 502, "message": "Bad

Gateway", "reason": "Upstream request failed", "detail": {"Upstream error": "502",

"Upstream URL": "http://homestead-prov:8889/private/6505550742%40example.com"},

"error": true}

The MetroFunnel log is the following; as explained in chapter 4, the order in which the rows
appear in the log follows the replies received, not the requests sent.

1. PUT,/private/6505550742%40example.com,

172.18.0.12,48260,172.18.0.7,8889,502,0.432941,2,1,Request - Response

2. GET,/private/6505550742%40example.com/associated_implicit_registration_sets,


3. PUT,/org.etsi.ngn.simservs/users/sip%3A6505550742%40example.com/simservs.x

ml, 172.18.0.12,35254,172.18.0.8,7888,200,0.258912,2,1,Request - Response

4. GET,/public/sip%3A6505550742%40example.com/associated_private_ids,


5. POST,/accounts/[email protected]/numbers/,


As can be seen, all the Homestead-prov requests (lines 1, 2 and 4) have code 502. The request
to the Homer node (172.18.0.8, TCP port 7888), line 3, has code 200. Finally, the last line is
the response to the client, with code 502, reporting that the registration of the number was
not successful.



We now show an extract of the Ellis node log, followed by the Homer and Homestead-prov
logs. As explained in the previous chapters, by default the Ellis log contains only the
requests sent, with log level INFO; the responses are not logged.

Only in the event of an error are the responses added to the log of the node that made the
request, with log level WARNING or ERROR.

1. 10-01-2018 10:33:01.794 UTC INFO homestead.py:272: Sending HTTP PUT

request to http://homestead-prov:8889/private/6505550742%40example.com

2. 10-01-2018 10:33:01.795 UTC INFO homestead.py:272: Sending HTTP GET

request to http://homestead-

prov:8889/private/6505550742%40example.com/associated_implicit_registration_

sets

3. 10-01-2018 10:33:01.797 UTC INFO xdm.py:29: Sending HTTP PUT request to

http://homer:7888/org.etsi.ngn.simservs/users/sip%3A6505550742%40example.co

m/simservs.xml

4. 10-01-2018 10:33:01.798 UTC WARNING utils.py:53: Non-OK HTTP response.

HTTP 502: Bad Gateway

5. 10-01-2018 10:33:01.798 UTC WARNING numbers.py:180: Failed to update all

the backends

6. 10-01-2018 10:33:01.798 UTC INFO homestead.py:253: Sending HTTP GET

request to http://homestead-

prov:8889/public/sip%3A6505550742%40example.com/associated_private_ids


HTTP 502: Bad Gateway


HTTPResponse(code=502,request_time=0.0060460567474365234,buffer=<_io.B

ytesIO object at

0x7f4fc9c56e90>,_body=None,time_info={},request=<tornado.httpclient.HTTPR

equest object at 0x7f4fc9bcac90>,effective_url='http://homestead-



prov:8889/public/sip%3A6505550742%40example.com/associated_private_ids',he

aders={'Date': 'Wed, 10 Jan 2018 10:33:01 GMT', 'Content-Length': '181',

'Content-Type': 'text/html', 'Connection': 'close', 'Server': 'nginx/1.4.6

(Ubuntu)'},error=HTTPError('HTTP 502: Bad Gateway',))

9. 10-01-2018 10:33:01.807 UTC WARNING numbers.py:192: Backed out changes

after failure



Lines 1, 2, 3 and 6 are the requests made to the Homestead-prov and Homer nodes.

Lines 4-5 and 7-8 contain information about the failed responses from the Homestead-prov
node (the Homer node answered with code 200).

Finally, line 10 contains the response sent by Ellis to the client.

As expected, the Homestead-prov log contains no rows with a timestamp later than the
termination of the process.

The Homer log instead is the following:

1. 10-01-2018 10:33:01.799 UTC INFO base.py:259: Received request from

localhost - PUT

http://http_homer/org.etsi.ngn.simservs/users/sip%3A6505550742%40example.co

m/simservs.xml

2. 10-01-2018 10:33:01.800 UTC INFO xsd.py:51: Performing XSD validation

3. 10-01-2018 10:33:01.802 UTC INFO base.py:272: Sending 200 response to

localhost for PUT

http://http_homer/org.etsi.ngn.simservs/users/sip%3A6505550742%40example.co

m/simservs.xml

As can be seen, the information in the MetroFunnel log and the information obtained from
the Clearwater logs are roughly the same.



5.4.4 Failure 502 – Homer (forced kill)

This failure is caused by the forced termination of the homer process inside the Homer
container during the execution of a test suite.

To do this, the following command was used:

docker exec homer kill -9 221

The following error message is displayed on the client:

Account creation failed with HTTP code 502, body {"status": 502, "message": "Bad

Gateway", "reason": "Upstream request failed", "detail": {"Upstream error": "502",

"Upstream URL":

"http://homer:7888/org.etsi.ngn.simservs/users/sip%3A6505550622%40example.com/sim

servs.xml"}, "error": true}

The error message is the same as in the previous case, except for the upstream URL.

For brevity, only the log rows that differ from the previous case are reported.

MetroFunnel log:

1. PUT,/org.etsi.ngn.simservs/users/sip%3A6505550622%40example.com/simservs.x

ml, 172.18.0.12,39156,172.18.0.8,7888,502,0.368791,3,2,Request – Response

2. POST,/accounts/[email protected]/numbers/

,172.18.0.13,55378,172.18.0.12,80,502,56.252505,2,1,Request - Response

Ellis log:

1. 10-01-2018 10:40:24.378 UTC INFO xdm.py:29: Sending HTTP PUT request to

http://homer:7888/org.etsi.ngn.simservs/users/sip%3A6505550622%40example.co

m/simservs.xml


HTTPResponse(code=502,request_time=0.004508018493652344,buffer=<_io.Byt

esIO object at

0x7f23a2249a70>,_body=None,time_info={},request=<tornado.httpclient.HTTP



Request object at

0x7f23a2244f90>,effective_url='http://homer:7888/org.etsi.ngn.simservs/users/sip

%3A6505550622%40example.com/simservs.xml',headers={'Date': 'Wed, 10 Jan

2018 10:40:24 GMT', 'Content-Length': '181', 'Content-Type': 'text/html',

'Connection': 'close', 'Server': 'nginx/1.4.6 (Ubuntu)'},error=HTTPError('HTTP

502: Bad Gateway',))

3. ….



In this case too, the information in the MetroFunnel log and the information obtained from
the Clearwater logs are roughly the same.

5.4.5 Failure 504 – Overload

As mentioned previously, with a load value of 15 containers the Ellis node crashes, which is
why the performance assessment does not use load values above 12.

In this phase, however, we take advantage of this limit to compare how many and which
failures can be detected with the Clearwater logs and with the MetroFunnel log.

After a couple of tests, all the containers show an error message like this:

RuntimeError thrown:

Account creation failed with HTTP code 504, body <html>

<head><title>504 Gateway Time-out</title></head>

or the following, depending on whether the failure occurred during the insertion of a
telephone number or during its deletion:

Leaked sip:[email protected], DELETE returned 504

Failed

RestClient::GatewayTimeout thrown:

504 Gateway Timeout



The MetroFunnel log is as follows (only a few lines are shown):

1. DELETE,

/accounts/[email protected]/numbers/sip%3A6505550627%40example.com

,192.168.1.2,47564,172.18.0.12,80,999,9175.693857,19,23, Request – TIMEOUT

2. DELETE,

/accounts/[email protected]/numbers/sip%3A6505550664%40example.com

,192.168.1.2,50094,172.18.0.12,80,999,8971.865984,20,22, Request - TIMEOUT

3. POST,

/accounts/[email protected]/numbers/,192.168.1.2,33282,172.18.0.12,80,

999,8192.985622,22,15, Request – TIMEOUT

In particular, the MetroFunnel log detects 203 timed-out requests: 36 DELETE, 158 POST,
8 PUT and 1 GET.

Furthermore, the source and destination IP addresses show that 193 of these 203 requests
were made by the client node, while the other 10 were made by the Ellis node to the Homer
and Homestead-prov nodes.
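The breakdown by method and by origin can be reproduced with a short script over the MetroFunnel rows. The sketch below is only illustrative: the column order (method, URL, source IP and port, destination IP and port, code, further columns, row type) is inferred from the extracts in this section, `ACCOUNT` is a placeholder for the account identifier, and the names `parse_row`, `summarize_timeouts` and `CLIENT_IP` are hypothetical.

```python
from collections import Counter

# Assumption: in the overload extracts the client appears as 192.168.1.2.
CLIENT_IP = "192.168.1.2"


def parse_row(line):
    """Split one MetroFunnel row into named fields (column order inferred)."""
    fields = [f.strip() for f in line.split(",")]
    return {
        "method": fields[0],
        "url": fields[1],
        "src_ip": fields[2],
        "dst_ip": fields[4],
        "code": int(fields[6]),
        "kind": fields[-1],
    }


def summarize_timeouts(lines):
    """Count timed-out requests per HTTP method and per origin."""
    by_method, by_origin = Counter(), Counter()
    for line in lines:
        row = parse_row(line)
        if not row["kind"].endswith("TIMEOUT"):
            continue  # answered requests are not counted
        by_method[row["method"]] += 1
        origin = "client" if row["src_ip"] == CLIENT_IP else "internal"
        by_origin[origin] += 1
    return by_method, by_origin


# Sample rows adapted from the extracts; ACCOUNT is a placeholder.
SAMPLE_ROWS = [
    "DELETE, /accounts/ACCOUNT/numbers/sip%3A6505550627%40example.com,"
    "192.168.1.2,47564,172.18.0.12,80,999,9175.693857,19,23, Request - TIMEOUT",
    "POST, /accounts/ACCOUNT/numbers/,"
    "192.168.1.2,33282,172.18.0.12,80,999,8192.985622,22,15, Request - TIMEOUT",
    "DELETE, /private/6505550086%40example.com, "
    "172.18.0.12,38582,172.18.0.7,8889,999,9209.536692,20,2, Request - TIMEOUT",
    "PUT,/private/6505550742%40example.com, "
    "172.18.0.12,48260,172.18.0.7,8889,502,0.432941,2,1,Request - Response",
]

by_method, by_origin = summarize_timeouts(SAMPLE_ROWS)
print(dict(by_method), dict(by_origin))
```

Applied to the full log of this test, the same counting is what yields the 36/158/8/1 split by method and the 193/10 split by origin reported above.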

These 193 requests also appear in the MetroFunnel log as responses not associated with any
request; they are error responses generated at the HTTP level, and indeed all of them have
response code 500, 502 or 504.

1. NULL, NULL, 192.168.1.2,47564,172.18.0.12,80,500,1134430.152055,1,0,

NO REQUEST - Response

2. NULL, NULL, 192.168.1.2,50094,172.18.0.12,80,502,1134431.107602,1,0,


3. NULL, NULL, 192.168.1.2,33282,172.18.0.12,80,502,1134432.563724,2,1,


MetroFunnel allows a timeout to be configured and uses this value to identify unanswered
requests. In this case the timeouts are not false positives, because the responses detected
after the timeout, and therefore not associated with any request, carry 50X error codes.

Each pair must therefore be considered as two lines related to the same event: in the first,
MetroFunnel detects the failure; the second confirms it.
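One way to pair a timed-out request with its late response is to match the two rows on the connection tuple (source IP, source port, destination IP, destination port), which is identical in the extracts above. The following sketch assumes the same inferred column order as the log extracts; `ACCOUNT` is a placeholder and the function names are hypothetical. It ignores port reuse, which a real implementation would have to handle, for example by also comparing timestamps.

```python
def parse_row(line):
    """Split one MetroFunnel row into named fields (column order inferred)."""
    fields = [f.strip() for f in line.split(",")]
    return {
        "method": fields[0],
        "src_ip": fields[2],
        "src_port": int(fields[3]),
        "dst_ip": fields[4],
        "dst_port": int(fields[5]),
        "code": int(fields[6]),
        "kind": fields[-1],
    }


def connection(row):
    """Connection tuple used to relate a request to a late response."""
    return (row["src_ip"], row["src_port"], row["dst_ip"], row["dst_port"])


def confirmed_timeouts(lines):
    """Map each timed-out connection to the late 5xx code, or None."""
    rows = [parse_row(line) for line in lines]
    # Responses without a request are logged with NULL method and URL.
    late_codes = {
        connection(r): r["code"]
        for r in rows
        if r["method"] == "NULL" and 500 <= r["code"] <= 599
    }
    return {
        connection(r): late_codes.get(connection(r))
        for r in rows
        if r["kind"].endswith("TIMEOUT")
    }


SAMPLE_ROWS = [
    "DELETE, /accounts/ACCOUNT/numbers/sip%3A6505550627%40example.com,"
    "192.168.1.2,47564,172.18.0.12,80,999,9175.693857,19,23, Request - TIMEOUT",
    "DELETE, /private/6505550086%40example.com, "
    "172.18.0.12,38582,172.18.0.7,8889,999,9209.536692,20,2, Request - TIMEOUT",
    "NULL, NULL, 192.168.1.2,47564,172.18.0.12,80,500,1134430.152055,1,0, "
    "NO REQUEST - Response",
]

result = confirmed_timeouts(SAMPLE_ROWS)
print(result)
```

In this sample the client request on port 47564 is confirmed by a late 500 response, while the internal request from Ellis has no late response and must be verified against the Clearwater logs, as done in the rest of this section.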

At this point we analyse in more depth the other 10 requests that MetroFunnel detects as
failures. To do this we compare the requests and responses found in the logs of Ellis, Homer
and Homestead-prov. For each request URL detected as failed, we count how many times that
URL appears in the various logs; this comparison is possible because, in the communications
between Ellis and the Homer and Homestead-prov nodes, the URL contains the telephone
number associated with the request.

The MetroFunnel log related to the failures is as follows:

1. PUT,

/org.etsi.ngn.simservs/users/sip%3A6505550490%40example.com/simservs.xml,

172.18.0.12,52844,172.18.0.8,7888,999,9210.065918,19,3, Request – TIMEOUT

2. DELETE, /private/6505550086%40example.com,

172.18.0.12,38582,172.18.0.7,8889,999,9209.536692,20,2, Request – TIMEOUT

3. DELETE, /private/6505550036%40example.com,

172.18.0.12,38586,172.18.0.7,8889,999,9209.063506,21,1, Request – TIMEOUT

4. PUT, /irs/315d5e00-a64d-44cc-a283-323f380006c1/service_profiles/ccca53d4-

40b5-4dcc-99d1-7b8f960665ea/filter_criteria,

172.18.0.12,38578,172.18.0.7,8889,999,9213.987397,1,8, Request – TIMEOUT

5. DELETE,

/org.etsi.ngn.simservs/users/sip%3A6505550086%40example.com/simservs.xml,

172.18.0.12,52838,172.18.0.8,7888,999,9213.405119,2,7, Request – TIMEOUT

6. PUT,


172.18.0.12,52830,172.18.0.8,7888,999,9212.817478,3,6, Request – TIMEOUT

7. PUT,




172.18.0.12,52834,172.18.0.8,7888,999,9211.938706,4,5, Request – TIMEOUT

8. DELETE,


172.18.0.12,52842,172.18.0.8,7888,999,8175.936295,19,1, Request – TIMEOUT

9. GET, /public/sip%3A6505550996%40example.com/associated_private_ids,

172.18.0.12,38428,172.18.0.7,8889,999,8174.563955,21,2, Request – TIMEOUT

10. PUT,


172.18.0.12,52678,172.18.0.8,7888,999,8176.211097,18,3, Request - TIMEOUT

The table shows the number of times that the same URL appears within the logs.

Table 14 - Number of rows with the same URL

Failure   MetroFunnel   Ellis   Homer   Homestead
1         5+1           6+1     5+5     -
2         19+1          20+1    -       19+19
3         11+1          12+1    -       11+11
4         1             1+1     -       0
5         5+1           6+1     5+5     -
6         6+1           7+1     6+6     -
7         8+1           9+1     8+8     -
8         4+1           5+1     4+4     -
9         3+1           4+1     -       4+4
10        6+1           7+1     7+7     -

The MetroFunnel column gives the number of answered requests + the number of unanswered
requests.

The Ellis column, given how its log is organized, gives the number of requests sent + the
number of error responses (as described in the previous paragraph, Ellis logs only the error
responses it receives).

The Homer and Homestead columns give the number of requests received + the number of
responses sent.

As can be seen, in all 10 cases where MetroFunnel reported a timed-out request, Ellis reports
that it received a non-OK HTTP response, as shown in the log below:



20-01-2018 10:45:43.420 UTC WARNING utils.py:53: Non-OK HTTP response.

HTTPResponse(code=599,request_time=30.03533101081848,buffer=None,_body=None,t

ime_info={},request=<tornado.httpclient.HTTPRequest object at

0x7fbb69be0710>,effective_url='http://homestead-

prov:8889/public/sip%3A6505550996%40example.com/associated_private_ids',headers=

{},error=HTTPError('HTTP 599: Timeout',))

In both the Homer and the Homestead-prov logs, instead, the number of requests received is
one lower than the number of requests detected by MetroFunnel and recorded in the Ellis log.

Failures 9 and 10 are a special case. In both, MetroFunnel detects a timed-out request and
Ellis logs an error response, while in the Homer and Homestead-prov logs the number of
requests received and answered correctly (with response code 20X) matches the number of
requests sent by Ellis.
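The comparison of Table 14 can also be checked mechanically. The sketch below encodes, for each failure, the total number of requests seen by MetroFunnel (answered plus unanswered), the number of requests sent by Ellis and the number of requests received by the backend node, all taken from the table; for failure 4 the single value in the MetroFunnel column is read as one request in total. The labels returned by `classify` are hypothetical names for the two situations described above, not a diagnosis of the cause.

```python
# failure: (MetroFunnel requests, Ellis requests sent, backend requests received)
TABLE_14 = {
    1: (6, 6, 5),
    2: (20, 20, 19),
    3: (12, 12, 11),
    4: (1, 1, 0),
    5: (6, 6, 5),
    6: (7, 7, 6),
    7: (9, 9, 8),
    8: (5, 5, 4),
    9: (4, 4, 4),
    10: (7, 7, 7),
}


def classify(sent, received):
    """Label a failure by comparing requests sent and requests received."""
    if received == sent - 1:
        return "request missing from backend log"
    if received == sent:
        return "request present in backend log"
    return "inconsistent"


# MetroFunnel and Ellis agree on the number of requests in every row.
counts_agree = all(mf == sent for mf, sent, _ in TABLE_14.values())

labels = {
    failure: classify(sent, received)
    for failure, (_, sent, received) in TABLE_14.items()
}
special_cases = sorted(
    f for f, label in labels.items() if label == "request present in backend log"
)
print(counts_agree, special_cases)
```

With the values of Table 14, the only failures for which the backend log contains every request sent by Ellis are 9 and 10, which is the special case discussed above.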

Having verified that the 203 failures detected by MetroFunnel are all real, we now check how
many failures are detected within the Clearwater logs; since the presence of the previous 10
failures has already been ascertained, we look for the remaining 193.

The Ellis log contains 68 lines with "ERROR" level, of which 50 can be associated with
client requests, while the other 18 are errors related to detected exceptions.

Even if we count all 68 events as failed client requests, the number of failures detected is
much lower than the number detected by MetroFunnel (68 instead of 193).

So in this case MetroFunnel captured all the failures, while the Clearwater log contains only
about 35% of them (68 out of 193).

5.4.6 Further considerations on failure analysis

As can be seen, when the failure occurs in the secondary nodes, which are not directly
connected to the clients (Homer and Homestead-prov), both the MetroFunnel log and the
Ellis log contain all the failures.

When the node that fails is the first node connected to the client, the Clearwater log
contains about 65% fewer failures than the MetroFunnel log, a value close to what is also
found in the study on rule-based logging.

The MetroFunnel log also records the execution time of each microservice, which makes
performance monitoring possible; this cannot be done with the Clearwater logs, as they do not
contain this information.

Furthermore, since the data is preformatted, as shown during the performance analyses, it is
easy to import into programs such as JMP for statistical analysis or into Kibana for fast
performance monitoring.

Another difference in favour of MetroFunnel is the centralized log.

In this case the Clearwater nodes, and therefore their logs, are distributed, even though they
all reside on the same physical machine. MetroFunnel, instead, monitors the traffic exchanged
at the network interface level (in this case the Docker bridge, the virtual interface of Docker
that connects all containers). This makes it possible to monitor all connected nodes at the
same time and to obtain a single log containing both the requests and the responses, even
when requests are nested.

Although the information content is the same, MetroFunnel is clearly ahead in how easily
and how quickly the information can be read.

Moreover, because the MetroFunnel log is a single log, it is much easier to recognize
execution patterns; with a distributed and fragmented log this is much more complicated and,
above all, almost impossible to do in real time.



Conclusions

MetroFunnel, together with the ELK stack configured as shown in the previous chapters, is a
tool ready to be used to monitor microservices; it is the instrument that was missing, and
not a simple improvement to something that already existed.

The principle of operation may seem simple: it builds on approaches that are already in use
and adds the ideas from the study of rule-based logging.

MetroFunnel is effective (it can monitor performance and detect failures), transparent
(neither the application nor its users need to be aware that monitoring is active) and not
intrusive (the behaviour of the application is not changed).

The results show that it is a valid tool, both in terms of the performance impact of using
MetroFunnel and in terms of failure detection: in the tests performed it detected all the
failures, unlike the internal logs of the test application.



Future developments

MetroFunnel can be a starting point for further developments, such as improving debugging
in the event of failure; this can be done by including in the log the information that the
microservices exchange via JSON or XML.

Another point that can be improved is its performance impact on the server. The tests were
performed on a simple desktop machine, so on a better-performing server, with more resources
available (CPU and RAM), this impact is very likely to be lower; in any case, it can be
reduced by adopting more efficient algorithms or better-performing support libraries.


Bibliography

[1] Leonard Richardson, Sam Ruby, “RESTful Web Services”, O’Reilly, 2007

[2] Bhakti Mehta, “RESTful Java Patterns and Best Practices”, Packt Publishing, 2014

[3] Marcello Cinque, Domenico Cotroneo, Antonio Pecchia, “Event Logs for the

Analysis of Software Failures: A Rule-Based Approach”, IEEE Transactions on software

engineering, vol. 39, 806-821, Jun. 2013

[4] Nicola Dragoni, Saverio Giallorenzo, Alberto Lluch Lafuente, Manuel Mazzara,

Fabrizio Montesi, Ruslan Mustafin, Larisa Safina, “Present and Ulterior Software

Engineering”, Springer, 195-216, 2017.

[5] Cesare Pautasso, Olaf Zimmermann, Frank Leymann, “RESTful Web Services vs.

“Big” Web Services: Making the Right Architectural Decision”, Proceedings of the 17th

international conference on World Wide Web, ACM, 805-814, 2008

[6] Len Bass, Ingo Weber, Liming Zhu, “DevOps: A Software Architect's Perspective”,

Addison-Wesley Professional, May 2015

[7] Peter Salus, “A Quarter Century of Unix”, Addison-Wesley Professional, May 1994

[8] Spring.io, https://spring.io/blog/2015/07/14/microservices-with-spring, 25/01/2018

[9] SlyTechnologies jNetPcap, http://jnetpcap.com/, 25/01/2018

[10] W3C, https://www.w3.org/Protocols/rfc2616/rfc2616.html, 25/01/2018

[11] Docker, https://www.docker.com/, 25/01/2018

[12] Elastic, https://www.elastic.co/, 25/01/2018

[13] Nagios, https://support.nagios.com/kb/article.php?id=12, 25/01/2018

[14] Linux, https://www.linux.com/news/docker-shipping-container-linux-code, 25/01/2018

[15] Clearwater, http://www.projectclearwater.org/clearwater-microservices-and-docker/, 25/01/2018

[16] LoSviluppatore, http://losviluppatore.it/microservices-architecture-il-pattern-architetturale-emergente-per-le-grandi-applicazioni-moderne/, 25/01/2018

[17] mokabyte, http://www.mokabyte.it/2016/12/microservizi-1/, 25/01/2018

[18] mokabyte, http://www.mokabyte.it/2017/01/microservizi-2/, 25/01/2018

[19] Medium, https://medium.com/aubergine-solutions/real-time-api-performance-monitoring-with-es-beat-logstash-and-grafana-21f67655f41e, 25/01/2018

[20] Amazon, https://aws.amazon.com/it/cloudwatch/, 25/01/2018

[21] Netflix, https://github.com/Netflix/Hystrix/wiki, 25/01/2018

[22] Postman, https://www.getpostman.com/, 25/01/2018

[23] Apiwatcher, https://www.apiwatcher.com/, 25/01/2018
