
Lehrstuhl für Informatik 4 · Verteilte Systeme und Betriebssysteme

Exploiting Hardware-Heterogeneity-Driven Actor Migration in an Energy-Aware Middleware Platform

Thao-Nguyen Do

Masterarbeit im Fach Informatik

30. März 2017

Please cite as: Thao-Nguyen Do, “Exploiting Hardware-Heterogeneity-Driven Actor Migration in an Energy-Aware Middleware Platform”, Master’s Thesis, University of Erlangen, Dept. of Computer Science, March 2017.

www4.cs.fau.de

Friedrich-Alexander-Universität Erlangen-Nürnberg
Department Informatik

Verteilte Systeme und Betriebssysteme

Martensstr. 1 · 91058 Erlangen · Germany


Exploiting Hardware-Heterogeneity-Driven Actor Migration in an Energy-Aware Middleware Platform

Masterarbeit im Fach Informatik

vorgelegt von

Thao-Nguyen Do

geb. am 31. Dezember 1991 in Nürnberg

angefertigt am

Lehrstuhl für Informatik 4
Verteilte Systeme und Betriebssysteme

Department Informatik
Friedrich-Alexander-Universität Erlangen-Nürnberg

Betreuer: Dipl.-Inf. Christopher Eibel, Dr.-Ing. Tobias Distler

Betreuender Hochschullehrer: Prof. Dr.-Ing. habil. Wolfgang Schröder-Preikschat

Beginn der Arbeit: 30. September 2016
Abgabe der Arbeit: 30. März 2017


Erklärung

Ich versichere, dass ich die Arbeit ohne fremde Hilfe und ohne Benutzung anderer als der angegebenen Quellen angefertigt habe und dass die Arbeit in gleicher oder ähnlicher Form noch keiner anderen Prüfungsbehörde vorgelegen hat und von dieser als Teil einer Prüfungsleistung angenommen wurde. Alle Ausführungen, die wörtlich oder sinngemäß übernommen wurden, sind als solche gekennzeichnet.

Declaration

I declare that the work is entirely my own and was produced with no assistance from third parties. I certify that the work has not been submitted in the same or any similar form for assessment to any other examining body and all references, direct and indirect, are indicated as such and have been cited accordingly.

(Thao-Nguyen Do)
Erlangen, 30. März 2017


Abstract

Energy is a precious resource, as its usage incurs economic as well as environmental costs. Decreasing energy consumption is therefore of major interest in order to reduce these costs. However, software is often designed without awareness of the application’s energy consumption. The result is software that uses more resources than it requires to perform its tasks, making it energy-unaware.

The ReApper platform introduced by this work adds energy-saving methods to applications by utilizing configuration and information monitoring at runtime in order to adjust the energy consumption to the application’s workload. This is done without forcing application developers to concern themselves with the underlying mechanisms for reducing the energy consumption of their application. The platform’s configuration methods are extended with migration, which is used to exploit the properties of heterogeneous hardware components in order to further lower an application’s energy consumption.

Compared to execution on a platform without energy awareness, the ReApper platform achieves considerable energy savings with these methods. Energy awareness is added to applications by replacing the underlying execution platform with an energy-aware one. This relieves application developers from having to manually optimize their applications for lower energy consumption.



Kurzfassung

Energie ist eine wertvolle Ressource, weil ihre Nutzung mit wirtschaftlichen und ökologischen Kosten verbunden ist. Aus diesem Grund ist die Reduzierung des Energieverbrauchs von zentralem Interesse, da damit die verbundenen Kosten gesenkt werden. Allerdings wird der Energieverbrauch beim Softwareentwicklungsprozess oftmals nicht beachtet. Die daraus resultierende Software benutzt deswegen häufig mehr Ressourcen und Energie, als notwendig wären, um ihre Aufgaben zu erledigen. Die Software ist deshalb nicht energiegewahr.

In dieser Arbeit wird die ReApper-Plattform vorgestellt, die Anwendungen energieeinsparende Maßnahmen zur Verfügung stellt. Die ReApper-Plattform nutzt dabei Konfigurationsmethoden und die Überwachung von Informationen, um den Energieverbrauch an die tatsächliche Auslastung der Anwendung anzupassen. Dieser Vorgang findet statt, ohne dass sich Anwendungsentwickler mit den darunterliegenden energieeinsparenden Methoden befassen müssen. Darüber hinaus werden die Konfigurationsmethoden der Plattform durch einen Migrationsmechanismus erweitert, der es ermöglicht, die Eigenschaften von heterogener Hardware auszunutzen, um den Energiebedarf noch weiter zu senken.

Die ReApper-Plattform erreicht mit ihren Methoden beträchtliche Einsparungen in Bezug auf den Energieverbrauch im Vergleich zur Ausführung der Anwendung auf einer nicht energiegewahren Plattform. Energiegewahrsamkeit wird somit durch den Austausch der anwendungsausführenden Plattform erreicht. Folglich müssen Anwendungsentwickler den Energieverbrauch ihrer Anwendungen nicht mehr von Hand optimieren und können diesen Aspekt durch die Wahl ihrer Ausführungsplattform berücksichtigen.



Contents

Abstract
Kurzfassung

1 Introduction
  1.1 Motivation
  1.2 Outline

2 Fundamentals
  2.1 Actor Model
    2.1.1 Origins of the Actor Model
    2.1.2 Actors and their Behavior
    2.1.3 Actor Communication
    2.1.4 Actor System
    2.1.5 Distributing Actors
    2.1.6 Use Cases
    2.1.7 Benefits of the Actor Model
  2.2 Akka Toolkit
    2.2.1 History and Development
    2.2.2 Akka Overview
    2.2.3 Actor Hierarchy
    2.2.4 Actor Implementation
    2.2.5 Benefits of Akka
  2.3 Energy-aware Computing
    2.3.1 Energy Measuring Methods
    2.3.2 Reconfiguration for Energy Efficiency
  2.4 Distribution and Migration
    2.4.1 Migration Levels
      2.4.1.1 Virtual-machine Migration
      2.4.1.2 Actor Migration
    2.4.2 Heterogeneous Hardware
  2.5 Conclusion

3 Design and Architecture
  3.1 Concept
  3.2 Hardware Level
  3.3 Operating-system Level
  3.4 Platform Level
  3.5 Application Level
  3.6 Conclusion

4 Implementation
  4.1 Prerequisites
    4.1.1 Actor Structure
    4.1.2 Actor Behavior
    4.1.3 Actor Supervision
    4.1.4 Actor Lifecycle
    4.1.5 Actor Systems
    4.1.6 Remoting
  4.2 ReApper Middleware Platform
  4.3 ReApper Actor System
  4.4 Configurator
    4.4.1 Thread Configurator
    4.4.2 Thread Pinner
    4.4.3 Core Controller
    4.4.4 Power Capper
  4.5 Gatherer
    4.5.1 Performance Information
    4.5.2 Energy and Power Information
    4.5.3 System and Process Information
    4.5.4 Application-level Information
  4.6 Distributed Actor Platform on Heterogeneous Hardware
    4.6.1 Actor Migration
    4.6.2 Exploiting Heterogeneous Hardware
  4.7 Conclusion

5 Analysis
  5.1 Evaluation
    5.1.1 Environment and Setup
    5.1.2 Key–value Store
    5.1.3 Data-stream Processing
  5.2 Discussion
  5.3 Related Work
    5.3.1 Energy Modeling and Profiling
    5.3.2 Migration and Scaling

6 Conclusion

Lists
  List of Acronyms
  List of Figures
  List of Tables
  Bibliography


1 Introduction

Many of today’s services, like social media and electronic communication, are part of daily life and form a vital piece of the digital infrastructure for many people. Most of those services are hosted in data centers, which use large amounts of energy and thus put a big strain on the environment and the power infrastructure. The energy consumption of these data centers has grown considerably, as described in [1]: the total energy consumption of data centers in the United States has reached more than 60 billion kWh per year. The energy used by these data centers is not only generated from environmentally friendly energy sources, but also from sources that emit large amounts of carbon dioxide. Aside from energy, water is also consumed in the operation of data centers, as it is used for cooling purposes; this puts additional strain on the environment. Apart from the environmental effects, there is also a substantial economic interest in keeping the energy consumption as low as possible. For instance, an estimated 23 percent of the monthly costs of a data center are caused by power costs [2]. This is why a lot of effort is put into lowering the energy consumption of data centers [3], and metrics have even been devised to measure the performance of green data centers [4]. This is achieved on the one hand by employing more energy-efficient hardware and more efficient cooling techniques; on the other hand, software can also be modified to address this issue. Energy awareness is usually not taken into consideration when new software is developed. However, adding this aspect enables software to save energy, which can be particularly effective for dynamic applications with varying utilization.

1.1 Motivation

There are approaches to consolidate the server load on as few machines as possible, as this allows those machines to run more efficiently, while unused machines can be powered off or suspended. This improves energy efficiency, but does not address another fundamental problem: the static allocation of resources. Resources for applications are often statically allocated during the startup process or the initialization of an application. Consequently, applications occupy more resources than they might currently need for their execution. This wastes energy, as resources are supplied with power even though they are not required or used. Allocating too few resources, however, has the opposite effect and restricts the application’s performance unnecessarily. Thus, static allocation is often disadvantageous for the changing resource needs of dynamic applications.

This work proposes to use dynamic configuration of a machine’s resources to adjust the application’s resource usage to the actual workload experienced by the application. This is achieved by directly restricting system resources when the application has low levels of utilization and making them available again when the workload increases. Furthermore, the responsibility of configuring a machine’s resources is taken from the application developer, as this process is handled by this work’s platform, which provides an execution platform for applications. In addition to the configuration, information provided by the machine and the application is constantly monitored in order to recognize changes in the application’s demand for resources and to trigger a reconfiguration of the machine if required. Such a reconfiguration affects a multitude of resources, for example, CPU power limits, CPU core count, and thread count. Apart from the configuration of a single machine, this approach is also applied to multiple machines in order to provide scaling capabilities to the application. The configuration options are extended by a migration mechanism which allows the application, or parts of it, to be moved to other machines. The migration mechanism in combination with heterogeneous hardware enables moving the execution of the application to a more energy-efficient machine at the expense of the maximum achievable performance. This type of hardware is used for workload levels which can be handled by more energy-efficient hardware without degrading the experienced service quality of the application. For workload levels which can no longer be handled by that kind of hardware, the application is moved to a more powerful but less efficient machine. This exploits the different properties, such as energy efficiency, of heterogeneous hardware components in order to lower the overall energy consumption of the application’s execution. The result is the ReApper¹ platform, which allows applications to be executed in an energy-aware manner while also providing the capability to scale the application.
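The workload-dependent machine choice described above can be illustrated with a small sketch. The class name, the capacity parameter, and the single-threshold rule are assumptions for illustration only; ReApper's actual decision logic is described in Chapters 3 and 4.

```java
// Hypothetical sketch of a threshold-based placement policy: workloads the
// energy-efficient machine can handle stay there; higher workloads move to
// a powerful but less efficient machine. Names and values are illustrative.
public class PlacementPolicy {
    public enum Machine { ENERGY_EFFICIENT, HIGH_PERFORMANCE }

    // Requests per second the efficient machine can serve without
    // degrading service quality (an assumed, measured capacity).
    private final double efficientCapacity;

    public PlacementPolicy(double efficientCapacity) {
        this.efficientCapacity = efficientCapacity;
    }

    /** Chooses the target machine class for the observed workload. */
    public Machine target(double observedLoad) {
        return observedLoad <= efficientCapacity
                ? Machine.ENERGY_EFFICIENT   // workload fits: save energy
                : Machine.HIGH_PERFORMANCE;  // workload too high: keep quality
    }

    public static void main(String[] args) {
        PlacementPolicy policy = new PlacementPolicy(1000.0);
        System.out.println(policy.target(200.0));   // ENERGY_EFFICIENT
        System.out.println(policy.target(5000.0));  // HIGH_PERFORMANCE
    }
}
```

In practice such a policy would also need hysteresis so that a workload hovering around the threshold does not trigger constant back-and-forth migration.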

1.2 Outline

This thesis starts with Chapter 2, which introduces fundamental technologies and concepts used in ReApper, such as the actor model, the framework which serves as ReApper’s basis, and background on the usage of energy and migration in this work. Then, the design and architecture of ReApper are presented in Chapter 3. After that, the implementation of ReApper is described in Chapter 4. Chapter 5 is an analysis of the ReApper platform, which includes the evaluation, a brief discussion of some of ReApper’s aspects, and related work. The thesis is concluded in Chapter 6, which summarizes this work.

¹ Short form for runtime for energy-aware–actor applications with migration-enabled reconfiguration, pronounced reaper.


2 Fundamentals

In order to implement an energy-aware middleware platform, a basic runtime environment is needed. In this work’s scenario, two aspects are of special interest: distribution and dynamic resource allocation. The actor model, introduced in the first section, serves as an excellent foundation for this work’s platform, as it provides dynamic allocation of resources and an abstraction layer for separating computation from resources. This is achieved by packaging computation in units called actors, which can be moved across threads, processors, and machines. These actors are also decoupled from underlying threads, as resource allocation is taken care of by the actor runtime environment. Resources can be assigned and removed dynamically and are provided to the actors on demand.

Afterwards, the Akka toolkit is presented. It contains tools and a runtime environment for concurrent, distributed, and message-driven applications on the Java Virtual Machine (JVM). The Akka actor module implements the actor model, which is in turn used as this work’s base runtime. Support for distributing actors is added by another module of the Akka toolkit called Akka Remote, which is used to distribute actors and enables the migration of actors. The Akka toolkit was chosen as it is the replacement for Scala’s Actor API [5] and has a large user base, thus including a wide potential audience for this work.

Following that, the background for the migration used in this work is presented. Two types of hardware are of interest for this work’s migration. On one side, there is the hardware used for executing code, that is, processing machines. They can be divided into different classes with different foci and substantially varying properties regarding computing power and energy consumption. In general, smaller devices use less energy while larger devices provide more processing power.

Then, there are devices and methods used to gather data about energy consumption. As there are different types of information of interest, these devices also differ considerably. Measurements with different purposes are needed to capture both a system’s entire energy consumption as well as the energy consumption of certain highly energy-demanding components. Devices for measuring a whole system’s energy consumption are employed in conjunction with methods which are able to measure the central processing unit (CPU) and memory by themselves.

After that, different levels of migration are described. These levels differ substantially regarding costs and granularity. Ultimately, application-level migration was chosen, and the reasons for that choice are explained in the last section.


2.1 Actor Model

In traditional programming, concurrency is often difficult and tedious to implement [6, 7]. Developers have to manually request resources like threads or processes, distribute computation steps and responsibilities, and coordinate access to shared data structures. Neglecting any of those concerns could result in excessive resource usage, below-expected performance, or possibly software failure. Providing more resources than necessary increases energy consumption while performance does not increase. At the same time, low performance could be the effect of insufficient or incorrectly assigned resources. Consequently, a lot of development time has to be spent on addressing those issues. Further problems are caused by static resource allocation and the close interweaving of functionality and threads. The actor model alleviates some of those matters by employing a concurrency model which avoids manual synchronization and separates functionality from underlying threads. This frees developers from concerning themselves with concurrency coordination and allows the dynamic assignment and removal of resources.

2.1.1 Origins of the Actor Model

The actor model was initially conceived by Carl Hewitt as a concept for concurrency [8, 9]. Several works refine the actor model and add further aspects [10, 11]. Actors were proposed as the main execution unit instead of a sequential progression of code execution. Actors can be executed concurrently without affecting each other. They serve as the primitive in computation, resulting in programs being entirely composed of actors. These actors advance computation by communicating with other actors and processing the messages they receive. All actors are limited to message passing as their sole means of communication, and state sharing between actors is not allowed, as that could possibly lead to concurrent access of memory and thus cause incorrect behavior. In return, actors are not bound to a location, which means that they can be assigned to any thread, processor, or machine, allowing them to move freely. This allows several actors in different locations to form a distributed system. Communication between actors is done in an asynchronous way, so that once a message is sent, the sending actor is free to continue its computation. The delivery of a message is handled by the actor system.

2.1.2 Actors and their Behavior

An actor consists of an address and a mailbox. On top of those components, an actor has a behavior as well. There is also a list of acquaintances, which contains actors that are valid targets for communication. The actor also contains a state, in case information has to be preserved in between the processing of two messages. However, an actor’s state can only be modified by the actor itself. Each actor only knows about its local state, and any information outside of the actor has to be communicated through messages. Figure 2.1 shows the composition of an actor. The behavior of an actor defines how a message is processed. First, a message is taken from the message queue (1). Processing a message usually means either creating new actors (2a), sending a message (2b), or changing the behavior for processing the next message (2c). Depending on the message, the action taken could also be to discard the received message without notifying the sender (not depicted in Figure 2.1). Creating an actor adds that actor to the creator’s list of acquaintances, and it becomes possible to send messages to that actor (3a). Messages can also contain actor addresses, which can be added to an actor’s acquaintances. The behavior for processing the next message is changed either by replacing the actor itself or by changing the actor’s state.
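The mailbox-plus-behavior structure described above can be sketched in a few lines of plain Java. This is a deliberately single-threaded, didactic sketch (all names are assumptions), not Akka's API; it shows messages being dequeued one at a time and the behavior being swapped while processing a message.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Minimal illustrative actor: a mailbox (message queue), a current behavior,
// and a become() operation that changes how the NEXT message is processed.
public class MiniActor {
    private final Queue<String> mailbox = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();
    private Consumer<String> behavior;

    public MiniActor() {
        // Initial behavior: greet; on "switch", change behavior (step 2c).
        this.behavior = msg -> {
            if (msg.equals("switch")) {
                become(m -> log("upper: " + m.toUpperCase()));
            } else {
                log("hello: " + msg);
            }
        };
    }

    private void log(String s) { out.append(s).append('\n'); }
    public String log() { return out.toString(); }

    private void become(Consumer<String> next) { behavior = next; } // 2c

    public void send(String msg) { mailbox.add(msg); }  // enqueue a message

    /** Dequeues and processes queued messages strictly one at a time (step 1). */
    public void processAll() {
        String msg;
        while ((msg = mailbox.poll()) != null) {
            behavior.accept(msg);
        }
    }
}
```

Sending "a", "switch", and then "b" yields "hello: a" followed by "upper: B": the behavior swap only affects messages processed after it.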


[Figure 2.1 shows an actor composed of a message queue (m0, m1, m2, ..., mn), message-processing logic with functions, operations, and methods, internal state, and an acquaintance list; the annotations (1), (2a), (2b), (2c), and (3a) mark the processing steps referenced in the text.]

Figure 2.1 – Schematic of an actor’s behavior

For instance, an actor could provide some computation service. One actor serves as that service’s leader, and clients can request the service by sending a message to the leader. The provided service could require computation which needs some time to complete. In order to serve a request, the leader forwards the request to one of its workers, who then processes the request. To process several requests at the same time, the leader assigns more workers; as all workers can work concurrently, this cuts the time needed to process multiple requests. However, the number of workers is limited by the budget of the service. When a certain number or volume of requests is reached, the leader rejects subsequent requests to prevent overloading the service. In this example, each worker is equivalent to an actor. The leader actor serves as the interface to the service. It creates actors as its workers and distributes incoming requests to those actors. When a lot of requests are made, additional actors can be created to serve the increased load. The available resources for processing (i.e., threads, processors, and machines) limit the number of concurrently active actors. Even though there is no conceptual limit to the number of actors, creating more actors while all available resources are already utilized to their limits will not speed up the computation. It would just start to process more requests and increase the number of requests in progress. Overall, each individual request would progress more slowly, as fewer resources are available to dedicate to one request. Rejecting requests after a certain threshold is equivalent to changing an actor’s behavior. This example shows that systems of actors can behave similarly to how work-sharing is performed in the real world. However, communication in actor systems works differently from how it does in the real world.
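The leader/worker scheme above can be sketched as a small class. All names and the fixed budget are illustrative assumptions; a real actor system would run the workers concurrently and exchange asynchronous messages, whereas this deterministic sketch only shows the structure: the leader creates workers, distributes requests round-robin, and rejects requests beyond its budget.

```java
import java.util.ArrayList;
import java.util.List;

// Didactic sketch of a leader that fronts a computation service.
public class Leader {
    static class Worker {
        final List<String> handled = new ArrayList<>();
        void process(String request) { handled.add(request); } // stand-in for real work
    }

    private final List<Worker> workers = new ArrayList<>();
    private final int budget;   // maximum number of requests the service accepts
    private int accepted = 0;
    private int next = 0;       // round-robin distribution index

    public Leader(int workerCount, int budget) {
        for (int i = 0; i < workerCount; i++) {
            workers.add(new Worker());          // leader creates its worker actors
        }
        this.budget = budget;
    }

    /** Returns true if the request was accepted and forwarded to a worker. */
    public boolean request(String req) {
        if (accepted >= budget) {
            return false;                       // "changed behavior": reject
        }
        workers.get(next).process(req);         // forward to the next worker
        next = (next + 1) % workers.size();
        accepted++;
        return true;
    }

    public Worker worker(int i) { return workers.get(i); }
}
```

With two workers and a budget of three, requests r1 and r3 land on the first worker, r2 on the second, and a fourth request is rejected.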

Message Processing: Messages are processed in the order in which they arrive. The order in which messages arrive is not determined, so a message A sent before another message B could arrive after B has been processed. The order is not set randomly; rather, it depends on the circumstances in which a message is sent. These circumstances are significantly influenced by the environment in which the message is sent. The only guarantee provided is that once a message has been sent, it will be delivered. However, there is no guaranteed time after which a message has to be delivered. In practice, messages are delivered on a best-effort basis, as connections between different machines are prone to failures which would make a delivery impossible. These properties have to be taken into consideration when actor-based systems are built. Either countermeasures are implemented to handle this situation, or actors have to be indifferent to the order of arrival. To force a certain order, one could implement a system to acknowledge each request and reorder messages during their processing. Depending on the complexity of the communication, it could also require actors to coordinate with each other. Different actors are able to handle messages concurrently, but each actor has to process its own messages one at a time. This sequential handling, in addition to each actor’s isolated state, guarantees that state changes are always made in a consistent fashion.
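The reordering countermeasure mentioned above (tagging messages with sequence numbers and buffering out-of-order arrivals) can be sketched as a small helper. The class and its names are illustrative assumptions, not part of any actor library:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Buffers messages that arrive out of order and releases them in
// sequence-number order, as a receiver-side countermeasure against
// non-deterministic arrival order.
public class Resequencer {
    private final Map<Long, String> pending = new HashMap<>();
    private final List<String> delivered = new ArrayList<>();
    private long expected = 0;  // next sequence number to release

    /** Accepts message `seq`; releases it and any buffered successors in order. */
    public void receive(long seq, String payload) {
        pending.put(seq, payload);
        String next;
        while ((next = pending.remove(expected)) != null) {
            delivered.add(next);    // process strictly in sequence order
            expected++;
        }
    }

    public List<String> delivered() { return delivered; }
}
```

If messages 1, 0, 2 arrive in that order, nothing is released until 0 shows up, after which 0 and the buffered 1 are released together, followed by 2.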

Asynchrony: Asynchrony plays an important role in the actor model. Sending messages is purely asynchronous, and actors are not blocked when they send messages during message processing. This avoids dependencies between actors which would make reasoning about deadlocks in such a system highly challenging. As computation is driven solely by messages, replies are the only way to receive results. Consequently, an actor either has to wait for an answer synchronously while blocking, or it waits asynchronously. Waiting asynchronously is done by receiving a future, which can be seen as a placeholder for the result. This future does not contain the result, but is instead a container which is filled once an answer has been received. Actors can subscribe to a future, given that they have access to it, and will receive a notification upon completion of the result. This is done in a completely asynchronous way without requiring the actor to wait actively. Furthermore, message processing in general should be performed as asynchronously as possible. This allows actor resources to be used by other actors, which results in an increased level of concurrency.
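The future-based asynchronous waiting described above can be sketched with the JDK's own CompletableFuture: the asking side gets a placeholder immediately, subscribes a callback, and is never blocked, while the answering side fills the placeholder later. (Akka provides a comparable ask pattern; this sketch deliberately uses only the JDK, and the method name is an illustrative assumption.)

```java
import java.util.concurrent.CompletableFuture;

public class FutureReply {
    /** Subscribes to the reply without blocking and records the result. */
    public static String askAndSubscribe() {
        StringBuilder received = new StringBuilder();
        CompletableFuture<String> reply = new CompletableFuture<>();

        // The "sender" registers a callback on the placeholder and moves on;
        // it never waits actively for the answer.
        reply.thenAccept(result -> received.append("got: ").append(result));

        // ... the sender could process other messages here ...

        // The "receiver" eventually fills the placeholder with the answer,
        // which triggers the subscribed callback.
        reply.complete("42");
        return received.toString();
    }

    public static void main(String[] args) {
        System.out.println(askAndSubscribe()); // prints "got: 42"
    }
}
```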

2.1.3 Actor Communication

The communication between actors involves several different concepts. The address is used to find the recipient of a message and deliver the message to it. The mailbox is used to store an actor's received messages. Furthermore, messages adhere to a certain form, which is explained as the message composition.

Address Communication between actors requires them to have addresses. These addresses identify which actor is supposed to receive a message and, respectively, where a message is supposed to be delivered. However, addresses cannot be used to identify an actor itself. Actor addresses are not bound to an actor and can be reassigned to another actor. Usually, such a reassignment does not change the semantic meaning behind an address. The communication protocol between two actors is, so to speak, a contract between them. This allows them to expect a certain answer when an actor sends a message. It resembles an interface in other programming models, where the interface represents all possible interactions available to clients. Besides not being bound to a single actor, addresses can also point to a group of actors. Sending a message to an address could therefore prompt several actors to receive that message. Consequently, an actor sending a message does not know who exactly is going to receive it. Actor addresses behave very similarly to addresses in the real world. They can be used to send a message to a certain person, but the sender cannot know whether the intended receiver is still residing at that address. In the real world, a message which cannot be delivered is usually sent back to the sender. In actor systems that is not necessarily the case, as the receiver – even if it was not the intended one – is still allowed to process that message. This means that messages might be used, discarded, or returned. The actor's behavior defines how a message is handled from this point on.

Mailbox In addition to an address, actors have mailboxes where messages are stored before they are processed. When a message reaches an actor, it is put into the mailbox first. The actor then processes the messages in the order they arrived. Conceptually, the mailbox does not have a limit on the number of messages and no message gets rejected. However, this does not mean that the actor itself will not reject a message during processing. In practice, mailboxes have a maximum size, as buffers, storage, and memory are limited. Messages are not shared between actors, as the memory allocated to a message is not shared. Every time a message is delivered to an actor, a copy is made and placed in the receiving actor's mailbox. This ensures that synchronization is limited to adding messages to and removing messages from a mailbox. Furthermore, it also enables actors to communicate with each other even when they do not have access to shared memory.
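The copy-on-delivery rule can be sketched as follows; the dictionary message and the mailbox queue are illustrative stand-ins for a real runtime's data structures. Because the receiver gets a copy, later mutations by the sender cannot leak into the receiver's view:

```python
import copy
from queue import Queue

mailbox = Queue()
original = {"type": "update", "values": [1, 2, 3]}

# Delivery copies the message, so sender and receiver never share memory.
mailbox.put(copy.deepcopy(original))

original["values"].append(4)        # sender mutates its own data afterwards

delivered = mailbox.get()
assert delivered["values"] == [1, 2, 3]   # the receiver's copy is unaffected
assert original["values"] == [1, 2, 3, 4]
```

Restricting shared access to the mailbox itself is what makes the enqueue/dequeue operations the only places where synchronization is required.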

Message Composition Messages sent between actors contain at least a destination address and some content. The actor model does not limit the message content to any form or type, but messages should refrain from containing pointers or references to local data or memory. As communication can occur between two different machines, a pointer or reference cannot be dereferenced on another machine, making such data useless. Other than that, a message can contain any data or payload, like in other messaging systems. Thus, messages can be used to model a function call as communication between actors. An actor would put the function name and its parameters into a message and send that message to an actor which provides the desired function. The result would be similar to remote procedure calls, but implemented as communication between actors. In such a system, the request message to another actor corresponds to a remote procedure call, and the answer message with a return value corresponds to the return value of a completed remote procedure call.
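Modeling a function call as a message can be sketched like this; the field names (`sender`, `function`, `args`) and the `add` operation are invented for illustration and do not follow any particular RPC convention:

```python
# A request message carries the function name and its parameters; the reply
# message carries the return value back to the sender's address.
def calculator_receive(message):
    if message["function"] == "add":
        return {"reply_to": message["sender"],
                "result": sum(message["args"])}
    return {"reply_to": message["sender"], "result": None}

# "Remote procedure call" expressed as a request/reply message pair:
request = {"sender": "actor-A", "function": "add", "args": [2, 3]}
reply = calculator_receive(request)
assert reply == {"reply_to": "actor-A", "result": 5}
```

Note that the arguments are plain values rather than references, in line with the restriction that messages must not contain pointers to local memory.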

Integration of Entities Outside of Actor Systems Although every entity is supposed to be an actorin the actor model, developers can still use external code which has not adopted the actormodel. They can wrap that code into actors, but it is also possible to communicate with actorswithout being an actor. External entities are allowed to send messages to actors as long asthey have knowledge of an actor’s address. In case a response is expected, either a future or adummy actor has to be prepared in order to be able to receive the response message. Thisallows the external entity to wait for the future or for the dummy actor to return the expectedresult.

2.1.4 Actor System

Up until now, only actors and their interaction have been described. However, actors also need an environment, which is responsible for assigning processing resources and for communication, in order to operate. That job is performed by a runtime environment for actors called an actor system, as shown in Figure 2.2.

[Figure 2.2 – Message delivery in an actor system: (1) execute Actor0 on Machine A, (2) send message, (3) copy/serialize message, (4) transfer to Machine B, (5) deserialize and place in queue, (6) execute Actor1]

Because the actor model does not impose any rules on actor systems, their implementation details differ depending on the design of the actor framework. Actor systems allocate threads for actors, so that the actors can be executed (in Figure 2.2: (1) and (6)). This is handled according to the scheduling strategy employed by the actor system. For instance, an actor could get to process a certain amount of messages before the thread is reassigned to another actor. Another strategy could be to assign actors to dedicated threads, which would defer scheduling to the operating system (OS)'s thread-scheduling strategy. Accordingly, other thread-scheduling strategies can also be applied to actor scheduling. Actors which do not have any queued messages will not be scheduled. Aside from execution, actor systems are also in charge of delivering messages. When an actor sends a message, it hands the message to the actor system, which makes a copy of the message and puts that copy into the receiver's mailbox (in Figure 2.2: (2) and (3)). Access to an actor's mailbox has to be synchronized, as messages can be added while older messages are being processed on another thread. In case the target actor is located on another machine, the actor system has to serialize the message, transfer it to the machine which hosts the target actor (in Figure 2.2: (4)), and deserialize it on the other machine before finally adding it to the target's message queue (in Figure 2.2: (5)). Additionally, actor systems keep a registry of actors in order to be able to deliver messages to all created actors. When an actor is created, it is registered with the actor system. The actor system also saves the addresses associated with an actor, as well as notes about remote actors. Even though the actor model does not use synchronization mechanisms, the actor system, as its runtime, has to use synchronization in order to ensure safe access to data structures shared between actors and the actor system. However, application developers who apply the actor model to their programs are still relieved of using synchronization manually.
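One of the scheduling strategies mentioned above – letting an actor process a bounded number of messages before the thread moves on – can be sketched as a simple round-robin loop. The `THROUGHPUT` constant, the `Actor` class, and `run_scheduler` are all illustrative assumptions, not the mechanism of any concrete actor system:

```python
from collections import deque
from queue import Queue

THROUGHPUT = 2   # assumed budget: messages an actor may process before yielding

class Actor:
    def __init__(self, name):
        self.name = name
        self.mailbox = Queue()
        self.log = []             # records processed messages for inspection

    def receive(self, message):
        self.log.append(message)

def run_scheduler(actors):
    """Round-robin: each actor gets the 'thread' for at most THROUGHPUT messages."""
    runnable = deque(a for a in actors if not a.mailbox.empty())
    while runnable:
        actor = runnable.popleft()
        for _ in range(THROUGHPUT):
            if actor.mailbox.empty():
                break
            actor.receive(actor.mailbox.get())
        if not actor.mailbox.empty():
            runnable.append(actor)   # still has queued messages: reschedule
        # actors with an empty mailbox are simply not scheduled again

a, b = Actor("a"), Actor("b")
for i in range(3):
    a.mailbox.put(f"a{i}")
b.mailbox.put("b0")
run_scheduler([a, b])
assert a.log == ["a0", "a1", "a2"]
assert b.log == ["b0"]
```

Real dispatchers run this loop concurrently on a thread pool; the single-threaded sketch only illustrates the fairness mechanism and the rule that actors without queued messages are not scheduled.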


2.1.5 Distributing Actors

As already mentioned, actors are not bound to one location. The actor model was designed to allow actors to work together even when they are placed on different machines. Distributed execution of actor programs was described in [12], and [13] showcases an actor language with a focus on distribution. Distribution of actors is well established and has become common. However, distribution introduces problems which are not obvious. Firstly, data sent to remote actors has to conform to certain restrictions, which were mentioned earlier. Software developers have to be careful about which data they pass between actors to avoid buggy behavior. Secondly, data transmission between different systems introduces a significant overhead in the form of serialization and networking. The actor systems are forced to serialize data in order to ensure that the other machine is able to reconstruct that data. The computation time needed to perform serialization increases with the size of the messages, which makes large or frequent messages especially expensive with regard to computation time. Similar effects also apply to the networking layer, which has to put the messages into packets. Many messages can produce many packets, all of which require an acknowledgment on Transmission Control Protocol (TCP) connections. Lastly, the semantics and environment vary between different machines. Oftentimes, different machines are equipped with different components or have different configurations. Commands and instructions might behave differently or might not be available, which affects the actor's behavior. Distribution of actors can thus have a significant impact on the performance and behavior of actor programs, and it adds further dimensions to an actor program's scope.

2.1.6 Use Cases

Some use-case examples are given to illustrate the usage of actors in applications. A common example is a plain messaging system, where each actor serves as a client and messages are sent between clients in order to exchange information and notifications. Electronic mail is a concrete example of such a system: each account or participant would be modeled by an actor, and that actor's address would be equivalent to an e-mail address.

Locks can also be implemented with actors by serializing access to a resource. In order to request a lock, the user sends a message to the lock actor, which replies as soon as the lock can be acquired. When the lock user is finished, an unlock message is sent back to the lock actor, which can then notify the next user waiting for the lock. Incoming lock requests have to be saved separately as long as the lock is held, because the lock may still be in use while a new request is being processed. Apart from that, the lock actor needs to be able to process incoming unlock messages at all times in order to prevent deadlocks.
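The lock actor described above can be sketched as follows. The message shapes (`("lock", sender)` / `("unlock", sender)`) and the `notifications` list, which stands in for reply messages, are illustrative choices:

```python
from collections import deque

class LockActor:
    """Serializes access to a resource; waiting requesters are queued."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()        # saved requests while the lock is held
        self.notifications = []       # stands in for reply messages

    def receive(self, message):
        kind, sender = message
        if kind == "lock":
            if self.holder is None:
                self.holder = sender
                self.notifications.append(("granted", sender))
            else:
                # The lock is in use: remember the request for later
                # instead of blocking; the actor stays responsive.
                self.waiting.append(sender)
        elif kind == "unlock" and sender == self.holder:
            # Hand the lock to the next waiter, if any, and notify it.
            self.holder = self.waiting.popleft() if self.waiting else None
            if self.holder is not None:
                self.notifications.append(("granted", self.holder))

lock = LockActor()
lock.receive(("lock", "A"))      # A acquires immediately
lock.receive(("lock", "B"))      # B has to wait
lock.receive(("unlock", "A"))    # A releases; B is granted the lock
assert lock.notifications == [("granted", "A"), ("granted", "B")]
assert lock.holder == "B"
```

Because `receive` never blocks, unlock messages can always be processed, which is exactly the property the text identifies as necessary to prevent deadlocks.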

Actor systems can also be used to model web services. Usually, some kind of endpoint is providedby web services. These endpoints give access to the service by accepting and replying to requests. Aweb service using actors could encapsulate its service in several actors and provide an externallyreachable actor address as its endpoint for external users.

In Section 5.1, two applications are implemented which are used for the platform's evaluation. The first application is a key-value store, which provides a simple service for storing key-data pairs. The second application is a data-stream-processing application, which receives data from a data source and processes it by extracting information.
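As a minimal illustration of the first use case (the actual evaluation applications are described in Section 5.1), a key-value store reduces to an actor whose state is a dictionary and whose protocol consists of `put` and `get` messages; the message fields used here are made up for the sketch:

```python
class KeyValueStoreActor:
    """Stores key-data pairs; all access goes through messages."""
    def __init__(self):
        self.store = {}   # private state, never accessed directly by clients

    def receive(self, message):
        if message["op"] == "put":
            self.store[message["key"]] = message["value"]
        elif message["op"] == "get":
            # In a real actor system this would be sent back as a reply
            # message; returning it directly keeps the sketch short.
            return self.store.get(message["key"])

kv = KeyValueStoreActor()
kv.receive({"op": "put", "key": "answer", "value": 42})
assert kv.receive({"op": "get", "key": "answer"}) == 42
```

Since each actor processes messages sequentially, concurrent `put` requests from many clients cannot corrupt the dictionary.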


2.1.7 Benefits of the Actor Model

The actor model provides a powerful system for concurrency with simplified semantics for application developers. By separating execution and functionality, the actor model enables highly dynamic reconfiguration. Adding distribution to actors further enhances the actor model's reconfiguration capabilities. However, in order to achieve performance competitive with traditional concurrency, the actor runtime requires sophisticated engineering and a refined design; it depends on those factors as much as other performance-critical pieces of software, such as the JVM, do [14]. For a long time, the actor model was not used widely, which Carl Hewitt, the creator of the actor model, acknowledges in [9]. Up until recently, processor performance was increased by improving processor clock speed and performance per clock cycle instead of adding computing cores to a processor. Consequently, models for concurrency did not seem as interesting to computer scientists and engineers then as they do today. As mentioned at the beginning of this section, easy and fast improvement of a single core is not possible anymore and concurrency has moved to the center of attention [6, 7]. This is where the actor model can shine again, by providing tools to cope with the pitfalls of concurrency and by introducing a higher level of abstraction, which seems more natural to humans than the direct use of threads. These properties have helped raise the actor model's popularity and usage, which is reflected by its adoption in services provided by major software companies like Microsoft [15].

2.2 Akka Toolkit

“Akka is a toolkit and runtime for building highly concurrent, distributed, and resilientmessage-driven applications on the JVM.” — Lightbend Inc. [16]

This section presents the Akka toolkit. First, its history and development are described. Then an overview of the fundamental entities and components of Akka is given. After that, the actor hierarchy is explained, which is another fundamental aspect of Akka. Following that, a short explanation of Akka's implementation of actors is given. Then message delivery and actor addressing in Akka are described. Afterwards, Akka's actor system is presented, which is the central component of Akka's actor implementation. Remoting and distribution are explained in the section after that. Lastly, a discussion follows which covers the reasons for using Akka.

2.2.1 History and Development

Akka was first released by Jonas Bonér in 2009 [17], albeit in a slightly different form than the current version. In 2010, Bonér introduced Akka in the form it is currently known [18]. He describes Akka as an implementation of the actor model which provides a better platform to build concurrent and scalable applications. The actor model, supplemented with what he calls the “Let it crash” failure model, resulted in the core philosophy of Akka. In [19] the Akka implementation is compared to the former actor implementation of the Scala Standard Library [5]. Originally, Akka only consisted of the actor implementation; application programming interface (API) bindings for both Scala and Java were provided with this release. Since its first release, Akka has been split into smaller submodules. For this work, the actor and remote modules are of special interest, and the cluster and persistence modules are noteworthy. Akka is available as open source software (OSS) under the guardianship of Lightbend (formerly Typesafe). Today it is freely available from their website [16] and as a set of pre-compiled libraries in various JVM library repositories. The documentation for Akka can be found at [20]; it serves as the foundation for this section.


2.2.2 Akka Overview

Figure 2.3 shows an overview of the Akka components which are commonly used to execute actors. The core component in Akka is the actor system. It takes on a role similar to that of hypervisors for virtual machines (VMs) and provides an execution environment for actors. The actor system manages dispatchers, which are in charge of managing execution resources. As the JVM does not support the execution of actors natively, actors cannot be executed by themselves; instead, they are executed with the help of threads. The dispatcher contains an execution context which provides threads, and it is consequently the component which enables the execution of actors.

All actors are created through and managed by the actor system. By supplying a derived actor class to the actor system, custom actors can be initialized. Derived actor classes contain user-defined behavior, which defines how received messages are processed. In order to retain information between different messages, custom actors can contain state which is accessed during message processing. For the purpose of protecting an actor's integrity, the actor system returns an actor reference which can only be used to communicate with the actor. The actor instance is a separate entity and usually not accessible.

The interaction in actor systems is limited to sending and receiving messages. Communication is performed by sending messages to actor references. Akka provides a number of communication semantics, such as request–reply, request-only, and forwarding. Despite Akka's focus on actors, senders do not necessarily have to be actors. Other entities are also allowed to send and receive messages; however, these entities are executed separately from actors. When a message is sent to an actor reference, it is put into a mailbox. Each message in a mailbox resembles a task for the executor service. Consequently, each message is actually executed by an executor with the help of threads. Actors are purely a semantic concept and are not executed directly.

[Figure 2.3 – Structure of Akka components: the actor system creates actors from derived actor classes and manages dispatchers; senders (not necessarily actors) deliver messages to actor references, which put them into mailboxes; dispatchers use executor services (ThreadPoolExecutor and ForkJoinPool) to execute the actors' message processing]


However, many aspects like actor hierarchy, the implementation of actors, and communicationrequire a more detailed explanation. These concepts are presented in the following sections.

2.2.3 Actor Hierarchy

Actor systems in Akka employ a hierarchy of supervision. That means that actors are always created as child actors of another actor. The only exceptions are the guardian actors, which are managed directly by the Akka runtime. The reason for this requirement is Akka's failure model, which makes parent actors responsible for the failure of child actors. If a failure occurs in a child actor, the parent is obliged to handle it. The failure can be propagated recursively until a supervising actor is able to handle the failure situation. Consequently, this allows subsystems of actors to repair themselves in case of failures, and it facilitates high uptime.

Actor systems should be built like the hierarchy of an economic organization where each functionis provided by a certain division. Each division has supervisors to monitor the division, but is alsooverseen by the leadership of the organization. Work is distributed to each division, which splitsthe tasks up and allocates the subtasks to individual workers. When a problem is encountered, theaffected workers request help from the supervisor in charge of the division, who can then escalatethe problem to higher positions if needed.

Figure 2.4 shows an example of an actor hierarchy. At the top of the figure is the actor system,which manages the guardian actors. These guardian actors supervise several different types of actorsdirectly. On the left-hand side is a group called division 1. This group consists of a supervisor andseveral worker actors. The supervisor is a direct subordinate to the guardians, which makes it atop-level supervisor. The second type of group is located below the guardians (in Figure 2.4: division2). This actor group implements a two-tiered supervision system which subdivides this group intotwo subgroups. Each of these subgroups has its own supervisor and workers. The workers aresupervised by the subgroups’ supervisors, while the subgroups’ supervisors are subordinates of thetop-level supervisor of division 2. Division 2’s top-level supervisor is in turn the direct subordinate ofthe guardian actors. The last type of subordinate is a simple worker which can also be supervised bythe guardians directly. After explaining actor hierarchies in Akka, a description of the implementationof actors in Akka follows in the next section.

2.2.4 Actor Implementation

Akka actors are containers for state, behavior, a mailbox, child actors, and a supervision strategy [21]. However, the containers are not directly accessible and are encapsulated in actor references. Interaction with actors is done entirely through actor references in order to shield the actor's state, resulting in a public outer actor reference and a private inner actor instance. With this design, nothing but the actor itself is able to reach its state, unless the actor publishes its own references or information. Actor references can be passed around and can also be used to send messages to the referenced actor. As the actor is separated into an outer and an inner part, it is possible to restart or reset the actor, or to place the actual actor incarnation on a remote host, without invalidating the actor reference.

[Figure 2.4 – Exemplary actor hierarchy: the actor system's guardians supervise division 1 (a top-level supervisor with several workers), division 2 (a top-level supervisor over two subgroup supervisors, each with their own workers), and a top-level worker]

Actor State The state of an actor is part of the user-defined actor class and may contain any variables. It is strongly recommended to avoid shared variables, as they may invalidate the guarantees made by the actor model. Access to the state does not need any synchronization, as Akka's implementation isolates access to those variables: they can only be modified during that actor's message processing. If an actor fails, its internal state may get corrupted. This is why restarted actors are recreated with a cleared state, as if the actor were created from scratch. However, it is possible to persist an actor's state before it crashes. That persisted state can then be used to automatically recover an actor's state upon restarting. This is done by persisting all received messages and replaying them after restarting the actor.

Message Delivery Akka's message guarantees differ slightly from the actor model's. First of all, messages are no longer guaranteed to arrive. Although messages are usually not lost when the actors reside in one JVM on the same machine, they may still fail to reach the recipient. Secondly, messages from multiple actors can arrive in an undefined order; however, messages sent from one actor to the same recipient are always ordered. That means that if actor A sends messages m0, m1, and m2 to actor B, actor B will receive them in exactly that order (m0, then m1, then m2). If messages m0 and m1 are sent by actor A, but m2 is sent by another actor C to actor B, then m0 is still always delivered before m1, because both were sent by the same actor A. The delivery order of m2, however, is undefined: it can arrive before, between, or after the other messages, because it was sent by another actor and is thus independent of actor A's messages. There are several mailbox implementations to choose from; Akka provides FIFO queues as well as a priority queue. The mailbox order is independent of the arrival order of messages. Any message placed in the mailbox has to be handled by the actor. Failing to handle a message is treated as a failure, which could lead to the termination of the actor. This is different from not processing a message, which actors are allowed to do.
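The per-sender ordering guarantee can be made concrete by enumerating all delivery orders that the guarantee permits for the m0/m1/m2 example above. The enumeration helper below is an illustrative sketch, not Akka code:

```python
# Per-sender FIFO: deliveries may interleave the two senders' streams in any
# way, but each sender's own order is preserved.
from_a = ["m0", "m1"]   # sent by actor A, in this order
from_c = ["m2"]         # sent by actor C

def interleavings(xs, ys):
    """Yield all merges of xs and ys that keep each list's internal order."""
    if not xs:
        yield list(ys); return
    if not ys:
        yield list(xs); return
    for rest in interleavings(xs[1:], ys):
        yield [xs[0]] + rest
    for rest in interleavings(xs, ys[1:]):
        yield [ys[0]] + rest

orders = list(interleavings(from_a, from_c))
# m2 may be delivered before, between, or after m0 and m1 ...
assert ["m2", "m0", "m1"] in orders
assert ["m0", "m2", "m1"] in orders
assert ["m0", "m1", "m2"] in orders
# ... but m0 always precedes m1 in every permitted delivery order.
assert all(o.index("m0") < o.index("m1") for o in orders)
```

Any delivery order outside this set would violate the guarantee; any order inside it is allowed, which is why receivers must not assume anything about the relative position of m2.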

Actor Addressing The actor model requires that actors possess an address, which can be used to send messages to them. Addressing in Akka is separated into several different components, each with its own purpose. Some components are used for messaging actors, while others provide tools for actor identification. Furthermore, the hierarchical nature of actors in Akka has a large influence on addressing. Figure 2.5 shows the relationship between the addressing components and the addressing hierarchy in Akka.

Actor References The entity which most closely resembles an address is the actor reference. It can be used to send messages to a certain actor and is only valid for that actor. An address in the traditional sense is more similar to an ActorPath, which points to a “location”. Each actor has a context which contains the methods of the Actor API [22]. The context also provides the self method, which returns a reference to the actor itself. This reference is attached to all messages sent by the actor, which makes it the reference the receiving actor can use to send an answer back. Actor references are tied to an actor's lifecycle: a reference is valid as long as the corresponding actor is still alive. When the actor is stopped, the reference will still accept messages; however, all received messages are then designated as dead letters and put into the dead-letter mailbox. This mailbox can be used to check for failures or can be accessed for debugging purposes. Each reference contains a path which can be used to locate the actor.

Actor Path An ActorPath is built very similarly to a uniform resource locator (URL). The path breaks down into the transport protocol used (akka.tcp in Figure 2.5), the actor system and host where the actor is located (sys@host:2552 in Figure 2.5), the names of its ancestors (for the MyChild actor in Figure 2.5: /user/parent), and its own name (child). This example is illustrated in the bottom row of Figure 2.5. Contrary to actor references, actor paths are independent of actors and can be created without creating an actor. There are two types of actor paths. The first type are logical actor paths, which represent the supervision hierarchy; this is the type featured in Figure 2.5. The second type are called physical actor paths, and they describe the actual location within one actor system. As actors can be deployed within another actor system on another machine, such actors use a different path: physical actor paths include deployment information about supervisors deployed on other machines. Although actor paths are very similar to file-system paths, they do not incorporate the concept of symbolic links. It is not possible to arbitrarily create an actor path to refer to another actor. Actor paths are always either logical, representing the supervision hierarchy, or physical, describing how to reach an actor.
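Because an ActorPath has the shape of a URL, its components can be pulled apart with an ordinary URL parser. The sketch below uses Python's standard library purely to illustrate the structure of the path from Figure 2.5; it is not how Akka itself parses paths:

```python
from urllib.parse import urlparse

# An Akka actor path: protocol, actor system @ host:port, then the names of
# the actor's ancestors (starting at the user guardian) and its own name.
path = urlparse("akka.tcp://sys@host:2552/user/parent/child")

assert path.scheme == "akka.tcp"    # transport protocol
assert path.username == "sys"       # actor-system name
assert path.hostname == "host"
assert path.port == 2552

elements = path.path.strip("/").split("/")
assert elements == ["user", "parent", "child"]   # guardian, parent, own name
```

The hierarchical tail (`/user/parent/child`) is exactly the supervision chain of the logical path, which is why path lookups mirror the actor hierarchy.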

Actor Selection Usually, actor references are returned when an actor is created, but it is also possible to look up actors using their path. The returned actor selection can be used to send messages to the corresponding actors. However, in contrast to an actor reference, an actor selection is not bound to an actor's lifecycle. Additionally, an actor selection might contain more than one actor, as actor selections support wildcards to match actor paths. Sending an Identify message will prompt the receiving actors to answer with a message containing the receiver's actor reference. This method can be used to find individual actor references.


[Figure 2.5 – Akka actor paths, references, and addresses (based on [23]): within the actor system akka.tcp://sys@host:2552, the guardian (user), a parent (MyParent), and a child actor (MyChild) each combine an ActorCell/ActorContext (with .self, .parent, and .children accessors) with an ActorRef and an ActorPath; the resulting paths are akka.tcp://sys@host:2552/user, akka.tcp://sys@host:2552/user/parent, and akka.tcp://sys@host:2552/user/parent/child]

Actor Systems An important aspect of Akka actors is the implementation of the actor system. Itprovides execution resources and services which are used to enable execution of actors. Usually,the actor system is created with a configuration, which specifies how subcomponents areinitialized. The configuration is read from a user-defined file, but a default configuration isprovided by Akka, as well. Aside from initialization, the actor system is also responsible forshutting down and freeing up all resources it initialized during startup.

Many actor-system functions are delegated to subcomponents managed by the actor system.A major component for actor execution is the dispatcher. Actors are executed with the helpof threads. Those threads are managed by executor services which are provided by Java’sstandard library. The dispatcher contains an executor-service instance, which it uses to enablemessage processing of actors. Furthermore, an actor system is not limited to a single dispatcher,but can contain several dispatchers each with their own pool of threads.

Apart from the dispatcher, the actor system also manages a component called the actor-reference provider. Even though the actor system provides the interface for actor creation, it is actually the actor-reference provider which is responsible for creating actors. Special information about an actor's deployment can be provided in order to denote conditions such as remote deployment; this information is taken into account during the creation of that actor. The reference provider is also responsible for assigning actor paths. It manages the guardian actors and special actor references like the dead-letter mailbox. As the provider keeps a record of actor paths, it is also used to look up actors by path or name.

However, an actor system provides this functionality to local actors only. In the case of distributed execution, Akka uses multiple actor systems distributed over multiple machines. The circumstances and conditions of distribution are explained in the next section about Akka Remoting.


Akka Remoting

A crucial aspect of Akka is the distribution of actors, which is implemented as the Akka Remotingmodule [24]. This module adds a special type of actor reference which indicates that an actor ispart of a remote actor system. This module is also fundamental to this work’s implementation, as itserves as the basis for distribution and migration.

Distribution Transparency Distribution is a core part of Akka's design, as the interactions between actors are based purely on message passing and almost all tasks are executed asynchronously. This enables a transparent implementation of distribution without forcing changes on an actor's implementation. Remoting does not require direct changes to the source code; it is purely configuration based. All actor functions are independent of an actor's location (i.e., on a single machine or on a cluster of machines). This results in location transparency in Akka applications.
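The configuration-based nature of remoting can be sketched with the kind of settings an Akka application of that era (classic remoting over Netty TCP) might use to enable remoting and to deploy one actor remotely. The host addresses, ports, and the actor name are placeholders, not values from this work:

```hocon
akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
    deployment {
      # Deploy the actor named "worker" on a remote actor system,
      # without touching the application's source code.
      /worker {
        remote = "akka.tcp://backend@192.168.0.20:2552"
      }
    }
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = "192.168.0.10"  # address this actor system binds to
      port = 2552                # default remoting port
    }
  }
}
```

The application code creates the "worker" actor as usual; the deployment section alone decides that it ends up on the remote system.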

Remote Messaging Despite the distribution transparency, most of the restrictions of distributed systems are present nonetheless. Messages can and will get lost if the connection is interrupted or a machine crashes. Even when messages are not lost, they can still be delayed significantly, which needs to be considered when using remoting. Apart from that, messages have to be serialized in order to be sent to remote actors and thus messages have to be serializable.
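The serializability requirement can be illustrated with plain Java serialization, which is also Akka's default mechanism. The message type here is a made-up example; everything a message references must be serializable as well:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {
    // Hypothetical message type; all referenced data must be serializable too.
    static class Greet implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        Greet(String name) { this.name = name; }
    }

    // Serialize a message to bytes, as a transport would before sending it.
    static byte[] toBytes(Object msg) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(msg);
        }
        return bos.toByteArray();
    }

    // Deserialize on the receiving side.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Greet original = new Greet("world");
        Greet copy = (Greet) fromBytes(toBytes(original));
        System.out.println(copy.name); // prints "world"
    }
}
```

A non-serializable field anywhere in the object graph would make `toBytes` throw, which is exactly the failure a remote message with such a field would hit.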

Peer-to-Peer Communication Communication between actor systems on different machines is done in a peer-to-peer fashion. There are no clients or servers; all actor systems communicate on equal terms. This allows remote actor systems to create actors on a local system, just as a local system can create actors on a remote system. As a result, all involved actor systems have to be added to the configuration in order to make them available, as dynamically adding new remote systems is not supported. In a local-only scenario, messages are put into shared memory and can use the system's local representation, which avoids the need for serialization.

Serialization The remote scenario does not have access to shared memory; instead, messages are sent over a network to a remote system. Because of that, serializers are crucial components of a distributed Akka application. Everything inside a message has to be serializable, which includes all data structures referenced by the message. The Java serializer is used by default, but, as its performance is rather poor, it is recommended to implement custom serializers. This has a major effect on the performance of remoting, because serializing and deserializing is performed for every message sent to and received from remote actors. As messages for remote actors are sent over a network, their references and paths work differently compared to how they do in a local-only system, which is explained in Section 4.1.6.
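Custom serializers are registered and bound to message types purely through configuration. The following sketch uses Akka's configuration keys for this purpose; the class names are hypothetical:

```hocon
akka.actor {
  serializers {
    # Register a custom serializer implementation under a short name.
    my-serializer = "com.example.MyMessageSerializer"
  }
  serialization-bindings {
    # Serialize all messages of this type (and its subtypes) with it.
    "com.example.MyMessage" = my-serializer
  }
}
```

With such a binding in place, the default Java serializer is bypassed for the bound message types, which is where the performance gains of custom serialization come from.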

2.2.5 Benefits of Akka

The choice to use Akka instead of any other actor implementation was made because it is based on the JVM, has a large user base, supports distribution, and offers high performance compared to other actor implementations [14]. There are other popular implementations like Orleans [25], Vert.x [26], or Reactor [27], but they did not meet this work's needs. Using the JVM eased the development, as Java is widely used and a lot of information on it is freely available and accessible. Aside from that, the JVM is also widely used in enterprise systems and thus relevant for large-scale applications. Because Akka is the replacement for the deprecated Scala actor implementation, several Scala projects employ it as their actor implementation.


It is used as the base for other projects like the Play Framework [28] or Apache Spark [29] and is deployed in production environments. Its performance is also competitive compared to other actor implementations, so Akka was not a restricting factor in that regard. Akka implements the actor model by providing the corresponding building blocks (e.g., actors, mailboxes, addresses, and asynchrony) and supplementing them with a concept for fault handling in the form of hierarchy and supervision. By strictly separating the public communication interface from the actual actor object, Akka achieves isolation of state and restricts interaction to message passing. Although addresses are not directly implemented, actor references, actor paths, and actor selections provide a conceptually sound replacement for the address functionality under different circumstances. The Akka actor system manages resources such as threads and utilizes configuration by declaration to set up its environment. Distribution is implemented in a transparent way, resulting in distribution-agnostic applications without requiring changes to their implementations.

The actor model alone does not help with saving energy in any way. In order to do that, methods to limit the energy consumption of a system and methods to monitor energy values are needed. These methods are described in the next section and provide another basis which is used together with Akka and the actor model to achieve lower energy consumption.

2.3 Energy-aware Computing

Energy consumption makes up a large portion of the operating costs of data centers [30, 31]. Even when machines are idle and not in use, they still consume a substantial amount of energy. They are kept alive because they are needed as hot-standby resources or contain data which is required by other machines. Ideally, a server would consume almost no energy when no work is done (e.g., when its provided service is not processing any requests); and when the load increases, the server would only use as much energy as is required to process the additional load. This ideal state is called energy proportional [31]. There are different approaches to improve energy proportionality. The most direct approach would be to engineer energy-proportional computing hardware; processors in particular have made large strides towards that goal [32, 33, 34, 35]. However, hardware development requires a lot of time and knowledge, which makes this approach viable only for major hardware vendors. Software design has a large influence on a machine's energy consumption as well. Software should only use as many resources as it needs to process its workload. Often, resources are consumed without being needed, which decreases efficiency. This section features approaches to decrease energy consumption while also increasing a system's energy proportionality. For this purpose, energy-measurement methods are introduced first. They provide the data needed to monitor the effectiveness of this work's approach, and this data is also used by this work's implementation for its decision making.
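The gap between energy-proportional and real behavior can be made concrete with a small numerical sketch. The wattages below are illustrative, not measurements from this work; a real server's power draw is modeled here with a simple linear static-plus-dynamic split:

```java
// Illustrative numbers: a server idling at 100 W and drawing 200 W at full load.
public class EnergyProportionality {
    // Ideal energy-proportional power: scales linearly from 0 W at zero load.
    static double idealPower(double peakWatts, double utilization) {
        return peakWatts * utilization;
    }

    // Simple linear model of a real server: the static part is drawn even when idle.
    static double actualPower(double idleWatts, double peakWatts, double utilization) {
        return idleWatts + (peakWatts - idleWatts) * utilization;
    }

    public static void main(String[] args) {
        double util = 0.3; // a typically low data-center utilization level
        double ideal = idealPower(200, util);        // 60 W
        double actual = actualPower(100, 200, util); // 130 W
        System.out.printf("ideal=%.0f W, actual=%.0f W, overhead=%.0f W%n",
                ideal, actual, actual - ideal);
    }
}
```

At 30 % utilization, the modeled server draws more than twice the energy-proportional ideal, which is exactly the inefficiency the approaches in this section target.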

2.3.1 Energy Measuring Methods

Hackenberg et al. show three different measuring methods in [36]. The first method is AC instrumentation. They attach a measuring device between the machine's power supply and the electrical outlet. This enables measurement of the energy consumption of a whole system including the CPU, mainboard, memory, hard disks, and the power supply unit (PSU). A similar method is used by this work, although a different measuring device is used (see Section 4.5.2).

AC instrumentation provides data about the power consumption in watts, as well as the energy consumption in joules. The second method is DC instrumentation. It is done by attaching a power meter between the PSU and the power connectors on the mainboard. This provides power and energy information of the processor, mainboard, and memory. However, DC instrumentation is not used by this work, as it would require soldering and special equipment which is beyond the scope of this work. Another reason not to use this method is that it is usually not provided inherently and requires customization of hardware components. Thus, it would not be available to most users and not useful as an information source in a production environment.

The last method is using processor functionality. Unfortunately, not all processors provide methods to measure or estimate their energy consumption. Intel and AMD CPUs provide an interface which contains information about energy consumption. Intel's interface is called Running Average Power Limit (RAPL) and is provided by the model-specific registers (MSRs) of Intel processors. Although its main purpose is to set a limit on the processor's power consumption at runtime, it can also be used to gather energy data. In the CPU generation in which RAPL was introduced, its values were still based on an energy model and not the result of an actual measurement. This changed with the introduction of the Haswell processor architecture; since then, the values have been based on actual measurements [37]. The values are split into different domains depending on the CPU type. All CPUs provide information about the energy consumption of the whole CPU package (PKG domain), as well as how much is consumed by the CPU cores (Core domain). Additionally, server CPUs provide information about the consumption of memory (DRAM domain), while desktop CPUs contain energy information about the internal graphics component (Graphics domain). This method is also used as an information source in this work's implementation. AMD processors provide a similar mechanism called Application Power Management (APM) [38]. APM is mainly used for thermal-design-power (TDP) limiting. Energy data is not directly available, but has to be calculated using several northbridge registers which contain the processor's current TDP value. This method was not incorporated into this work's implementation, as only Intel processors were used.
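RAPL exposes its energy data as a 32-bit counter (e.g., in MSR_PKG_ENERGY_STATUS) that increases monotonically and wraps around; consumed energy is obtained as the difference of two samples scaled by an energy unit read from MSR_RAPL_POWER_UNIT. The sketch below shows this computation; the energy unit used is a common default, and the raw counter readings are made up:

```java
// Sketch: computing consumed energy from two RAPL energy-counter samples.
// The counter is a 32-bit value that wraps around, so the delta must be
// taken modulo 2^32. The unit (commonly 2^-16 J, i.e., ~15.3 microjoules)
// is read from MSR_RAPL_POWER_UNIT in practice; here it is hard-coded.
public class RaplDelta {
    static final double ENERGY_UNIT_JOULES = Math.pow(2, -16);

    // Joules consumed between two raw 32-bit counter readings, wrap-safe.
    static double joulesBetween(long before, long after) {
        long delta = (after - before) & 0xFFFFFFFFL; // modulo 2^32 handles wraparound
        return delta * ENERGY_UNIT_JOULES;
    }

    public static void main(String[] args) {
        // Normal case: the counter advanced by 655360 units, i.e., 10 J.
        System.out.println(joulesBetween(1_000_000L, 1_655_360L));
        // Wraparound: the counter overflowed past 2^32 between the samples.
        System.out.println(joulesBetween(0xFFFF_FF00L, 0x0000_0100L));
    }
}
```

Because the counter wraps fairly quickly under high load, any monitor built on RAPL has to sample it often enough that at most one wraparound occurs between samples.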

2.3.2 Reconfiguration for Energy Efficiency

Usually, an application’s resources are assigned before the start of that application. As alreadymentioned, workloads change with time and resource demand changes as well. Adjusting a machine’sprocessing capacity to match the demanded work is another approach to lower a system’s energyconsumption [39, 40, 41, 42, 43]. This approach can be referred to as scaling locally or vertically, asa machine’s performance is increased, which scales the machine up, or decreased, which scales themachine down. A machine’s resources are adjusted to the actual demand of the running applicationsby checking how much processing resources are required. Techniques like dynamic voltage andfrequency scaling (DVFS) are used to tweak the CPU’s voltage and clock frequency. Additionally, CPUcores can be deactivated to save energy and processor functions like RAPL provide more fine-grainedcontrol of power limits (see Section 4.4.4). At the same time, information about application load,system load, and energy is required to decide whether resources should be increased or limited.When the load is high, more resources are released, while lower load results in more resourcelimitation. This work’s implementation exploits configurations to lower energy consumption, aswell. These configuration methods include RAPL to set power caps on the CPU, deactivation of CPUcores, and limitation of available application threads. Furthermore, many information sources areincluded to allow monitoring of vital information like the system and CPU energy consumption,number of executed instructions, and load values on system and JVM level.


2.4 Distribution and Migration

Other approaches increase energy efficiency by placing different applications on the same machine in order to achieve a higher average load [44, 45, 46]. This can be referred to as scaling horizontally, because more machines are added to the resource pool. As servers in a cluster are often at low levels of utilization, consolidation of resources has a large potential for savings in computing resources and energy. The placement of an application on different machines thus has a considerable impact on its energy and resource efficiency. Apart from the initial placement of applications, they can also be moved after running for a while, which is called migration.

2.4.1 Migration Levels

Migration can be done on different levels, as illustrated in Figure 2.6. [47] shows migration on a task or process level, which takes place on a local CPU only. Whole applications can also be moved from one machine to another. They are either moved directly, which needs to be supported by the application, or they are moved inside virtual machines (VMs), which contain both the application and its state. There is also migration on the application level, which means that only parts of an application are moved, whereas other parts remain in place. However, application-level migration is not limited to parts of an application, but can also be used to move the application in its entirety.

2.4.1.1 Virtual-machine Migration

Migration on the level of VMs is a widely used method to scale applications, as VMs are commonly used to execute applications in data centers [48, 49, 50, 51]. VMs are convenient, as applications can be contained within a VM and all required resources and runtime elements are bundled in an isolated container. VMs are often used in synergy with cloud computing, which provides a seemingly infinite amount of resources for VMs. Aside from the amount of resources, cloud infrastructures also allow elastic scaling according to current user demands. Furthermore, using VMs allows operators of data centers to utilize their available computing resources to the fullest, while still maintaining a satisfying service quality for customers.

Different strategies are employed to determine where an application or VM should be placed or migrated to. Some works focus on quality of service (QoS) [50] to ensure that certain requirements are not violated when the execution context of an application changes. A QoS metric is, for example, the latency, which reflects the time needed by a service until a response is sent to the client. In this scenario, application placement is entirely dependent on whether the latency of a service can be maintained. Other approaches monitor the machines' load levels and move applications and VMs accordingly. Most of the time, applications are put on as few machines as possible to improve the average load of each machine, so that the total energy consumption is reduced. Another aspect of interest is the application itself. Many applications are built in a modular manner and thus divided into several subcomponents. These components depend on each other, which restricts the possible placement options for those components. Furthermore, they might require certain resources or execution platforms, which puts additional restrictions on their placement. In order to scale an application with VMs, the application is started in multiple instances. These instances are coordinated to provide a service with scaling capabilities.

Figure 2.6a shows the migration of VM1 from machine A to machine B. The application in the figure is the same on all VMs; however, each VM contains its own instance. Each machine has a hypervisor which manages the VMs on that machine. Before the migration, machine A contains two VMs (VM0 and VM1), while machine B contains only one VM (VM2). After the migration, machine A is left with one VM (VM0) and machine B now holds two VMs (VM1 from machine A and VM2).

[Figure 2.6 – Migration on VM- and application(actor)-level: (a) VM-level migration; (b) application(actor)-level migration. The result of each migration is shown at the bottom of each subfigure.]

2.4.1.2 Actor Migration

In this work, migration is performed at the level of application parts. This is made possible by using the actor model, as it allows moving the individual actors or actor groups which make up the application. This kind of migration is at a lower level than migration of VMs or full applications, as migration at those levels is costlier in terms of memory and time compared to moving actors. Moreover, actor migration can be viewed as a type of container migration where an actor is moved from one actor execution environment to another, just as a VM can be moved from the hypervisor of one machine to the hypervisor of another machine. Actor migration is also similar to the earlier-mentioned approach where tasks and processes are migrated between processors. As actors contain isolated state and functions, they can be seen as tasks at a higher level of abstraction. Therefore, they can be moved between processors and they can also be relocated inside a distributed system. Because of those aspects, actor migration strikes a compromise between ease of migration and granularity. However, while VMs can contain any application, actor systems require applications to incorporate the actor model. Thus, while actor migration is not applicable to all existing applications, migration at VM level can handle any kind of application. Switching to the actor model might require a complete reimplementation, which can be costly and time-consuming. In return, migration at actor level offers other advantages such as location and distribution transparency, as those characteristics are core concepts of the actor model. Another disadvantage of VMs is that application scaling with VMs is not trivial. While moving a VM to a more powerful machine is simple, adding or removing VMs in order to scale an application is challenging and cannot be done without information and support from the application developer. In contrast, actors can be scaled much more simply, as the number of actors can be reconfigured at runtime and many actor runtime environments provide mechanisms to balance the load automatically.

Figure 2.6b illustrates the migration of an actor from one machine to another. All actors in Figure 2.6b are part of the same application. Each machine hosts an actor system which manages the application's actors. Before the migration, machine A contains three actors (Actor0, Actor1, and Actor2), while only two actors (Actor3 and Actor4) are executed on machine B. After the migration, machine A is left with Actor0 and Actor1, while machine B now hosts three actors (Actor2, Actor3, and Actor4). In conclusion, migration can be used to potentially lower energy consumption and improve energy efficiency. Even though different levels of migration exist, actor-level migration was chosen for this work's implementation, as it allows fine-grained control while keeping the complexity of migration at a reasonable expense.

Class        Platform              Power range   Workload properties
Low-range    ARM/Intel Atom        ≤ 10 W        Low number of requests
Mid-range    Low-power x86 CPUs    10 W–25 W     Moderate number of requests
High-range   Desktop/server CPUs   ≥ 25 W        High number of requests

Table 2.1 – Device classes


2.4.2 Heterogeneous Hardware

Combining heterogeneous hardware with migration increases energy efficiency even further, as different scenarios have varying performance requirements. Using matching hardware can thus satisfy the performance requirement while also consuming less energy than using only one type of machine.

Breakdown of Energy Consumption The amount of energy consumed by a machine is made up of a static part and a dynamic part. The static part is always consumed when the machine is powered on; it is the bare minimum required to keep the machine on without performing any work. In contrast, the dynamic part changes depending on the performed work. A machine at full load can use more than double the energy it uses while idle. This means that the dynamic part can be much larger than the static part, but the dynamic part can be adjusted to the workload level, while the static part cannot be reduced in any way.

Device Classes A lot of energy can be saved by using machines with a lower static energy footprint. These types of machines also have a comparably low dynamic energy consumption. For this purpose, this work proposes three classes of machines, shown in Table 2.1, which are introduced in the following.

– Low-range Class The low-range class has the lowest energy consumption, but its performance is also very low. This class is characterized by a maximum power output of 10 W and lower. Thus, low-range devices are ideal for small workloads (i.e., a low number of requests). Devices using ARM or Intel Atom processors are typical members of this device class. Aside from the processors, these devices usually do not contain a lot of memory or storage, which limits their uses. However, they are exceptionally attractive as computing devices because of their particularly low energy consumption.

– Mid-range Class Higher workloads require more powerful machines, as low-range devices reach their performance limit rather quickly. This domain is filled by the mid-range class. It provides moderate performance while requiring less energy than the largest systems. The power draw of mid-range devices ranges from about 10 W at idle to about 25 W at full load. These machines are suitable for moderate workloads which cannot be handled by low-range devices, but do not warrant usage of the strongest machines. Typical devices of this class use low-power x86 processors such as the Core U-series by Intel. Similar to low-range devices, mid-range devices are also limited in their features, which means that only a small amount of memory and storage space is included in those machines. Although they are weak compared to machines with full-fledged desktop or server processors, mid-range devices provide substantially better performance than low-range devices, while requiring only about half of the energy of the most performant devices. The Next-Unit-of-Computing (NUC) form factor by Intel serves as the model for the mid-range class, as it utilizes low-power components which can still provide decent performance.

– High-range Class Lastly, the high-range class covers all workloads which cannot be handled by mid-range devices anymore. These machines operate at a maximum power output of 25 W and above. Typically, high-range devices use desktop and server processors, which provide the highest possible performance. Furthermore, these devices can contain large amounts of memory and storage, which contribute to their comparably higher energy consumption. They are able to process high workloads, but their energy consumption is considerably higher than that of the other classes. As this class covers a wide range of devices, machines belonging to it can differ significantly; thus, the high-range class is only used to classify machines surpassing the mid-range class.

These different machine classes enable this work's implementation to adapt to different workloads by matching the workload to a combination of devices. As the classes are heterogeneous, matches can be made more accurately than with homogeneous devices. This can be combined with configuration of a machine's resources as described in Section 2.3.2 to match workloads even better and in turn save more energy.

The device classes described in this work are not absolute and their boundaries are variable. They can differ depending on the application's purpose; other applications might benefit from other device classes with different properties. In addition to that, the available hardware also influences the classification. Naturally, this work's device classification cannot consider all available hardware. In particular, newly released devices with new or different properties are often not considered. These devices could make current classes obsolete or add new classes to the classification. Especially uncommon and special hardware could provide advantages which would warrant the addition of more classes. Thus, the device classes show how the devices in this work were chosen and labeled, but they should not be seen as a definitive classification of these devices.

2.5 Conclusion

This chapter describes the fundamental concepts and technologies upon which this work's implementation is built. The actor model provides a high abstraction of concurrent programming while presenting a way to separate functionality from the resources which are used for execution. Actors serve as containers that can be moved and migrated in a simple way, similar to how VMs are migrated between different machines; however, actor migration is faster and requires less data to be transferred. Techniques such as energy monitoring are used to regulate and configure a machine in order to minimize energy consumption while maintaining the resources required to process the currently applied workload. By using heterogeneous hardware and actor migration, even better energy efficiency is achieved, as workloads can be matched more precisely. These methods are then implemented using the Akka toolkit, which is in widespread use, as it is based on the JVM and used by many enterprise applications. This results in a potentially large audience and enables working with familiar technology. The next chapter presents the design and concept of this work's energy-aware middleware platform, which incorporates the techniques and methods covered in this chapter.


3 Design and Architecture

The ReApper platform proposed in this work lowers energy consumption and improves energy efficiency by means of configuration, information monitoring, and distribution of application components on heterogeneous hardware. The fundamentals for these means have been established in the previous chapter, while this chapter focuses on the design and concept of the ReApper middleware platform.

First, the design is described. Its main focuses are the monitoring of information and the options for configuration. Information monitoring and configuration are performed on four levels of a machine: the hardware level at the bottom, the OS level above it, the platform level above that, and the application level at the top. Functions at the hardware level provide access to low-level aspects of a machine like the CPU. The OS level is used to allow control of and give information about resources directly managed by the OS. The platform level makes resources and data about the used runtime environment, such as the JVM, available. At the application level, details about specific application parameters are given and reconfiguration inside the application is performed. Using all available information, configuration is performed at the aforementioned levels in order to achieve different goals, such as saving energy, reaching certain performance levels, or providing a particular quality of service.

Second, several facets of Akka are revisited and explained in greater depth, as the implementation exploits those features in order to enable reconfiguration and information collection. Actor systems and their subcomponents are revisited, as their modification enables reconfiguration of platform resources such as threads. Modifying and exploiting features of other components enables migration at the application level, i.e., actor migration. The structure and implementation of actors are used to add migration capabilities. Actor supervision and monitoring provide information about an actor's state as it is being migrated. The capabilities of Akka Remoting are used to distribute the actor system over multiple machines. This more detailed view of several aspects of Akka is required to explain the restrictions and functions of the following implementation.

Third, the next section gives more details about the ReApper middleware platform. As the functions are divided into several modules, each of those modules is described in detail. The configurator provides configuration methods, while the gatherer is used to access the information gathered from different parts of the system. Their functions are implemented with the help of various techniques like manipulation of CPU registers, platform interfaces, and OS tools. The component responsible for actor migration is strongly interwoven with Akka and exploits several of its features to enable migration. Furthermore, the usage of those components is also described, as the purpose of ReApper's implementation is to provide a platform for applications. In order to understand the implementation details, the design and concept of ReApper's implementation are described in this chapter.

Lastly, the aspect of distribution is added to configuration and information monitoring. The executed application is distributed over heterogeneous machines, which adds another dimension to the application's configuration. This results in a mechanism for actor migration, which enables parts of the application, or the application as a whole, to be moved to other machines. This configuration option is then used by the platform to match a machine's resources to the application's workload requirements, which allows shutting down unused resources and using more energy-efficient hardware for lower levels of workload.

3.1 Concept

This section describes the design and concept of ReApper's middleware platform. The main goal of the ReApper platform is to reduce energy consumption while keeping the performance and quality of an application as close to original levels as possible. That means that an application executed with ReApper's customized Akka runs at a similar level of performance as when it is executed with an unmodified version of Akka. Even though the performance of ReApper's customized Akka is slightly lower, as additional computation and processing is performed, ReApper's implementation minimizes the impact of these factors. Thus, a level of performance comparable to pure Akka is still achieved.

Another goal is to keep as much of the Akka API as possible, so that the platform requires no or only minor modifications to an application's source code in order to support the added energy-saving functions. This allows application developers to adapt their existing actor applications to this platform with little effort. It also keeps applications written for ReApper's platform compatible with Akka. However, using additional features like configuration and information monitoring requires input by the application developer. Application developers either use these functions manually or provide thresholds and parameters which are taken into consideration when reconfiguration is performed.

In compliance with the goal of lowering energy consumption, configuration and information monitoring are employed. This divides the design into two halves which are both used to ultimately adapt to changing circumstances. Information is evaluated and either a different configuration is applied to match the requirements and demands, or the current configuration is retained because requirements and demands are still satisfied. The main indicator for requirements and demands is the workload which is currently applied to the application. Consequently, limits on resources are lifted when the applied workload increases, while the usage of resources is restricted further when the workload decreases. The information provided by the platform is used to determine the workload and energy consumption.

Distribution provides another tool for configuration, which is used to scale the application. It is combined with the other configuration tools and information monitoring in order to further lower the energy consumption of applications. Heterogeneous hardware possesses properties which can be exploited to improve the energy efficiency of applications. Applications executed on the ReApper platform can be moved to more energy-efficient hardware when the required computing resources are low. Higher levels of workload can be handled by moving the application to more powerful machines; however, these machines also use more energy. Additionally, configuration can be applied to these different types of machines individually, and information from each machine is evaluated to control the application's movement. Thus, the provided resources are more


closely matched to the actual demand experienced by the application, and overall energy consumption is reduced.

The next section gives an overview of the configuration and information-monitoring aspects on a single machine.

Design Overview

Configuration and information monitoring are performed on four different levels. Figure 3.1 shows the different levels with configuration tools on the left-hand side (in Figure 3.1a) and the information architecture on the right-hand side (in Figure 3.1b). The hardware level mainly utilizes power capping, which is performed by exploiting CPU features. This level also provides information about energy and power consumption on one hand and information about performance on the other. The next higher level is the OS level. This is where configuration is performed by disabling and enabling CPU cores and by pinning individual threads to certain CPU cores. Regarding information, this level provides information about the system's load from the view of the OS. Additionally, information about processes other than the application run by the platform is also available. Above the OS is

Figure 3.1 – Architecture of configuration and information monitoring. (a) Configuration architecture, by level: application level: actor migration; platform level: thread number, actor migration; operating-system level: thread pinning, core control; hardware level: power caps. (b) Information-monitoring architecture, by level: application level: throughput, QoS information (latency); platform level: load information (process, threads); operating-system level: load information (system); hardware level: energy and power information, performance information.


the platform level. As the platform manages resources for the application, it can be used to control the number of threads. Other than that, the platform also provides load information about its own process and threads. The last level is the application level. Actor migration is performed by the platform level; however, the application level is affected the most, as migration has major impacts on the application design as well as on the application's operation and performance. Furthermore, the information available at the application level is essential for the reconfiguration process, as it indicates whether the application is performing at an acceptable level. At this level, any information about the application's performance, such as throughput or other QoS indicators like latency, is of interest. As these aspects are logically related and their levels of abstraction are similar to each other, they were divided into these four layers. The following sections present a more detailed explanation of each of these levels, while also discussing why the methods used on those levels were chosen.

3.2 Hardware Level

The hardware level is the lowest level in a system. Usually, features on this level access the hardware directly without much abstraction or redirection. A machine is generally made up of several components. At the very least, a machine contains a processor, mainboard, memory, storage devices (e.g., SSDs, HDDs), and a power supply. In order to save energy, only components which are necessary for the operation of applications should be installed. Furthermore, using more energy-efficient hardware will also lower the energy consumption. However, there will be impacts on the performance, as lower energy consumption is often coupled with lower performance.

Configuration

Configuring these components when high performance is not required can lower energy consumption even further. However, most components do not support such configurations. In this work, the only component of interest is the processor. Although configuration options are also offered by other components, their impact was deemed to be too low for the kind of applications in this work. Configuration at this level can potentially save a lot of energy, as the processor is a major energy consumer in a machine. A function which most processors support is DVFS [52]. It lowers a processor's power consumption and thus also energy consumption by reducing the frequency at which the processor is operating. A lower frequency allows the processor to be operated at a lower voltage. Aside from DVFS, some processors also provide an interface to limit the power consumption to certain values. This method is called power limiting or power capping. It limits the energy consumed by the processor as long as that limit is active and allows operation of the processor at a very low power level while only low performance is demanded. As soon as higher performance is required, a higher limit can be set in order to satisfy the demand. Limiting the processor to a level of power consumption which is just enough to handle the application's workload prevents the processor from providing more performance than needed and using more energy than required for that purpose. This results in a large potential for energy savings. On top of that, setting a power cap takes effect almost instantly, which allows quick adjustments to changing circumstances. Currently, the power-capping feature is not available on all platforms. The type of machines on which this feature can be used is still limited, but a few manufacturers have started to adopt this technology, and its availability is steadily rising.
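As a concrete illustration, on Linux such caps are commonly exposed through the powercap sysfs tree (e.g., Intel RAPL). The sketch below is a hedged example, not ReApper's actual implementation: the sysfs path and the idea of capping package 0 are assumptions about a typical Linux machine, and writing the limit requires root privileges.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: setting a package power cap via the Linux powercap sysfs
// interface (Intel RAPL). The path below is an assumption about a
// typical system; the unit conversion is the only portable part.
class PowerCap {
    // RAPL expresses power limits in microwatts.
    static long toMicrowatts(double watts) {
        return Math.round(watts * 1_000_000.0);
    }

    // Hypothetical helper: write the cap for package 0, constraint 0.
    static void applyCap(double watts) throws IOException {
        Path limit = Path.of(
            "/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw");
        Files.writeString(limit, Long.toString(toMicrowatts(watts)));
    }

    public static void main(String[] args) {
        // E.g., cap the package at 15 W.
        System.out.println(toMicrowatts(15.0)); // 15000000
    }
}
```

Raising the limit again is the same single write, which is why a cap can react to workload changes almost instantly.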


Information Monitoring

The information available on this level is also mainly provided by the processor. Other components do not have an interface to provide information. In this work, these components are assumed to consume energy at a static rate, as detailed changes in their energy and power consumption cannot be captured. Some processors are able to provide information about energy and power consumption. This information can be used to monitor the energy consumption and to verify whether configuration measures have taken effect. It is also essential for determining the energy consumption of different configurations. As this information is collected directly by the processor, it does not require any additional components or devices. However, not all processors provide this feature, and once again not all machines can make use of this.
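A sketch of how such processor-provided counters can be turned into power figures: RAPL-style counters report cumulative energy in microjoules and wrap around at a documented maximum, so two samples must be compared modulo that range. The wrap range used below is an assumed placeholder; real systems expose the actual value (e.g., in sysfs).

```java
// Sketch: deriving average power from a RAPL-style cumulative energy
// counter (microjoules). The counter wraps at a maximum range, so two
// samples must be compared modulo that range. The range value here is
// an assumed placeholder.
class EnergyMonitor {
    static final long MAX_ENERGY_RANGE_UJ = 262_143_999_999L; // assumed

    // Average power in watts between two counter samples taken
    // elapsedSeconds apart, accounting for one counter wraparound.
    static double averageWatts(long startUj, long endUj, double elapsedSeconds) {
        long deltaUj = endUj >= startUj
            ? endUj - startUj
            : (MAX_ENERGY_RANGE_UJ - startUj) + endUj + 1;
        return (deltaUj / 1_000_000.0) / elapsedSeconds;
    }

    public static void main(String[] args) {
        // 4 J consumed over 2 s -> 2 W.
        System.out.println(averageWatts(1_000_000, 5_000_000, 2.0)); // 2.0
    }
}
```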

However, another method can be employed to collect energy and power information. An energy-measuring device can be used to measure a system's total energy consumption, which is described in more detail in Section 4.5.2. The information provided by the measurement device represents only a snapshot of the current energy and power consumption of the measured machine. In contrast to the information provided by the processor, which is accumulated automatically, this information is not collected automatically and is thus fetched from the measuring device in regular intervals. This method of information collection is particularly important for applications which make heavy use of resources other than the processor. For example, I/O-bound applications are more demanding on network storage devices. Thus, their main energy consumer is not the processor, but rather these I/O devices (i.e., the network card and the actual remotely accessed storage device). Of course, this method requires additional hardware, which is not always available.
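A minimal sketch of this polling approach, assuming the meter only reports instantaneous power: energy is then approximated by integrating the snapshots over the polling interval. The sample values are illustrative only, and the device's actual protocol is left out.

```java
// Sketch: integrating periodic power snapshots from an external meter
// into an energy estimate. Since the meter only reports instantaneous
// power, energy is approximated as power * interval between polls.
class MeterPoller {
    static double estimateJoules(double[] powerSamplesWatts, double intervalSeconds) {
        double joules = 0.0;
        for (double watts : powerSamplesWatts) {
            joules += watts * intervalSeconds;
        }
        return joules;
    }

    public static void main(String[] args) {
        // Three 1-second samples at 50, 52, and 48 W -> 150 J.
        System.out.println(estimateJoules(new double[] {50, 52, 48}, 1.0)); // 150.0
    }
}
```

A shorter polling interval trades measurement overhead for a better approximation of fluctuating power draw.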

Aside from energy information, information about performance is often also provided by the processor. Usually, this information consists of the number of executed instructions and the instructions executed per clock cycle. This is useful to determine in detail how much work a processor does.

All this information about energy, power, and performance is on the scale of a whole machine and is thus impacted by all the software which is executed on the machine. Although there are other factors which can affect this information, such as the OS and system services, the application usually has a major impact on energy and performance. Thus, the performance information at the hardware level provides not only a view of the system, but also of how the application affects attributes such as energy consumption or the system's performance.

After showing how configuration is performed and information is collected at this level, the following section describes the next higher level, which uses more abstract methods and provides a more abstract view of a machine.

3.3 Operating-system Level

The second-lowest layer in a system is the OS level. Generally, this level abstracts the underlying hardware and provides access to higher-level software components. Resources such as processes, threads, storage devices, and network are also managed by the OS. Of special interest is the management of processes and threads provided by the OS. Processes or threads are requested from the OS in order to provide resources for applications. Consequently, the information about all processes and threads is always kept up to date by the OS, as it is responsible for allocating them to different applications.


Configuration

At this level, two aspects of configuration are of interest. The first configuration option is adjusting the number of CPU cores available to applications which are currently running on the OS. This effectively disables CPU cores, as no work is put onto a core when it is not used by the OS to execute applications. This also impacts the available performance in a major way, because fewer operations can be executed concurrently. However, disabling CPU cores also lowers the system's energy consumption. Disabled cores execute no instructions and require no additional power, because they are not in operation. Consequently, their dynamic energy consumption is reduced to zero, but static energy consumption is still present. Usually, CPU cores are deactivated when the workload is so low that the performance provided by fewer cores is sufficient to process the workload. When the workload increases again, previously deactivated cores are reactivated and more operations can be performed concurrently.
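On Linux, cores can be taken offline and brought back through the CPU hotplug files in sysfs; the sketch below assumes that interface. The path layout is the conventional one, writing requires root privileges, and cpu0 typically cannot be disabled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: enabling/disabling CPU cores through the Linux sysfs CPU
// hotplug interface. The path layout is an assumption about a typical
// Linux system; writing requires root, and cpu0 usually stays online.
class CoreControl {
    static Path onlineFile(int cpu) {
        return Path.of("/sys/devices/system/cpu/cpu" + cpu + "/online");
    }

    static void setOnline(int cpu, boolean online) throws IOException {
        Files.writeString(onlineFile(cpu), online ? "1" : "0");
    }

    public static void main(String[] args) {
        System.out.println(onlineFile(3)); // /sys/devices/system/cpu/cpu3/online
    }
}
```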

The other configuration option is pinning threads to CPU cores. Pinning a thread to a CPU core means that the thread is set to be executed on a certain CPU core. This process is also called setting a thread's affinity. Thread pinning is particularly important in the context of disabling CPU cores. As disabling cores removes them from the pool of available CPU cores, threads are then often assigned to other cores which are still active. When a disabled core is reactivated, most threads will still be assigned to other CPU cores. Even though the reactivated core is available again, threads are not scheduled to be executed on that core. Aside from that, thread pinning can improve the efficiency at which threads are executed. By pinning a thread to a CPU core, that thread will almost always be executed on that core. This reduces context switching, as threads are not moved to other CPU cores as often as before. Furthermore, spreading threads over all processor cores can be used to equalize the load across all processor cores. Besides, certain threads can be pinned to a dedicated CPU core in order to prioritize that thread's execution, while other threads are assigned to the remaining cores. These two tools enable powerful and fine-grained control over OS resources.
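The JVM offers no portable affinity API, so a platform running on Linux might fall back to the taskset utility with a native thread id. The sketch below assumes that route; obtaining the native thread id (e.g., via JNI or /proc) is left out and passed as a plain parameter.

```java
import java.io.IOException;
import java.util.List;

// Sketch: pinning a thread from the JVM. Java has no portable affinity
// API, so one pragmatic option on Linux is to invoke the taskset
// utility with the native thread id of the thread to pin.
class ThreadPinner {
    // Build the taskset invocation for pinning native thread `tid`
    // to the CPU set given as a list string, e.g. "0,2".
    static List<String> pinCommand(long tid, String cpuList) {
        return List.of("taskset", "-pc", cpuList, Long.toString(tid));
    }

    static void pin(long tid, String cpuList) throws IOException, InterruptedException {
        new ProcessBuilder(pinCommand(tid, cpuList)).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        System.out.println(pinCommand(1234L, "0,2"));
    }
}
```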

Information Monitoring

The OS level provides information about a system's load and running processes and threads. As processes and threads are created by requesting them from the OS, this information is kept and maintained by the OS. It is useful to determine how much load a system is experiencing from the OS's point of view. Naturally, the OS is also impacted when configuration tools limit resources. This can be tracked by checking the load values generated by the OS. Similarly to information from other levels, the system load indicates the level of use which the total available resources of a machine are experiencing. That can be used to determine whether additional resources are required to guarantee smooth operation or whether limits on resources can be imposed to lower the energy consumption. Additionally, the system load can be broken down into the contributions of individual processes and threads and their impact on the system's total load.
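On Linux, the system-wide load described here can be read from /proc/loadavg, whose first three fields are the 1-, 5-, and 15-minute load averages. A minimal sketch, assuming that file format:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: reading system load from Linux's /proc/loadavg. The first
// three whitespace-separated fields are the 1-, 5-, and 15-minute
// load averages.
class SystemLoad {
    // Parse the 1-minute load average out of a /proc/loadavg line.
    static double oneMinuteLoad(String loadavgLine) {
        return Double.parseDouble(loadavgLine.trim().split("\\s+")[0]);
    }

    public static void main(String[] args) throws IOException {
        String line = Files.readString(Path.of("/proc/loadavg"));
        System.out.println(oneMinuteLoad(line));
    }
}
```

Per-process figures would come from /proc/&lt;pid&gt;/stat in a similar fashion.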

After going over configuration and information on the OS level, the following section deals with the platform level, which bridges the gap between the OS and the application.

3.4 Platform Level

In this work, the platform is the environment in which the application is executed. The fundamental platform used for ReApper is the JVM. It provides an additional abstraction layer between the OS and the application. The purpose of this layer of abstraction is to provide independence from the underlying architecture. Furthermore, threads are also abstracted and managed by the platform.


As mentioned in the previous section, processes and threads are requested from the OS. Native applications use OS interfaces directly to allocate additional processes or threads and manage them themselves. In this context, native applications are applications written specifically to target an OS and compiled for a specific architecture. In contrast to native applications, applications written for the JVM use interfaces of the JVM to request threads. The requested threads, however, are managed by the JVM, so that developers can use them through a more abstract point of view. This relieves them from having to deal with different interfaces on different OSes. Furthermore, the JVM can also manage threads as a pool of resources to which the application can submit tasks.

Akka could also be viewed as being part of the platform level. However, in the context of configuration and information monitoring, Akka is not a platform in itself, as it only provides a framework and abstraction upon which the application is built. Here, the platform level is understood as an environment which actually manages resources and enables execution. Akka, and by extension the ReApper platform, provides a logical platform which adds the abstraction of actors as a computation unit. Consequently, Akka is neither responsible for allocating resources nor for the actual execution of actors. Nonetheless, Akka uses tools provided by the JVM and the Java standard library to commission the execution of its actors, and tools like the executor services used by Akka are configured at the platform level.

Configuration

The configuration of threads is also called thread management in this work. Thread management is used to control the number of threads available to the application which is running on the platform. A large number of threads can deteriorate the performance of the application, as the frequency of thread switching increases so that each thread can make progress. However, a low thread number causes concurrency to decrease, which can also impact the application's performance. As resources and workloads change constantly, the optimal number of threads changes as well. The maximum capacity of resources can be utilized by readjusting the number of threads to achieve performance improvements. This kind of thread management is particularly important in connection with the deactivation of processor cores. When processor cores are deactivated, the number of context switches increases, as the same number of threads is scheduled to be executed on a lower number of cores. Consequently, core deactivation and reactivation often additionally requires thread management. Finding the optimal number of threads is a complex task, and it is often highly dependent on the application of interest.
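On the JVM, such run-time thread management can be sketched with java.util.concurrent.ThreadPoolExecutor, which allows resizing a live pool. The resize policy below, and the idea of driving it from core counts, is illustrative rather than ReApper's actual implementation.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: adjusting the thread count of a JVM thread pool at run time.
// A platform-level manager could drive resize() from the number of
// active cores and the observed load.
class ThreadManager {
    final ThreadPoolExecutor pool;

    ThreadManager(int initialThreads) {
        pool = new ThreadPoolExecutor(initialThreads, initialThreads,
            60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    // Shrink or grow the pool, e.g. after cores were (de)activated.
    void resize(int threads) {
        // Order matters: the maximum size must never drop below the core size.
        if (threads > pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(threads);
            pool.setCorePoolSize(threads);
        } else {
            pool.setCorePoolSize(threads);
            pool.setMaximumPoolSize(threads);
        }
    }

    public static void main(String[] args) {
        ThreadManager tm = new ThreadManager(8);
        tm.resize(2); // workload dropped, use fewer threads
        System.out.println(tm.pool.getCorePoolSize()); // 2
        tm.pool.shutdown();
    }
}
```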

Information Monitoring

Although process load in general is already collected by the OS, the platform provides this information as well. As the platform forms an abstraction layer between the OS and the application, access to the OS's information is not always possible. Furthermore, access to that information often differs between OSes. This load information is thus usually collected on the platform level, although it is generated on the OS level. In addition to load information, the platform also records the number of all active threads. Even though thread numbers are controlled using thread management, there are often additional threads which are not handled by thread management. This information can be useful to determine whether performance is affected by a large number of unmanaged threads. Libraries or non-actor code parts often use threads of their own, which also have an impact on the performance. Despite the small amount of information on this level, gathering it is nonetheless useful in order to improve the accuracy with which the configuration is determined.
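The JVM's management beans provide exactly this kind of platform-level information in an OS-independent way; the sketch below uses ThreadMXBean for the live thread count (which includes unmanaged threads) and OperatingSystemMXBean for the OS-reported load.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Sketch: platform-level monitoring via the JVM's management beans.
// ThreadMXBean reports all live threads (including those not handled
// by thread management), and OperatingSystemMXBean exposes the
// OS-reported system load without OS-specific code.
class PlatformMonitor {
    static int liveThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    static double systemLoadAverage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage(); // -1.0 if unavailable
    }

    public static void main(String[] args) {
        System.out.println(liveThreadCount());
        System.out.println(systemLoadAverage());
    }
}
```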


After dealing with the platform on which the application is executed, the next section shows how configuration and information monitoring are handled on the last level, the application level.

3.5 Application Level

The application level deals with the configuration of and information provided by the application. In this work, applications are understood as software which is run on a server or distributed over multiple servers. They offer services, process data or information, or are part of a system which provides a service for end users.

Furthermore, the workload of these applications is not constant, but changes during the application's execution. Consequently, there are many different metrics which are used to measure the performance and quality of service of an application. The performance of some applications is measured by the amount of processed requests in a certain time, while other applications are measured using the total amount of work performed by them. In addition to that, other quality-of-service aspects are also important properties of an application. For instance, latency is essential for services which involve user interaction or serve content to users, as high response times lead to users abandoning or leaving the service. Performance and quality of service are substantial to applications, because they affect an application's success significantly. That is the reason for collecting data about the whole system's state, which is used by the middleware platform to adjust resources while keeping the application's performance at adequate levels.

In order to use the ReApper middleware platform, the cooperation of application developers is required, because applications are expected to implement interfaces which are used by the platform to configure the application and to access data about performance and quality of service. This also results in certain properties which should be incorporated into the design of an application for the ReApper platform. Naturally, the application has to implement the actor model, as only actor programs are supported by the platform. Almost all processing or computation should be modeled as actors, and interaction between those actors should adhere to the principles of the actor model, which was explained in Section 2.1. The application should also be structured in a certain way and divided into actor groups, which is used by the middleware platform to configure the application. Additionally, it should be possible to dynamically add or remove actors. This quality is used in order to scale the application. In addition to that, actor groups and actors should be marked when they can be moved from one machine to another without restrictions. This is essential for actor migration and is described in Section 4.6.1. Aside from marking movable groups and actors, the application developer can add hints about the effects on energy and performance of moving a certain actor group. This information is used to determine the order in which actors and actor groups are moved. These aspects might seem like issues which the application developer has to take into consideration during the design process and implementation of an application, but oftentimes only minor changes are needed, and many of those aspects are meant as a guideline rather than fixed requirements.

Configuration

The application level offers two ways of configuration. The configuration performed on this level is more complex and has a larger impact on the application's execution than the methods on the previous levels. In addition to the introduction of the configuration methods, their benefits, advantages, and costs are also explained in greater detail.


Configuration Methods The first option is the configuration of application actors by adding or removing actors, which basically scales the application. As the number of actors affects an application's performance and energy consumption, actor scaling is a valuable tool for fine-tuning those properties. Increasing the number of actors can be used to improve qualities like throughput, as more actors are made available which can perform additional work. However, too many active actors can drop an application's performance, as they increase the number of messages in the system and thus also the amount of consumed resources. This means that there is an optimal number of actors, similar to an optimal number of cores or threads, which utilizes just enough resources to process the applied workload. For that reason, the number of actors is adjustable, and matching the number of actors to the workload contributes to reducing energy consumption and to optimally utilizing the resources provided by the configuration on lower levels. The second configuration option is actor migration, which is this work's main focus. This approach assumes that actors can be moved from one machine to another. Moving an actor to another machine lets the moved actor use that machine's separate pool of resources. Migration can be used to scale an application's performance by adjusting the number of actors and machines. During times of high workload, additional workers are created and placed on additional machines in order to increase the resources available to the application. The increase in resources and performance is accompanied by an increase in energy and power consumption, as more machines are running in order to provide additional resources. When the workload decreases again, previously added workers and machines are stopped and removed from the application. Naturally, this also decreases energy consumption, as the added machines are put into idle mode or powered down.
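A hedged sketch of the actor-scaling decision: the per-actor capacity below is a hypothetical, application-specific constant that a developer would determine by measurement, not a value prescribed by ReApper.

```java
// Sketch: choosing a worker-actor count from the observed workload.
// PER_ACTOR_CAPACITY (requests/second one worker can handle) is an
// assumed, application-specific calibration value.
class ActorScaler {
    static final double PER_ACTOR_CAPACITY = 500.0; // assumed

    static int workersFor(double requestsPerSecond, int minWorkers, int maxWorkers) {
        int needed = (int) Math.ceil(requestsPerSecond / PER_ACTOR_CAPACITY);
        return Math.max(minWorkers, Math.min(maxWorkers, needed));
    }

    public static void main(String[] args) {
        System.out.println(workersFor(1800.0, 1, 16)); // 4
    }
}
```

The clamping to a minimum and maximum mirrors the observation above that both too few and too many actors hurt performance.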

Configuration Costs Even though scaling can be achieved with this method, migration introduces various forms of additional costs. One major cost factor is the transfer of data and information during migration. As actors may contain state, the migration of an actor includes moving that state as well. The time and resources required to move an actor's state depend heavily on that state. There is the cost of transferring a state from one machine to another, which increases with the size of that state. In addition to the transfer, serialization is performed in order to enable that transfer. This uses processing time and power which is not available to process workload while a transfer is being performed. After the transfer is finished, each message which is sent between actors on different machines incurs the same costs for serialization and networking. Keeping the number of messages small minimizes these costs, but requires carefully designed and implemented applications. Preferably, communication inside an actor group is high, while communication between actor groups is kept at lower levels. This also impacts which actors or actor groups are moved, as that has a large influence on the number of messages which have to be serialized.
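These costs can be made concrete with a back-of-the-envelope estimate; the throughput figures below are assumptions a platform would calibrate per machine pair, not measured values from this work.

```java
// Sketch: a rough estimate of migration cost, combining serialization
// and network transfer of an actor's state. Both rates are assumed
// calibration values, in bytes per second.
class MigrationCost {
    // Estimated seconds to move `stateBytes` of actor state.
    static double migrationSeconds(long stateBytes,
                                   double serializeBytesPerSec,
                                   double networkBytesPerSec) {
        return stateBytes / serializeBytesPerSec
             + stateBytes / networkBytesPerSec;
    }

    public static void main(String[] args) {
        // 100 MB state, 200 MB/s serialization, 100 MB/s network.
        System.out.println(migrationSeconds(100_000_000L, 200_000_000.0, 100_000_000.0)); // 1.5
    }
}
```

Such an estimate, together with developer-provided hints, could inform the order in which actor groups are moved.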

Benefits of Migration at the Application Level Nonetheless, migration can be used to save energy. This is achieved by exploiting the heterogeneity of the hardware used to execute the application. At low workloads, hardware with high energy efficiency is used to execute the application. This class of hardware was called low-range in Chapter 2. The kind of hardware used for low workloads achieves higher energy efficiency, as it has lower static and maximum dynamic power consumption. However, this also limits the maximum performance which this kind of hardware is able to reach. At high levels of workload, stronger machines with lower energy efficiency are used. These machines have higher static and dynamic power consumption, but provide enough computing resources so that even high workloads can be handled. This type of hardware was referred to as high-range earlier. For workloads in-between, hardware which provides a compromise between performance and energy


efficiency is used for execution. This group of hardware was described as the mid-range class in the previous chapter. Despite not having the highest energy efficiency among the hardware classes, it provides an acceptable level while still achieving higher levels of energy efficiency than the most performant hardware class. Furthermore, machines of the low- and mid-range classes can be combined to handle workloads which cannot be managed by the mid-range class by itself. While these kinds of workloads can be handled by the high-range class without problems, using a combination of low-range and mid-range hardware might be more energy efficient despite the additional running costs caused by migration. Consequently, migration is utilized not only to scale an application, but also to lower an application's energy consumption by exploiting heterogeneous hardware.
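The class selection described above can be sketched as a simple threshold function; the boundaries used here are hypothetical fractions of the high-range machine's capacity, whereas a real platform would derive them from measured energy-efficiency curves.

```java
// Sketch: selecting a hardware class for the current workload level.
// The boundary fractions are assumed placeholders; a real platform
// would derive them from measured energy-efficiency data.
class HardwareSelector {
    enum HwClass { LOW_RANGE, MID_RANGE, HIGH_RANGE }

    static HwClass classFor(double workload, double highRangeCapacity) {
        double fraction = workload / highRangeCapacity;
        if (fraction <= 0.15) return HwClass.LOW_RANGE;   // assumed boundary
        if (fraction <= 0.55) return HwClass.MID_RANGE;   // assumed boundary
        return HwClass.HIGH_RANGE;
    }

    public static void main(String[] args) {
        System.out.println(classFor(100.0, 1000.0)); // LOW_RANGE
        System.out.println(classFor(400.0, 1000.0)); // MID_RANGE
        System.out.println(classFor(900.0, 1000.0)); // HIGH_RANGE
    }
}
```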

Information Monitoring

As mentioned at the beginning of this section, information on the application level is vital, because it reflects how well an application is performing and reveals the level of quality provided by that application. At this level, information about the application's performance is collected by ReApper. This performance information serves as orientation and basis for the application configuration.

Performance Information The performance information is an indicator for the workload which is currently applied to the application. This information is used for the selection of a configuration. Different configurations are able to handle different levels of workload, and the performance information narrows the set of configurations down to the ones which are viable at the indicated workload level. The process or method for generating and measuring performance information depends heavily on the application. Because of this, the middleware platform relies on the application developer to provide this information.
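As one possible shape for developer-provided performance information, an application could expose a small throughput probe like the following; the interface is illustrative, not ReApper's actual API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a minimal throughput probe an application developer could
// expose to the platform. Request handlers call recordRequest(); the
// platform periodically calls snapshotPerSecond() for requests/second.
class ThroughputProbe {
    private final AtomicLong counter = new AtomicLong();
    private long lastSnapshotNanos = System.nanoTime();

    void recordRequest() {
        counter.incrementAndGet();
    }

    // Requests per second since the previous snapshot.
    synchronized double snapshotPerSecond() {
        long now = System.nanoTime();
        double seconds = (now - lastSnapshotNanos) / 1e9;
        lastSnapshotNanos = now;
        return counter.getAndSet(0) / seconds;
    }

    public static void main(String[] args) {
        ThroughputProbe probe = new ThroughputProbe();
        for (int i = 0; i < 1000; i++) probe.recordRequest();
        System.out.println(probe.snapshotPerSecond() > 0); // true
    }
}
```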

Quality of Service Aside from the information about performance, information about other quality-of-service aspects is also collected, as this shows how much the user experience is affected and whether it is still at acceptable levels. As mentioned earlier, latency is a measure for quality of service. High response times can be very harmful to user satisfaction and in turn to an application's overall quality, while low response times improve a user's experience and raise satisfaction with an application. Throughput is another quality which affects a user's experience with a service substantially as well. It has an impact on the number of users which a service or an application is able to serve concurrently. An application's maximum throughput limits the number of concurrent user requests; additional user requests which exceed that number are rejected. As a result, the throughput should always be high enough to accommodate all user requests. This avoids having to reject user requests, which has negative effects on user experience and satisfaction. All of this application-level information has a substantial influence on an application's success and is therefore essential for the platform's information architecture.

3.6 Conclusion

This work's ReApper platform reduces energy and power consumption by employing configuration and information monitoring on different system levels. The hardware level allows fine-tuning of computing resources such as the processor. At the same time, information about the performance and energy consumption of those resources can be collected at this level. The OS level enables


configuration of a system's abstracted resources such as processes and OS threads. These resources are allocated by applications or platforms which are executed on top of the OS. Furthermore, the OS level provides information about the system's performance from the view of the OS, as it manages all running processes of a system. The platform level embodies another abstraction layer on top of the OS level. It allows configuration of threads, which are managed by the platform but used by the application to perform computation. The information on this level shows the load levels from the platform's view. The topmost level offers configuration of the application's structure and distribution. This level also provides vital information about the application workload for the ReApper platform, which makes it particularly important for this work.

The information from the four introduced levels is constantly monitored in order to ensure that certain monitored values such as the application's throughput do not exceed or fall below certain thresholds. Furthermore, this information provides the basis for decisions concerning the system's configuration. The configuration methods on the different levels provide the tools which are used to lower energy and power consumption, while keeping key values such as the application performance at a satisfactory level.

After explaining how ReApper's implementation was designed, the next chapter describes the prerequisites of the implementation and the implementation itself in detail.


4 Implementation

This chapter presents the implementation details of ReApper, after its architecture and design have been presented in the previous chapter.

Firstly, prerequisites which are required to understand the implementation are introduced. Some aspects of Akka are described in more detail, as those aspects were used together with the concepts introduced in the previous chapter to implement the ReApper platform.

Secondly, the implementation of ReApper on a single machine is described. An overview of all implemented components and their relation to each other is given. Then, each implemented component is introduced, starting with the ReApper actor system, which is a modified version of Akka's actor system. After that, the configurator and its subcomponents are presented. The configurator offers the configuration methods which have been introduced in the previous chapter. However, these configuration methods are implemented by several smaller self-contained components, which are encapsulated by the configurator. Then, the gatherer and its subcomponents are described. The gatherer provides the previous chapter's information-monitoring methods, which are used to track certain system values such as workload or energy consumption. The gatherer is also a wrapper for several smaller components, which actually provide the implementation of the information-monitoring methods.

Lastly, the implementation of the distribution of the ReApper platform is explained. The distribution is performed using several of Akka's built-in functionalities. However, these functionalities are extended by an actor-migration mechanism, which allows ReApper to move the application between different machines. This ability is also used to exploit the properties of heterogeneous hardware, which allows ReApper to lower the application's energy consumption even further.

4.1 Prerequisites

As shown in the previous chapter, configuration and information monitoring is performed on various levels with a wide range of different tools. While processes on the lower levels are mostly implemented with tools independent from Akka, the platform and application levels are closely intertwined with the Akka toolkit. Some configuration tools and methods on those levels exploit features or aspects of Akka which were meant to be used for other purposes. In order to understand how those tools and methods have been implemented, the following sections describe some of those aspects in greater detail.


4.1.1 Actor Structure

The concept of actors is implemented by separating an actor's public interface and its internal state. The actor class which implements an actor's internal state extends the UntypedActor class in Java and the Actor trait in Scala. However, those actors cannot be instantiated directly, but are created by the actor system; this creation can be requested by providing a property object (Props object) to the actor system. This property object contains the information required by the actor system to instantiate an actor, which includes the desired class, parameters for that class' constructor, and additional deployment information. The property object can be thought of as an immutable recipe for creating an actor. In addition to the property object, a name can also be provided, which can be used to retrieve and identify the actor at a later time. Akka does not depend on the default constructor and custom constructors can be provided, which are used by Akka to instantiate actors. These constructors can also be used to inject dependencies for an actor. Instantiated actors are started asynchronously as soon as their construction is finished. Extending (Untyped)Actor forces the child class to implement the receive method. This method is called to process the messages an actor receives and is part of an actor's behavior.
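The "immutable recipe" idea can be sketched in a few lines of plain Java. The classes below (`ActorRecipe`, `CounterActor`) are hypothetical simplifications for illustration, not Akka's actual `Props` API: the recipe captures the target class and its constructor arguments, while the runtime decides when to instantiate.

```java
import java.util.function.Supplier;

// Simplified, hypothetical sketch of the "immutable recipe" behind Akka's
// Props: the recipe captures how to construct the actor instance; only the
// actor system (not user code) is supposed to trigger instantiation.
final class ActorRecipe<T> {
    private final Supplier<T> constructor;   // captures class + constructor args
    private final String name;               // optional name for later lookup

    ActorRecipe(Supplier<T> constructor, String name) {
        this.constructor = constructor;
        this.name = name;
    }

    String name() { return name; }

    // In Akka, the counterpart of this call happens inside actorOf.
    T instantiate() { return constructor.get(); }
}

class CounterActor {
    final int start;
    CounterActor(int start) { this.start = start; }  // custom constructor, no default needed
}

public class RecipeDemo {
    public static void main(String[] args) {
        // Analogous to system.actorOf(Props.create(CounterActor.class, 42), "counter").
        ActorRecipe<CounterActor> recipe =
                new ActorRecipe<>(() -> new CounterActor(42), "counter");
        CounterActor actor = recipe.instantiate();
        System.out.println(recipe.name() + " starts at " + actor.start);
    }
}
```

The same recipe object can be reused to create many identical actors, which is one reason Props is immutable.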

4.1.2 Actor Behavior

In Akka, the behavior of an actor mainly consists of processing incoming messages, which is performed when the receive method is called. The application developer decides which messages are going to be processed in the receive method and how. During execution of receive, the actor state can be accessed and also changed or replaced. These decisions are made depending on the actor's state during message processing. As messages are not filtered by Akka before they are put into mailboxes, filtering is performed by the receive method. The type of messages is not restricted; they are of Object type, but a message's actual type can be checked with Java's instanceof or Scala's isInstanceOf[Type]. All incoming messages cause a call to the actor's receive method, but messages can be dropped by this method freely. This means that not all messages need to be processed. It is recommended to use a communication protocol between actors to describe which types of messages an actor is able to process and which types of messages are supposed to be sent to certain actors. During message processing, several methods offered by Akka's Actor API [21] provide additional information:

• getSelf returns the actor’s own ActorRef, which can be included in a response message.

• getSender returns the sender's ActorRef of the message which is currently being processed. This can be used to reply to the sending actor.

• getContext exposes contextual information for the actor and the message which is currently being processed. This information includes factory methods to create child actors (actorOf), methods to access the encompassing actor-system instance, and methods to get references to the parent or supervisor and the supervised children (among other methods).

Nonetheless, there are some system-internal messages which are always handled by the default implementation of the actor interface. System-internal messages include suspending, restarting, and killing an actor.
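The type filtering described above can be sketched without Akka. The `EchoActor` below is a self-contained stand-in, not an Akka class: messages arrive as untyped Objects, the actual type is checked with instanceof, and unmatched messages are simply dropped.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of a receive method filtering by message type,
// as an Akka UntypedActor's receive typically does.
class EchoActor {
    final Queue<String> replies = new ArrayDeque<>(); // stands in for outgoing messages
    int state = 0;                                    // actor-internal state

    void receive(Object message) {
        if (message instanceof String) {
            replies.add("echo: " + message);          // reply to a text message
        } else if (message instanceof Integer) {
            state += (Integer) message;               // state may change during processing
        }
        // anything else is dropped without error: not all messages must be processed
    }
}

public class ReceiveDemo {
    public static void main(String[] args) {
        EchoActor actor = new EchoActor();
        actor.receive("hello");
        actor.receive(5);
        actor.receive(new Object());                  // silently ignored
        System.out.println(actor.replies.peek() + ", state=" + actor.state);
    }
}
```

In real Akka code the same pattern appears inside `onReceive`/`receive`, usually with a final branch that calls `unhandled(message)` instead of dropping silently.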

4.1.3 Actor Supervision

Child actors can be created by any actor. All child actors are supervised by the creating actor and managed as a list in the creating actor's context. Any actor can create additional children or stop existing children at will. Creation and termination are performed asynchronously without blocking the parent. Fundamental to Akka actors is the delegation of tasks and supervision of child actors. This is also fundamental to ReApper's migration mechanism, as movement of actor groups involves shutting down said group before it can be moved to another machine. Instead of shutting down each actor individually, which would take a long time and is often not possible because of highly complex actor hierarchies, Akka's supervision functionality is used to shut down a group's supervisor. This causes the underlying hierarchy to be shut down in an orderly manner. The supervision strategy is used to determine how a failure of a child is handled by the parent actor. It applies to all children and is final after it has been set during creation of that actor. Upon experiencing a failure (i.e., throwing an exception), the child actor suspends itself and its subordinates and notifies its supervisor.

Supervision Strategies The supervisor or parent actor has several options to handle such a situation. Firstly, it can resume the child actor, keeping the child's current state. Secondly, the supervisor can restart the child, clearing the child's state. Thirdly, the failing child can be stopped permanently. Lastly, the parent can escalate the failure to its own parent, which causes itself to fail. Resuming an actor always entails resuming all its children, while restarting an actor always entails restarting all its children. This also applies to termination of an actor and is part of Akka's default behavior. Escalation of failure propagation can continue until it reaches one of the guardians, which are special actors managed by the actor runtime.
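The four options map naturally onto a decider function from failure to directive. The sketch below is a self-contained simplification in the spirit of Akka's `OneForOneStrategy` decider, not Akka code; the exception-to-directive mapping is an arbitrary example.

```java
// The four supervision directives described above.
enum Directive { RESUME, RESTART, STOP, ESCALATE }

// Hypothetical decider: maps a child's failure to one of the four options.
class SupervisorStrategy {
    static Directive decide(Throwable failure) {
        if (failure instanceof ArithmeticException)   return Directive.RESUME;   // keep child state
        if (failure instanceof IllegalStateException) return Directive.RESTART;  // clear child state
        if (failure instanceof NullPointerException)  return Directive.STOP;     // terminate child
        return Directive.ESCALATE;                                               // fail upwards
    }
}

public class SupervisionDemo {
    public static void main(String[] args) {
        System.out.println(SupervisorStrategy.decide(new ArithmeticException())); // RESUME
        System.out.println(SupervisorStrategy.decide(new OutOfMemoryError()));    // ESCALATE
    }
}
```

In Akka, such a decider is passed to the strategy once, at actor creation, which matches the statement above that the strategy is final after it has been set.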

Guardians There are three guardians in total, which are created during the startup process of the actor runtime system. The /user guardian actor is the supervisor of user-defined actors created by using the actor system directly. If the /user guardian terminates, all user-defined actors are also terminated. The /user guardian's strategy can be configured, but if failures are escalated further than the /user guardian, the whole actor system is terminated. The second guardian is the /system guardian. It is used for system-related actors such as actors which are used for logging purposes. When the /user guardian terminates, the /system guardian follows shortly after that. Indefinite restarting is the supervision strategy which is used for all top-level system actors for all types of Exceptions (except for some special types of Exceptions). All other throwables are escalated, which will cause the whole actor system to shut down. The root (/) guardian is the top-most actor in the whole hierarchy. Its sole responsibility is to supervise the other guardians. The supervision strategy employed by the root guardian is the stopping strategy, which terminates the child upon encountering any Exception. Other throwables cause the whole actor system to shut down. The supervision strategy of the root guardian is fixed to the stopping strategy and cannot be changed.

Actor Monitoring Aside from supervision, it is also possible to monitor other actors without being related to them. This is called lifecycle monitoring. Lifecycle monitoring allows actors to be notified when another actor's state is changed. However, this is restricted to the transition from alive to dead, as other state changes are not visible to actors aside from the supervisor. In order to receive a notification about another actor's death or termination, the monitoring actor registers this interest with its ActorContext by specifying which ActorRef it wants to monitor. When the specified actor dies, a built-in Terminated message is sent to the monitoring actor. This message will be delivered even if the specified actor had already been terminated before the monitoring request was sent. This kind of monitoring is useful when actors have a dependency on each other without being directly related to each other, and enables them to adjust their behavior accordingly in case of a failure or deliberate termination.
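The death-watch mechanism can be sketched with an observer pattern. The `WatchableActor` class below is a self-contained illustration, not Akka's API; it captures the two behaviors described above: live watchers are notified on termination, and watching an already terminated actor still delivers a Terminated notification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of lifecycle monitoring (death watch).
class WatchableActor {
    private boolean alive = true;
    private final List<WatchableActor> watchers = new ArrayList<>();
    final List<String> inbox = new ArrayList<>();     // stands in for the mailbox
    final String name;

    WatchableActor(String name) { this.name = name; }

    // Analogous to context.watch(targetRef).
    void watch(WatchableActor target) {
        if (!target.alive) {
            // delivered even if the target died before the watch request
            inbox.add("Terminated(" + target.name + ")");
        } else {
            target.watchers.add(this);
        }
    }

    void stop() {
        alive = false;
        for (WatchableActor w : watchers) w.inbox.add("Terminated(" + name + ")");
    }
}

public class DeathWatchDemo {
    public static void main(String[] args) {
        WatchableActor worker = new WatchableActor("worker");
        WatchableActor monitor = new WatchableActor("monitor");
        monitor.watch(worker);
        worker.stop();
        System.out.println(monitor.inbox);   // [Terminated(worker)]
    }
}
```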


4.1.4 Actor Lifecycle

Figure 4.1 shows the lifecycle of an Akka actor. The ActorPath represents a “place” which might be inhabited by an actor. All ActorPaths are initially empty except for the system actors and guardians. Every time an actor is created (with actorOf), an actor instance is created simultaneously which is assigned to a path. This starts an actor's lifecycle. A path together with a randomly assigned UID identifies the actor incarnation. The actor incarnation is the structure which manages the mailbox semantically. Upon calling actorOf, the actor's preStart hook is executed. This method can be overwritten and is used to execute operations before the actor is allowed to process messages. When an actor is restarted, only the actor instance is replaced; the incarnation, UID, and path remain the same. When an actor is stopped with the stop method, context.stop, or a PoisonPill, the postStop hook is called and all that actor's watchers are notified of its termination.

Figure 4.1 – Lifecycle of an actor [22]

Unlike object references, an actor does not get destroyed when its reference count reaches zero. When an actor's reference is not known to other actors, it is also not receiving messages anymore. Therefore, actors whose reference count reaches zero are not executed anymore and fall into an idle state. In this state, they may receive messages but are effectively lying dormant. These idle actors only occupy memory; they do not use any computing resources anymore. The postStop hook is used to clean up an actor. It can be overwritten in order to perform actions when an actor is stopped or terminated. This is where resources like database connections or files are closed or freed. After processing the postStop hook, the path is free again and a new actor can be created on this path. An actor cannot be resurrected once its life has ended, and a newly created actor on the same path will have a different UID, which signals that this actor is a distinct entity from the previous inhabitant. An actor's lifecycle ends after the postStop hook has finished execution. This feature is exploited to allow the migration mechanism of the ReApper platform to detect whether its actors have been shut down properly. Furthermore, actor lifecycles work differently with the custom actor references used for migration in ReApper, as these actor references add another layer on top of the regular lifecycle. However, actors with migration capabilities also adhere to this lifecycle, though with some differences in semantics.
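The role of the preStart and postStop hooks can be sketched with a minimal lifecycle class. This is a hypothetical simplification for illustration; Akka's real hooks live on its actor base classes and are invoked by the actor system, not by user code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lifecycle sketch: preStart runs before message processing is
// allowed, postStop runs during termination to release resources.
abstract class LifecycleActor {
    final List<String> log = new ArrayList<>();

    void preStart() { }   // overridable: setup before the first message
    void postStop() { }   // overridable: cleanup on termination

    final void start() { preStart(); log.add("started"); }
    final void stop()  { postStop(); log.add("stopped"); }  // afterwards the path is free again
}

class DbActor extends LifecycleActor {
    boolean connectionOpen = false;

    @Override void preStart() { connectionOpen = true; }   // e.g., open a database connection
    @Override void postStop() { connectionOpen = false; }  // e.g., close it again
}

public class LifecycleDemo {
    public static void main(String[] args) {
        DbActor actor = new DbActor();
        actor.start();
        System.out.println("open=" + actor.connectionOpen);
        actor.stop();
        System.out.println("open=" + actor.connectionOpen);
    }
}
```

ReApper's migration mechanism relies on exactly this postStop point: once the hook has run, the actor is known to be shut down properly and its path can be reused on another machine.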

4.1.5 Actor Systems

The ActorSystem class is the main interface for actors in Akka. Furthermore, it provides a context for actors and manages resources and services which are used by actors.

Configuration When an ActorSystem is created, a Config object is used to initialize those resources and services. The configuration contains values to adjust characteristics like logging, remoting, serializers, and dispatchers. During the construction of an ActorSystem, the user-defined application.conf file is read. This file is then merged with a bundled reference.conf, and any value not set by application.conf is added from reference.conf. The combined configuration is passed to the ActorSystem afterwards; the ActorSystem is then created according to the configuration set in those files. Parsing the configuration files results in a Config object, which contains the parameters of the parsed files. Aside from the configuration file, configuration parameters can also be set programmatically before the Config object is used to create a new ActorSystem.
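The fallback semantics of this merge can be sketched with plain maps: every key missing from the user configuration is filled in from the bundled defaults, and user values win on conflicts. The map-based `withFallback` below is a self-contained illustration of the principle, not the actual Config library.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigMergeDemo {
    // Sketch of application.conf overriding reference.conf defaults.
    static Map<String, String> withFallback(Map<String, String> user,
                                            Map<String, String> reference) {
        Map<String, String> merged = new HashMap<>(reference); // defaults first
        merged.putAll(user);                                   // user values win
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> reference = new HashMap<>();       // bundled reference.conf
        reference.put("akka.loglevel", "INFO");
        reference.put("akka.actor.provider", "local");

        Map<String, String> application = new HashMap<>();     // user-defined application.conf
        application.put("akka.loglevel", "DEBUG");             // overrides the default

        Map<String, String> config = withFallback(application, reference);
        System.out.println(config.get("akka.loglevel"));       // DEBUG
        System.out.println(config.get("akka.actor.provider")); // local
    }
}
```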

Dispatcher The dispatcher is the component which implements and offers an ExecutionContext. An ExecutionContext is similar to java.util.concurrent.Executor and provides an interface which can be used to execute arbitrary code like Futures or Threads. Usually, Akka uses the default dispatcher defined in reference.conf, which is by default a “fork-join-executor”. Other than the “fork-join-executor”, there is also the “thread-pool-executor”, which is implemented by java.util.concurrent.ThreadPoolExecutor. In addition to that, Akka also provides two specialized dispatchers: the PinnedDispatcher dedicates a unique thread to each actor using it, while the CallingThreadDispatcher runs all actor invocations on the current thread without creating additional threads. Aside from that, the underlying ExecutorServices can be further tweaked by setting values for properties like the minimum size, the maximum size, or the growth factor, which determine the number of threads created and managed by an ExecutorService. When an actor is created, the default dispatcher is used to execute that actor. It is also possible to declare additional custom dispatchers to which actors can be assigned on creation. However, these additional dispatchers cannot be dynamically added to an already existing actor system.
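The sizing parameters mentioned above map directly onto the constructor of java.util.concurrent.ThreadPoolExecutor, which needs no Akka dependency. The concrete numbers here are arbitrary examples.

```java
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A "thread-pool-executor" configured with explicit minimum and maximum
// pool sizes, independent of Akka.
public class ExecutorDemo {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2,                        // core ("minimum") pool size
                8,                        // maximum pool size
                60, TimeUnit.SECONDS,     // idle threads above the core size die after 60s
                new LinkedBlockingQueue<>());

        // In Akka's thread-pool-executor dispatcher, actor invocations are
        // submitted as tasks to a pool like this one.
        Future<Integer> result = executor.submit(() -> 6 * 7);
        System.out.println(result.get()); // 42
        executor.shutdown();
    }
}
```

Akka exposes the same knobs declaratively, as dispatcher settings in the configuration file rather than constructor arguments.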


Actor Providers Even though the ActorSystem provides the interface for creating actors, the component which is actually providing this service is the ActorRefProvider. The ActorSystem delegates the creation of new actors to the ActorRefProvider. Furthermore, the ActorRefProvider manages all the guardians and the dead-letters mailbox. Actors are created by using the actorOf method, which is offered by the ActorSystem and implemented by the ActorRefProvider. The public actorOf method takes the following parameters: a property object, which was described earlier, and an optional name, which is used to name the actor. Additionally, the Props object may contain information about an actor's deployment. The deployment information is used to signal whether an actor is created remotely.

4.1.6 Remoting

As described in Chapter 2, remoting is a central part of Akka's actor implementation. This section explains some aspects of remoting which go beyond the fundamentals of Akka but are needed to understand the implementation of ReApper's actor migration.

Remote-actor Creation There are two methods which can be used to put an actor onto a remote actor system. The first method is configuration-based and requires declaring remote actors prior to execution of an application. The other method is to add deployment information to the Props object during the creation of an actor. The deployment information is enclosed inside a Deploy object and contains a description of the remote host on which the actor is going to be deployed. In addition to that, the Props object used to create the remote actor has to be serializable, as this information is transferred to the remote host and used to create the actor on the remote machine.

Remote References and Paths Remote actors cannot be reached with ordinary ActorRefs; they use a special type of ActorRef instead. In contrast to local-only scenarios, the distributed scenario uses a different kind of ActorRefProvider which is able to create RemoteActorRefs. The RemoteActorRefProvider creates actor references with regard to remoting and is usually not used in a local-only system. Aside from denoting remote deployment of an actor, RemoteActorRefs have ActorPaths which behave differently in a distributed scenario.

Remote-message Delivery Figure 4.2 shows message delivery between two actor systems on different machines. The system on the left-hand side is called sys and listens on port 2552 of host A, while the system on the right-hand side listens on port 2552 of host B. The parent actor is part of the actor system on host A and the child actor lives in the system on host B. Both are instances of LocalActorRef inside their home system, but are represented by a RemoteActorRef on their remote system. These RemoteActorRefs route messages to the local actor reference on the actor's home system. Logically, there is no distinction between remote and local actors, and the actor path shows the supervision hierarchy (red arrows in Figure 4.2). The physical path reflects the remote nature of these actor references by adding a remote daemon actor in-between the actor systems (green arrow in Figure 4.2). Both actor systems contain a remote daemon; however, only the remote daemon of actor system B is of relevance for the example depicted in Figure 4.2, and thus actor system A's remote daemon is not shown.


Logical actor path: akka.tcp://sys@A:2552/user/parent/child

Physical actor path: akka.tcp://sys@B:2552/remote/sys@A:2552/user/parent/child

Figure 4.2 – Akka remote path (based on [23])
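The relation between the logical and the physical path can be illustrated with a few lines of string handling: the physical path prefixes the logical path's relative part with the hosting system's address, the /remote marker, and the home system's authority. The helper below is purely illustrative, not Akka's path implementation.

```java
// Hypothetical sketch of the path rewriting implied by Figure 4.2.
public class RemotePathDemo {
    static String physicalPath(String logicalPath, String remoteSystem) {
        // logicalPath example: akka.tcp://sys@A:2552/user/parent/child
        int schemeLen = "akka.tcp://".length();
        int slash = logicalPath.indexOf('/', schemeLen);
        String authority = logicalPath.substring(schemeLen, slash); // sys@A:2552
        String relative  = logicalPath.substring(slash);            // /user/parent/child
        return "akka.tcp://" + remoteSystem + "/remote/" + authority + relative;
    }

    public static void main(String[] args) {
        System.out.println(physicalPath("akka.tcp://sys@A:2552/user/parent/child", "sys@B:2552"));
        // akka.tcp://sys@B:2552/remote/sys@A:2552/user/parent/child
    }
}
```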

4.2 ReApper Middleware Platform

The ReApper platform’s implementation provides a middleware platform for actors. The implementa-tion uses various tools to adjust energy consumption to the applied workload. Energy proportionalityis the main goal during execution, but other aspects like performance or lowest energy consumptioncan also be focused. Furthermore, the energy efficiency of distributed actor applications is improvedto a greater extent, as migration of actors is utilized to move actors to more energy-efficient machines.

The platform is implemented in Java and uses the Akka toolkit as its implementation of the actor model. In order to use the ReApper platform, applications can either use Java or Scala, as Akka provides APIs for both languages. Additionally, applications written for the Akka toolkit can also be used with the ReApper platform. However, using Akka applications requires some minor code changes, as the middleware platform uses modified versions of Akka's actor system. Many tools used for configuration and information monitoring are written in C/C++ and use native code which performs low-level calls to the OS or hardware. Other configuration tools are provided by the OS and are not directly available in the JVM. All of these tools have been wrapped in Java interfaces in order to provide their functionality to the JVM. Their functionality is bundled into modules which are added to the Akka actor systems. In addition to native tools, some configuration is performed in Akka itself, which has been modified to allow this configuration to happen. Thus, most configuration components are independent from Akka and only additions, but some configuration tools are deeply integrated into Akka itself. Moreover, the distribution capabilities of Akka are exploited in order to enable migration, because Akka has no support for manual migration of actors.

This section illustrates the implementation of the ReApper middleware platform. First, an overview of the implemented platform is given. The overview presents the components of the platform and the relationships between them. After that, each component is described in detail: their roles, their functionalities, and the tools which are used as their basis are explained. Lastly, details of the implementation of ReApper's actor migration with Akka are given.

Overview

The implementation’s structure is illustrated in Figure 4.3. The ReApperActorSystem is the plat-form’s core component. It enriches the ActorSystem of Akka with additional components andprovides the runtime environment for the application. All functions and methods provided by theActorSystem are also provided by the ReApperActorSystem. Moreover, the ReApperActorSystemhas the same class as ActorSystem without using inheritance, because the modifications are per-formed through configuring Akka’s ActorSystem class. Thus, the ReApperActorSystem is only asemantic wrapper for the modifications performed by the ReApper platform and no actual class withthis name exists. Therefore, the ReApperActorSystem can directly replace ActorSystems in code,although the initialization process differs. Because of that, applications written for Akka can bemodified to be used with the ReApper platform by replacing the initialization call for Akka’s actorsystem for ReApper’s actor system. However, additional functions, which are only provided by theReApperActorSystem, need further modifications to the application code.

The main components which have been added to the ReApperActorSystem are the Configurator and the Gatherer. The Configurator contains several submodules which are used for configuration on various system levels and aspects. Furthermore, it is hooked into the ReApperActorSystem's dispatcher and replaces the underlying execution context with a custom ExecutorService. All processes and tools used for information collection are centralized as the Gatherer. It is structured similarly to the Configurator and contains different submodules which provide the ability to collect data and information from various system levels and aspects of the system. Those two components provide their functions publicly and can be used by any entity in the same ReApperActorSystem, including actors. Alternatively, their functions are also provided by the ManagerActor, which delegates function calls to the Configurator and Gatherer. The ManagerActor is implemented as a system actor. It processes configuration requests by calling configuration methods provided by the Configurator and information requests by replying with information returned by the Gatherer.

Figure 4.3 – Structure of customized Akka in the ReApper platform

Moreover, the platform can also be distributed. The overview of the distributed system is shown in Figure 4.4. Every machine in the distributed scenario contains its own instance of an ActorSystem, ManagerActor, Gatherer, and Configurator. These components are separate and only work on the machine on which they are located. All machines share the same application, which is running on top of the ActorSystem. Usually, the application consists of several actors which work together to provide a service. In this distributed scenario, the application's actors are running on both machine A and machine B. These actors are part of the same application, but do not necessarily fulfill the same function. Moreover, these actors can be moved between different actor systems in order to either scale an application or save energy. The ReApper platform provides a migration mechanism, which is directly built into the ReApperActorSystem for that purpose.

Because the ReApper platform is architecture-independent, the hardware which is used to execute the platform and application can differ from machine to machine. This independence is used to lower energy consumption, as properties of heterogeneous hardware are exploited and the application's actors are moved accordingly. Furthermore, the configuration on each machine can also vary. By separating a machine's configuration from the application, each machine can be controlled in a fine-grained manner and adjusted to match different requirements.

Despite all ActorSystems being on separate machines, all machines communicate with each other in order to exchange information about their own system's state and configuration with the help of each system's ManagerActor. This information can be used to determine how an application's actors should be distributed over all available systems.

4.3 ReApper Actor System

The ReApper Actor System is the environment used by the ReApper platform to execute actor applications. It uses Akka's default ActorSystem as its base and adds several components in order to provide configuration and information-monitoring capabilities. Furthermore, compatibility with Akka's default ActorSystem is guaranteed by adding the modules via configuration instead of using class inheritance. However, the custom actor system is initialized using a separate class, and this requires changes in application code, as initialization of the ActorSystem differs. All calls to ActorSystem.create have to be replaced with either ReApperActorSystem.create or with usage of the ActorSystemBuilder. The ReApperActorSystem class provides factory methods which can be used to create different types of ActorSystems. There are options for a local-only custom actor system, a distributed custom actor system, and a remote default actor system. The remote default actor system is used to start additional actor systems on which the application is not initially started, and it enables migration to those systems. In addition to the factory class, there is also the ActorSystemBuilder, which can be used to configure a system manually.

Figure 4.4 – Overview of the distributed ReApper platform

In order to configure a system manually, an ActorSystemBuilder is created. Then the configuration can be set using ActorSystemBuilder's set methods. As the builder pattern is used in this class, an ActorSystem can be created using the build method. The returned ActorSystem can then be used like a regular ActorSystem, and the configuration set by the builder is applied to it. The custom actor system's configuration contains six elements:

Name The name serves the same purpose as it does for Akka's ActorSystem class. It is used as an identifier for the actor system. This is useful to denote an actor system's purpose, to indicate which application is running on the system, or to mark some property or other information. The name is set with setName and takes the name as a String parameter.

Config The Config is the same configuration class that is used by Akka. It serves the same purpose and wraps configuration parameters set in application.conf and default.conf, which are used by the actor system to initialize modules and set certain values. A Config object can also be set using the setConfig method, in case programmatic configuration needs to be added.

Executor The executor provides the execution context which is used to execute Java threads. These threads in turn are used to execute the application's actors. Akka offers several different executor services by default. However, these are only configurable before starting the application and provide a static number of threads during the application's execution. In order to allow reconfiguration during runtime, a customized executor is passed to the builder before the actor system is created. Options for executors are a customized ThreadPoolExecutor and a customized ForkJoinExecutor. These custom executors can be accessed by the system's configurator, which uses that access to adjust the number of threads. The executor is set with setExecutor and requires either the fully qualified name of the ExecutorService class or the ExecutorService class itself as a parameter.

Actor Groups When actors are created, they can be assigned to an actor group. Actor groups share a common executor which is used to execute all actors of one group. Usually, actors use the default executor and share that executor's threads with all other actors. An actor group with its own executor limits thread sharing, but introduces another thread pool which manages threads and requires processor time for execution. The setGroups method takes an array of Strings as its parameter. Each String in that array is used to create an executor, and the same String is used to denote an actor group during actor creation. The actor group is added to the Props object which is used to create an actor. Calling withDispatcher with the group name as parameter adds the actors created with that Props object to that actor group.
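In plain Akka, a dedicated executor per actor group corresponds to a custom dispatcher declared in the configuration. A minimal sketch of such an entry in application.conf (the dispatcher name my-group and the parallelism values are illustrative, not taken from ReApper):

```hocon
# application.conf — a dispatcher acting as an actor group's executor
my-group {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-factor = 2.0
    parallelism-max = 8
  }
  throughput = 5
}
```

An actor created with Props.create(MyActor.class).withDispatcher("my-group") is then executed by that dispatcher's thread pool rather than the default one.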

Manager The manager provides access to configuration and information-monitoring functions within an actor system. It is implemented as an actor and makes these capabilities available to other actors. Moreover, configuration and information can also be controlled from a remote system in case remoting has been activated. Even though access to configuration and information is also provided by other means, it is recommended to use the manager actor. Furthermore, the manager actor is also capable of periodically collecting a system's information, and it can log this information to a file. The manager is activated with the setReApperManager method, which prompts the initialization of a manager actor during actor-system creation.

Remoting ActorSystems do not support distribution and remoting by default. However, this feature can be activated before creation of an actor system. Activation of remoting results in replacement of the default ActorRefProvider with a customized one. In ReApper's implementation, the MigratingActorRefProvider is used, as it allows manual actor migration, in contrast to the RemoteActorRefProvider, which only enables distribution and remoting. Nonetheless, both can be set as a system's ActorRefProvider. The setRemoteActorRefProvider method results in initialization of a RemoteActorRefProvider. This method takes a port number as its parameter, which can then be used to connect to this system from a remote system. The setMigratingActorRefProvider method results in initialization of a MigratingActorRefProvider. Because the MigratingActorRefProvider uses the RemoteActorRefProvider as its base, this method also takes a port number as its parameter. A detailed explanation of the migration mechanics follows in Section 4.6.1.
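Taken together, the six builder calls described above form a fluent configuration chain. Since ReApper's classes are not publicly available, the sketch below uses a minimal stand-in builder that merely records the settings; only the method names quoted in the text (setName, setExecutor, setGroups, setReApperManager, setMigratingActorRefProvider, build) come from the thesis, everything else — including the fully qualified executor name — is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for ReApper's actor-system builder: it only records
// the settings so the call sequence can be shown; the real builder
// constructs and configures an ActorSystem.
class ActorSystemBuilderSketch {
    String name;
    String executorClass;
    List<String> groups;
    boolean managerEnabled;
    Integer remotePort;

    ActorSystemBuilderSketch setName(String n) { name = n; return this; }
    // takes the fully qualified name of the ExecutorService class
    ActorSystemBuilderSketch setExecutor(String fqcn) { executorClass = fqcn; return this; }
    ActorSystemBuilderSketch setGroups(String[] g) { groups = Arrays.asList(g); return this; }
    ActorSystemBuilderSketch setReApperManager() { managerEnabled = true; return this; }
    ActorSystemBuilderSketch setMigratingActorRefProvider(int port) { remotePort = port; return this; }

    // The real build() returns a ready-to-use ActorSystem; here it merely
    // checks that a name has been set.
    boolean build() { return name != null; }
}
```

The chain would then read, for example, new ActorSystemBuilderSketch().setName("demo").setGroups(new String[]{"io", "compute"}).setReApperManager().setMigratingActorRefProvider(2552).build().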

4.4 Configurator

The configurator serves as the system's central interface for configuration concerns. It contains and manages subcomponents which provide the actual configuration methods, and it delegates all configuration calls to those components. Figure 4.5 shows the configuration aspects and their realization as components.

Number of Threads The number of threads available for the execution of actors is configured through the ThreadConfigurator. This component holds information about executors and the number of threads available to each executor. Because actor groups use their own executors, the ThreadConfigurator holds references to all executors which have been initialized in a system. Each executor is configured individually and only affects the actor group to which the executor is assigned. The actual configuration is performed through customized ExecutorServices, which provide additional methods to get and set the number of available threads. As reconfiguration during runtime is not provided by the ExecutorServices of the standard Java library, the implementation of ReApper modifies these ExecutorServices to provide this additional functionality. This feature is available on all systems regardless of hardware or OS.


Thread Pinning The ThreadPinner provides methods to pin threads to processor cores, which is done by setting a thread's affinity so that it is executed on a certain processor core. As the system contains several thread pools, the ThreadPinner allows pinning of threads of any thread pool managed by an actor system. Threads or thread pools managed externally are not affected by the ThreadPinner. Threads can be pinned to cores individually by providing a mapping of an actor group's threads to processor ids. In addition, threads can be pinned to a number of cores instead of being pinned directly to specific processors. Other applications and the OS are not affected, because only platform and application threads are pinned. The ThreadPinner uses an OS tool called taskset in order to perform thread pinning. This tool is wrapped by a script, which is included in the ThreadPinner and called by it when thread pinning is performed. Thread pinning is usually available, as most OSes support it and there is no hardware limitation.

Active Processor Cores Activity of processor cores is managed by the CoreController. It allows activation and deactivation of processor cores and manages information about active and inactive CPU cores. The files in the sysfs which are used to control the CPUs are manipulated in order to adjust the number of cores. The control file is provided by the Linux kernel and is usually used for CPU hot-plugging. As this is done on a system-wide level, other applications and the OS are also affected by this measure. This results in more load for the remaining processor cores, produced not only by the actor application but also by other processes running on the system. CPU hot-plugging during runtime is not always available on other OSes, which can restrict usage of this configuration option.

Power Limit Setting a power cap on the processor reduces its power consumption to a certain value. Power capping is performed by the PowerCapper. It provides methods to set and get the processor's current power limit. The limit applies to the whole processor and thus impacts the performance of other applications and the OS, too. The PowerCapper uses Intel's RAPL interface, which is accessed through a native library. The native library is implemented separately from the platform and is loaded when the platform and application are started. RAPL is currently only available on Intel processors, but other hardware manufacturers are starting to adopt this technology, so this feature will probably become more accessible on other hardware.

4.4.1 Thread Configurator

Thread configuration is provided by the ThreadConfigurator class. In order to enable this configuration option, a customized executor needs to be set before the actor system is created. The executor is then initialized by Akka during creation of the actor system. Akka does not directly use the executor's class for initialization, but requires a configurator object of the ExecutorServiceConfigurator type. All classes involved in executor initialization are illustrated in Figure 4.6. This configurator is used to initialize an ExecutorServiceFactory, which provides the createExecutorService method. This method returns a concrete ExecutorService. The factory requires an identifier parameter which is used for all executor services created with this factory. Usually, these factories are used to create an executor service only once. This results in one factory per actor group, as each actor group uses its own executor. Additionally, the ReApperExecutorServiceFactory is used as the base class for the platform's custom executor-service factories. The ReApperExecutorServiceFactory manages all derived executor instances in a static map. This map is accessed by the ThreadConfigurator to adjust the number of threads of a certain executor or actor group.


Figure 4.5 – Configurator architecture. The ReApperActorSystem holds the Configurator, which delegates to four components: the ThreadConfigurator (thread number, realized via the ExecutorService), the ThreadPinner (thread affinity, realized via taskset), the CoreController (active cores, realized via sysfs files), and the PowerCapper (power cap, realized via the RAPL library).

Each type of executor has its own ExecutorServiceConfigurator and ExecutorServiceFactory. The ReApper platform implements two types of executors, a ForkJoinPool and a ThreadPoolExecutor. The ReApperForkJoinExecutor is created with the ReApperForkJoinExecutorConfigurator, and the ReApperThreadPoolExecutor is created with the ReApperThreadExecutorConfigurator. The ReApperForkJoinExecutor contains an instance of a customized ForkJoinPool, as some modifications had to be made to the ForkJoinPool in order to allow resizing during runtime. In contrast to that, the ReApperThreadPoolExecutor limits itself to extending Java's default ThreadPoolExecutor. However, both executor services implement the ReApperExecutorService interface, which provides methods for getting and setting the executor's number of threads, as well as a method to retrieve the common thread name used by the executor's threads. Getting and setting the number of threads of an executor is used by the ThreadConfigurator to resize thread pools. The common thread name is used by the ThreadPinner to pin threads. Two types of executors were implemented because each of them serves a different purpose and has different properties. This makes one of them superior in certain scenarios, while the other excels at the execution of other kinds of applications.

ForkJoinExecutor The ForkJoinExecutor uses a customized ForkJoinPool to execute threads. An instance of a fork–join pool is managed by the ForkJoinExecutor. The ForkJoinPool is a more recent executor service and was added to Java's runtime library in 2011. Its distinguishing feature is work-stealing [53]. Threads in a ForkJoinPool attempt to find and execute tasks submitted to the pool, but they will also execute tasks created by other active tasks. Furthermore, the ForkJoinPool is suited for event-style tasks as well. Tasks in Akka are also created in an event style, as a received message triggers processing of said message and creates a task in order to do that. This makes the ForkJoinPool a fitting executor for usage in Akka. The setMaximumPoolSize method provides the interface which is used by the ThreadConfigurator to readjust the number of threads. When the number of threads is changed, the executor's current pool is shut down and all incomplete tasks, which have been submitted


Figure 4.6 – Executor initialization. The figure shows three layers of classes and their relations:

ExecutorServiceConfigurator – used to create executor factories; contains the Akka configuration. Extended by the ReApperThreadExecutorConfigurator and the ReApperForkJoinExecutorConfigurator.

ExecutorServiceFactory – used to create executor services; contains the identifier used for executor-service creation. Extended by the ReApperExecutorServiceFactory, which manages all executor instances and serves as the base for the custom executor-service factories ReApperThreadExecutorFactory and ReApperForkJoinExecutorFactory. The configurators create these factories, and the factories produce the executor services.

ExecutorService – provides and manages threads for actor execution; a separate instance exists for each actor group. The ReApperExecutorService interface adds methods for setting and getting the thread number and a get method for executor-thread names; it is implemented by the ReApperThreadPoolExecutor and the ReApperForkJoinExecutor.

to that pool are saved in a temporary list. Then a new ForkJoinPool with adjusted capacity is created, which is fed all previously drained tasks. Thus, thread reconfiguration should not be performed too frequently, as it can affect executor performance severely. Aside from the pool, the ReApperForkJoinThreadFactory is used to ensure that the maximum number of threads is always respected. It manages all active and available thread identifiers. This is used to determine whether new threads can be created or not. The thread factory assigns a thread identifier from a pool of available identifiers to newly created threads. When the pool of available identifiers is empty, no additional threads can be created. Upon termination, a thread notifies the factory and returns its identifier to the pool. This notification mechanism is implemented in ReApperForkJoinWorkerThread. This type of thread is also created by the ReApperForkJoinThreadFactory when new threads are requested by the ForkJoinPool. The ForkJoinExecutor generally provides higher performance than the ThreadPoolExecutor, but reconfiguration is more expensive.
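The drain-and-recreate mechanism can be sketched with the standard library. Note that the stock ForkJoinPool's shutdownNow() returns an empty list rather than the pending tasks, which is one reason ReApper customizes the pool; this illustrative sketch (class and field names are not ReApper's) therefore tracks submitted-but-unfinished tasks itself so they can be re-fed to the new pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Simplified illustration of resize-by-recreation. Tasks must tolerate
// being re-run if a resize interrupts them mid-execution.
class ResizableForkJoin {
    private ForkJoinPool pool = new ForkJoinPool(4);
    private final Queue<Runnable> inFlight = new ConcurrentLinkedQueue<>();

    void execute(Runnable task) {
        inFlight.add(task);
        pool.execute(() -> {
            try { task.run(); } finally { inFlight.remove(task); }
        });
    }

    // Shut the old pool down, create a new pool with the adjusted capacity,
    // and re-feed every task that has not completed yet.
    synchronized void setMaximumPoolSize(int threads) {
        pool.shutdownNow();
        try { pool.awaitTermination(1, TimeUnit.SECONDS); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        List<Runnable> pending = new ArrayList<>(inFlight);
        inFlight.clear();
        pool = new ForkJoinPool(threads);
        for (Runnable task : pending) execute(task);
    }

    int parallelism() { return pool.getParallelism(); }
}
```

The cost of tearing down and rebuilding the pool is exactly why the text advises against frequent reconfiguration.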

ThreadPoolExecutor The ThreadPoolExecutor has been part of Java's default library since 2004 [54]. It works by collecting all submitted tasks in a queue that is polled by a pool of threads which execute the polled tasks. The ThreadPoolExecutor is created with a corePoolSize and a maximumPoolSize. The corePoolSize determines how many threads are kept in the pool during normal operation. The maximumPoolSize determines the maximum number of threads which can be used by the pool when the task queue is full. The ReApperThreadExecutorService extends Java's default implementation of the ThreadPoolExecutor. Fortunately, dynamically increasing the size of the thread pool during runtime is supported by the ThreadPoolExecutor. However, shrinking the thread pool during runtime in a swift manner is not implemented. The ReApperThreadExecutorService modifies thread creation and thread management in order to enable thread-pool shrinking. Similarly to the ReApperForkJoinExecutor, the ReApperThreadExecutorService also employs a mechanism to manage thread identifiers. Different sets of identifiers are saved and used during resizing to determine which threads are supposed to be stopped and which threads can continue their execution. Growing the thread pool is done by calling setCorePoolSize with a higher number of threads as parameter. This call is delegated to Java's default ThreadPoolExecutor, which performs the resizing. Shrinking the thread pool is performed by the same method and handled by the custom executor itself.


This is done by throwing a custom exception after a running thread has finished its task and has been selected for termination. In order to handle this custom exception, threads created by the ReApperThreadExecutorService use a special exception handler which terminates the thread and ends exception handling at that point.
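Growing a stock ThreadPoolExecutor at runtime works out of the box, as the text notes; only prompt shrinking needs the custom mechanism. A minimal demonstration of the supported path (the class name GrowPoolDemo is illustrative; the executor API is the standard JDK one):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class GrowPoolDemo {
    // Grow a stock ThreadPoolExecutor at runtime. Shrinking via a smaller
    // setCorePoolSize only terminates excess workers once they become idle —
    // the slow path that ReApper's custom executor works around.
    static ThreadPoolExecutor grow() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.setMaximumPoolSize(8);  // raise the upper bound first
        pool.setCorePoolSize(6);     // then the core size; workers are added as tasks arrive
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = grow();
        System.out.println(pool.getCorePoolSize() + "/" + pool.getMaximumPoolSize()); // 6/8
        pool.shutdown();
    }
}
```

Raising the maximum before the core size matters, since recent JDKs reject a core size above the current maximum.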

4.4.2 Thread Pinner

Thread pinning is performed by setting a thread's affinity so that it is executed on certain processor cores. This causes the OS's scheduler to execute a thread on a designated core or set of cores. In doing so, an application's performance can be improved, as threads are moved to other processor cores less often. This results in fewer memory-copy operations, which can become a time- and resource-intensive process when threads are moved frequently within a short time. Although the OS's scheduler tries to minimize this as much as possible, manually pinning threads to cores can yield higher performance. The reason is that it is possible to determine which threads are related and should be executed together. This information is usually not available to the OS, but it is available on the platform and application levels.

The ThreadPinner class provides thread-pinning capabilities to the ReApper platform. It offers two different methods for thread pinning: pinThreadToId and pinThreadToNum.

Pinning with Identifiers The pinThreadToId method pins a thread to a certain core which is identified by a number. For instance, the identifiers of the processor cores of a four-core processor are 0, 1, 2, and 3. Usually, pinning to identifiers is used in conjunction with a mapping of executor threads to cores. Another use for this method is pinning all threads belonging to an executor to a certain core or group of cores. This can be used to execute threads of different actor groups separately, which can result in a lower number of context switches. Additionally, this can also be used to reserve one or more cores for exclusive use by an executor or actor group.

Pinning to Number of CPUs The pinThreadToNum method is used to pin threads in a coarser way. Instead of pinning threads to a certain identifier, threads are pinned to a number of CPUs. This method expects a number and a group identifier as its parameters. The group identifier is used to identify the threads which should get pinned by the ThreadPinner. The number tells the ThreadPinner the number of processors to which the threads will be pinned. This method of thread pinning is useful for applications which do not use multiple actor groups or when a more fine-grained assignment is not required. These applications can simply designate a number of desired processors, and their threads are then pinned evenly across that number of CPUs. Apart from pinning, the ThreadPinner also keeps a record of the number of cores to which an executor is pinned. This information can be retrieved with getNumCpu, which takes an actor-group identifier as parameter, or getAllNumCpu, which returns a map containing group identifiers with their corresponding number of CPUs.

External Script and Tool The ThreadPinner calls an external script in order to actually pin threads, as thread affinity, and thus thread pinning, is managed by the OS. The script executes a tool called taskset, which is part of most Linux distributions. Taskset is used to set and retrieve CPU affinities of running processes. It uses process identifiers (PIDs) to pin the corresponding thread to a list of CPUs. This list may also contain only one entry, which indicates that the process should be pinned to that single CPU. Even though PIDs are used, threads can also be pinned with this tool, because Linux assigns PIDs to (Java) threads as well. As information about PIDs is required to run taskset, a script is used to gather this information before taskset


is called. The PID of the JVM in which the platform is running can be retrieved inside the ThreadPinner. This PID is then given to the script, which uses jstack in order to find all threads belonging to the JVM associated with this PID. In addition to that, the names of an executor's threads are also given to the script. This information is used to filter the threads according to thread names, as many threads are part of the JVM process, but only some of them are of interest. The ThreadPinner gets this information from the ThreadConfigurator. This is also where the common thread name of the ReApperExecutorService interface is used to identify all threads belonging to an executor. The script's last step is to pin each thread PID according to the parameters given to the script.
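The pinning step itself boils down to invoking taskset once per thread PID. A sketch of how such a command line could be assembled from Java (ReApper's script is not public, so the class and method names here are hypothetical stand-ins; the taskset flags are the standard ones, where -c takes a CPU list and -p targets an existing PID):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TasksetSketch {
    // Builds e.g. "taskset -cp 0,2 12345": pin the thread/process with
    // PID 12345 to CPUs 0 and 2.
    static String buildTasksetCommand(long pid, List<Integer> cpus) {
        String cpuList = cpus.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        return "taskset -cp " + cpuList + " " + pid;
    }

    public static void main(String[] args) {
        String cmd = buildTasksetCommand(12345L, List.of(0, 2));
        System.out.println(cmd); // taskset -cp 0,2 12345
        // Actually applying it (Linux only) would be something like:
        // new ProcessBuilder(cmd.split(" ")).inheritIO().start().waitFor();
    }
}
```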

4.4.3 Core Controller

The core controller is used to activate and deactivate processors. Deactivating a processor prevents code execution from being performed on that CPU, which in turn lowers the power consumption of that processor or part of that processor. The implementation of the ReApper platform uses a feature of the Linux kernel called CPU hotplug [55]. CPU hotplug was designed for assigning processors to VMs, which removes the CPUs from the host system, and for physically adding or replacing CPUs during operation. This was used, for instance, to replace faulty processors without having to power down the machine. Another use was to add CPUs to a multi-socket system. As server processors are very high-priced, systems were often not fully fitted with processors. However, the performance of those systems could be improved by adding processors in case more resources were required. This necessitated CPU-hotplug support by the OS, and the feature was thus added to the Linux kernel. Originally, CPU hotplug was created for use with single-core processors. However, it also supports multi-core processors, as each processor core is treated as a CPU in Linux.

Usage of CPU Hotplug ReApper uses CPU hotplug mostly for managing CPU cores on a single processor rather than for systems with multiple processors. Thus, this component is called core controller, despite the fact that the original feature supports both CPU and core management. Unfortunately, disabling individual cores only lowers power consumption by a moderate amount. It puts the deactivated core in a quasi-permanent idle state, but all other parts aside from the deactivated core still need to be powered. A deactivated core's share of a processor's dynamic power consumption is thereby saved, but processor components which are shared between processor cores still cause most of the processor's static power consumption. Disabling a whole processor can have more impact on a machine's power consumption, as this allows the system to cut the power supply to that processor [42]. This removes almost all of that processor's static and dynamic power consumption from the machine's total energy consumption.

Interface The interface for configuring the number of active processors is provided by the CoreController class. It offers methods to activate and deactivate CPUs based on their identifier. The method for activating processors is called activateCore and requires a CPU identifier as parameter. Deactivating CPUs is offered with the deactivateCore method, which also requires a CPU identifier as parameter. These identifiers are the same as the CPU identifiers used by the ThreadPinner and taskset for thread pinning.

OS Tool The CoreController uses Linux-kernel functions for CPU hotplugging in order to manipulate the number of active CPUs. The Linux kernel provides a control file for each CPU. In the file system, they are usually placed at /sys/devices/system/cpu/cpu{id}/online. Each of those files contains either the value 1 or the value 0. If a control file contains 1, the CPU


belonging to this file is active and can be used to execute code. A CPU is inactive when its control file contains the value 0. The CoreController holds references to all CPU online files and manipulates those files to activate and deactivate CPUs. Because these CPU online files are usually protected by the OS, the platform and application need permission to write to those files.

Information about Active Cores In addition to controlling cores, the CoreController also provides methods to check the current status of cores. The getActiveCores method returns a set containing the identifiers of active CPUs and the getInactiveCore method returns a set with the identifiers of disabled CPUs. This can be used to check for active processors before using the ThreadPinner to pin threads to a CPU. However, not all OSes have the ability to dynamically disable and reactivate processors or cores during runtime. That means that core controlling is dependent on certain OS functions and is only supported when the OS possesses a feature similar to the CPU hotplugging of the Linux kernel.
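The sysfs mechanism can be sketched directly: writing 1 or 0 to a core's online file toggles it. In this illustrative sketch (class and method names mirror the text but are not ReApper's code) the base path is a constructor parameter so the logic can be exercised without root rights; on a real system it would be /sys/devices/system/cpu.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CoreControlSketch {
    private final Path cpuBase;  // "/sys/devices/system/cpu" on a real system

    CoreControlSketch(Path cpuBase) { this.cpuBase = cpuBase; }

    private Path onlineFile(int cpuId) {
        return cpuBase.resolve("cpu" + cpuId).resolve("online");
    }

    // Writing "1"/"0" to the online control file enables/disables the core;
    // on a real system this requires write permission, normally root.
    void activateCore(int cpuId) throws IOException {
        Files.writeString(onlineFile(cpuId), "1");
    }

    void deactivateCore(int cpuId) throws IOException {
        Files.writeString(onlineFile(cpuId), "0");
    }

    boolean isActive(int cpuId) throws IOException {
        return Files.readString(onlineFile(cpuId)).trim().equals("1");
    }
}
```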

4.4.4 Power Capper

The power capper allows the platform to limit a processor's power consumption. This method forces the CPU to lower its performance and power consumption in order to comply with a limit which can be set by the user. It is the most direct method to lower a system's power consumption and yields the highest potential for energy saving. Intel's RAPL interface [56] is used to set power limits and thereby lower energy consumption. Unfortunately, this feature is currently found predominantly on Intel processors, and CPUs of other manufacturers usually do not support a mechanism with similar features or granularity. However, this technology is starting to be adopted by other hardware manufacturers, and this restriction might diminish with future CPU generations. Other methods like DVFS can be used in a similar manner, but they do not allow the same level of precision. This

Figure 4.7 – RAPL domains (based on [56]). The figure depicts two processor packages (Package 0 and Package 1), each containing four cores (Core 0–Core 3), last-level caches, a memory controller, and a graphics unit, with memory DIMMs attached to each package. The highlighted domains are: the package power plane (PKG0 and PKG1); the core power plane covering all cores on a package (PP0 on PKG0 and PP0 on PKG1); the graphics power plane, client only (PP1 on PKG0 and PP1 on PKG1); and the DRAM power plane, server only (DRAM).


whole process of power limiting is also called power capping in this work, as a cap is placed on the processor's power consumption; hence this component is called power capper.

Interface Power limiting, or capping, in the implementation of ReApper is provided by the PowerCapper class. It offers methods to set and get the processor's current power limit. The getCurrentLimit method returns the currently set power limit as a watt value. The setCurrentLimit method can be used to set a new power limit on the processor. These methods delegate their calls to a native library, which manipulates special processor registers in order to get and set those limits. An architecture check is performed to ensure that the native library is only loaded when the platform and application are executed on compatible and supported hardware. The native library uses the RAPL interface to configure and read information from the processor.

RAPL Domains RAPL divides the processor into different domains. This division is illustrated in Figure 4.7. A processor made up of two packages is depicted, and each of those packages contains four cores numbered from 0 to 3. Additionally, each package has a memory controller and a graphics component. System memory is located outside of the processor, but the memory is directly connected to the processors. There are four different domains supported by RAPL. The package domain is framed in blue; it contains all components on the processor, including the processor cores, memory controller, and graphics component. As there are two packages in Figure 4.7, there are also two package domains – package domain 0, abbreviated as PKG0, and package domain 1, abbreviated as PKG1. The core power plane is another RAPL domain, and a yellow frame is used to mark the components included by this domain. This domain is also called the PP0 domain, and it only contains the cores of a processor. The third type of domain is the graphics power plane, which is also called the PP1 domain. It is tagged with a red frame. The PP1 domain only contains a processor's graphics component, in case the processor has one. This domain is only available on processors designated for client use, as server processors usually do not contain a graphics unit. The last type of domain is the DRAM power plane, also called the DRAM domain. Figure 4.7 uses a green frame to highlight this domain. It only includes system memory and does not take the processor into account at all. Each domain apart from the DRAM domain is separated by package. Consequently, Figure 4.7 contains two package domains, which each contain a PP0 and a PP1 domain. Thus, in order to access a processor's cores or graphics unit, both the domain and its package need to be specified. Power limits can be set for each of those domains independently. However, as the PP0 and PP1 domains are included in the PKG domain, all limits imposed on the PKG domain also affect its included domains.

Native Library jRAPL The library which provides access to RAPL uses native methods. Native methods are defined and available in Java, but are implemented in a more machine-oriented programming language. This capability is a feature of the Java language and is called the Java Native Interface [58]. It allows access to native code from inside the JVM. Because RAPL cannot be accessed in the JVM, the implementation of the PowerCapper was split into two parts. The RunningAveragePowerLimit class provides access to RAPL functions in the JVM by using the jRAPL library [42], which implements how RAPL is actually accessed. Several methods are implemented to get and set power limits. The RAPL class allows getting and setting the power limits of the PKG and PP0 domains separately. Access to the power limits of the other domains was not implemented, because their support is not guaranteed on all machines. However, access to those limits can be added with a few additional methods, as the native


Figure 4.8 – Machine-specific registers for RAPL power limiting (based on [57]). The top of the figure shows the 64-bit register for the PKG domain: PKG power limit #1 (bits 0–14), enable limit #1 (bit 15), PKG clamping limit #1 (bit 16), time window for power limit #1 (bits 17–23), PKG power limit #2 (bits 32–46), enable limit #2 (bit 47), PKG clamping limit #2 (bit 48), time window for power limit #2 (bits 49–55), and lock (bit 63). The bottom shows the register for the other RAPL domains, which contains only a single set of these fields.

library contains functions to manipulate those limits as well. The native library is written in C++ and it uses Intel's RAPL library, which is provided on their website [56]. The power limits are manipulated by accessing a special type of CPU register, called a machine-specific register (MSR).

Machine-specific Registers for RAPL The structure of the MSR for power limiting is illustrated in Figure 4.8. RAPL differentiates between power limiting for the package domain, which is shown at the top of the figure, and power limiting for all other domains, which is shown at the bottom of Figure 4.8. This register has a length of 64 bits. The partition of those bits depends on the targeted domain and is described in [57]. The partition of the MSR for the PKG domain is summarized in Table 4.1.

The partition of the MSR for the PP0, PP1, and DRAM domains is similar to the partition for the PKG domain, but the second limit is omitted for those domains.

Intel's RAPL library provides functions to read from and write to those registers. In addition, higher-level functions are provided, which abstract the low-level access of getting

Bits    Function                          Explanation
0-14    Package Power Limit #1            Sets the average power limit of the package domain (corresponding to time window #1)
15      Enable Limit #1                   Power limit #1: active (1) and inactive (0)
16      Package Clamping Limitation #1    Enables going below the processor state setting of the OS during time window #1
17-23   Time Window for Power Limit #1    Time window for power limit #1
32-46   Package Power Limit #2            Sets the average power limit of the package domain (corresponding to time window #2)
47      Enable Limit #2                   Power limit #2: active (1) and inactive (0)
48      Package Clamping Limitation #2    Enables going below the processor state setting of the OS during time window #2
49-55   Time Window for Power Limit #2    Time window for power limit #2
63      Lock                              Restricts access to this register to read-only

Table 4.1 – Explanation of MSR in Figure 4.8 [57]
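As an illustration of the layout in Table 4.1, the following sketch decodes the fields of the first power limit and the lock bit from a raw 64-bit register value. The class and method names are hypothetical; only the bit offsets follow the table.

```java
// Illustrative decoder for the PKG power-limit MSR layout of Table 4.1.
// Class and method names are hypothetical; the bit offsets follow the table.
public class PkgPowerLimitMsr {

    // Bits 0-14: package power limit #1 (raw units; scaling is CPU-specific)
    public static long powerLimit1(long msr) {
        return msr & 0x7FFF;
    }

    // Bit 15: enable flag for power limit #1
    public static boolean enable1(long msr) {
        return ((msr >>> 15) & 1) == 1;
    }

    // Bit 16: clamping limitation #1
    public static boolean clamping1(long msr) {
        return ((msr >>> 16) & 1) == 1;
    }

    // Bits 17-23: time window for power limit #1 (encoded value)
    public static long timeWindow1(long msr) {
        return (msr >>> 17) & 0x7F;
    }

    // Bits 32-46: package power limit #2
    public static long powerLimit2(long msr) {
        return (msr >>> 32) & 0x7FFF;
    }

    // Bit 63: lock flag; if set, the register is read-only
    public static boolean locked(long msr) {
        return (msr >>> 63) == 1;
    }
}
```

Decoding the second limit mirrors the first, shifted by 32 bits, which is why the PP0, PP1, and DRAM registers can simply drop the upper half.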



and setting power limits. These functions are used by jRAPL instead of manipulating CPU registers directly. Before the jRAPL library can be used, it must first be initialized, as the native implementation requires initialization. Likewise, the library has to be shut down properly, because memory used by the library has to be deallocated and file handles and devices need to be cleaned up. This is also reflected in the Java interface, which is implemented as a singleton to prevent concurrent access to RAPL, as concurrent access could lead to incorrect or faulty behavior. After a power limit is written to the register, it takes effect almost immediately. However, the power limit cannot be set arbitrarily low or high, as there is an internal bound on the power limit. This internal bound depends on the processor model and is not documented. Consequently, it can only be found experimentally by testing different power limits with an external measurement device.
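The lifecycle described above (mandatory initialization, exclusive access through a singleton, and an explicit shutdown releasing native resources) can be sketched as follows. This is a minimal illustration of the pattern, not the actual jRAPL interface; all names are assumptions.

```java
// Minimal sketch of a singleton wrapper around a native RAPL library.
// All names are illustrative; only the lifecycle pattern follows the text.
public final class RaplSingleton {
    private static RaplSingleton instance;
    private boolean initialized;

    private RaplSingleton() { }

    // Single access point; synchronized to prevent concurrent construction.
    public static synchronized RaplSingleton getInstance() {
        if (instance == null) {
            instance = new RaplSingleton();
            instance.init();
        }
        return instance;
    }

    // Stands in for the native initialization (opening MSR devices etc.).
    private void init() {
        initialized = true;
    }

    // Writing a power limit requires prior initialization.
    public synchronized void setPowerLimit(double watts) {
        if (!initialized) throw new IllegalStateException("RAPL not initialized");
        // the native call writing the MSR would go here
    }

    // In the real library this deallocates memory and closes file handles.
    public synchronized void shutdown() {
        initialized = false;
    }
}
```

The synchronized accessor and mutators model the requirement that concurrent access to RAPL must be prevented.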

In contrast to the other presented configuration methods, power capping is less dependent on the OS. However, in order for the power capper to use RAPL, the OS must grant it access rights to the MSRs. Under Linux, this is done by granting read and write permissions for /dev/cpu/*/msr and the Linux capability CAP_SYS_RAWIO to the JVM. This capability allows the platform to open the devices that are needed to access the MSRs. However, this method depends heavily on the employed hardware, as functionality similar to that provided by RAPL is often not available on other types of processors. Nonetheless, RAPL and power capping provide a powerful tool for controlling the power consumption of the CPU, one which takes effect quickly and provides more fine-grained configuration than methods like DVFS.

The configurator and its four components provide a wide variety of configuration options in a system. However, finding an energy-proportional or performance-oriented configuration requires information about the system's state. This information is provided by the gatherer, which is introduced in the next section.

4.5 Gatherer

The gatherer combines different components and modules used to retrieve information about different aspects of a system. The information collected by the gatherer serves as the basis for deciding on the system's configuration. It is gathered by monitoring information sources from different system levels. These pieces of information are then combined, and based on them the configuration is adjusted for different purposes or goals. Afterwards, information monitoring is continued in order to ensure that the new configuration achieves the purposes and goals which were set earlier. It is also used to check whether the constraints, which can be specified alongside the goals, are satisfied. Furthermore, the collected information can be utilized to improve the application by recognizing disadvantageous configurations and avoiding them altogether. The gatherer uses submodules to provide a unified interface, which is used by the platform to collect information about the system state. It offers information about performance, energy and power, system and process load, and application metrics. The gatherer is implemented in the ReApperStatGatherer class, which manages and initializes the information sources. Each source provides a different piece of information, which the ReApperStatGatherer passes on to other parts of the platform. There are four different types of information sources, and each type with its respective method of information collection is illustrated in Figure 4.9.

Performance Information This type of information shows a system's performance on the lowest level in a system, that is, the hardware level. It reads performance information from the


Figure 4.9 – Gatherer architecture (the gatherer inside the ReApperActorSystem combines four information sources: performance information from the performance counter monitor (PCM) via the PCM library, energy/power information from Running Average Power Limit (RAPL) via the RAPL library, system/process load from the operating system via the MXBean interface, and application metrics (QoS) provided by the application)

processor and reflects how much work it has performed. This information contains the number of instructions executed by the processor since the system was started. Aside from that, it also contains the number of instructions executed per clock cycle (instructions per clock, IPC). As the processor is shared by all running applications and the OS, this information's scope is system-wide.

Energy and Power Information Energy and power measurements are collected in order to get information about a system's current and past energy and power consumption. One method uses the RAPL interface to gather data about the energy and power consumption of the processor. This information is divided into different domains, which gives more detailed insight into the distribution of the processor's energy consumption. In addition to the energy information of the processor, an external power measurement device is used to collect data about a whole system's power consumption and, derived from that, its energy consumption.

System and Process Load This type of information reflects a system's load on the OS and the JVM process level. Load information is useful to show the share of utilized resources out of all available resources provided by the OS. This can be used to recognize system overload or resource limits. Apart from that, it can also be used to recognize performance bottlenecks caused by the application, namely when an application's performance remains static even though more resources are available. The load information of the JVM process can serve as an indicator for interference of other applications with the JVM's performance. This would be the case when the system load is high even though the application's process load is low. In this case, the application's performance is hampered by another process.
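The interference indicator described above amounts to a simple check on the two load values. The following sketch is hypothetical; the thresholds are freely chosen for illustration.

```java
// Hypothetical heuristic: high system load combined with low process load
// suggests that another process is competing with the JVM for the CPU.
public class LoadHeuristics {
    // Loads are fractions between 0.0 and 1.0; thresholds are illustrative.
    public static boolean interferenceSuspected(double systemLoad, double processLoad) {
        return systemLoad > 0.8 && processLoad < 0.2;
    }
}
```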

Application Metrics Information about an application's performance and other QoS properties is called application metrics. This information is directly provided by the application and thus requires the cooperation of application developers. However, there is no universal form or measure


of performance, because the work performed by applications can differ significantly from application to application. Finding a reasonable measure is the responsibility of the application developer. Metrics other than the application's performance can also be specified; these metrics are then used as additional constraints on the system configuration.

4.5.1 Performance Information

Performance information is provided at a very low level, close to the CPU. It is recorded by the processor and can be read from special processor registers. This feature is only available on Intel processors and is called performance counter monitor (PCM) [59]. It works similarly to RAPL, as both manipulate MSRs. Intel's PCM provides a large range of different performance values, but two values are particularly interesting for the ReApper platform: the number of executed instructions and the number of instructions executed during a clock cycle. The other PCM values are not considered at present, although data about memory might be useful for future use. The information collected from the PCM shows how much work is performed by the processor in terms of raw instructions, disregarding applications or the OS. This gives a view into the fundamental performance of a system.

Information-collection Method Because data is read from CPU registers, the module for PCM is also split into a Java interface and a native library. This is similar to the implementation of the module for RAPL. The Java class which is used to access the PCM is called JPCMStatSource. It retrieves data about executed instructions and instructions per clock (IPC) from the processor and provides access to this data through its public methods.

Interface This data can be retrieved in two different ways: either from a measurement over time, or by reading the current performance values directly from the registers. The measurement over a certain time allows the system to capture the performance values generated during the measurement, while the direct method only yields data about the current state of the PCM values. The getCurrentInstructions and getCurrentIPC methods read the current performance values and return them directly. The measurement approach uses two different methods in order to signal the start and end of a measurement. A measurement is started with the firstMeasurement method and ended with the secondMeasurement method. The finished measurement returns the number of instructions executed between the start and end of the measurement instead of the number of instructions since the system was started. In addition to that, the average IPC during the measurement is also returned.
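The measurement-window pattern behind firstMeasurement and secondMeasurement can be sketched as follows. The counter read is simulated here; the class is a stand-in rather than the actual JPCMStatSource implementation.

```java
// Sketch of the measurement-window pattern described for JPCMStatSource.
// The counter source is simulated; names and result type are assumptions.
public class PcmMeasurementSketch {
    private long startInstructions;
    private long instructionCounter; // stand-in for a PCM register read

    // Simulates the monotonically growing hardware instruction counter.
    public void setCounter(long value) { instructionCounter = value; }

    // Marks the start of a measurement by remembering the current counter.
    public void firstMeasurement() {
        startInstructions = instructionCounter;
    }

    // Returns instructions executed since firstMeasurement, not since boot.
    public long secondMeasurement() {
        return instructionCounter - startInstructions;
    }
}
```

A caller thus sees only the instructions retired inside the window, matching the interface described above.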

Native Library for PCM The native library for PCM is called jPCM and uses Intel's PCM implementation to access the MSRs for PCM. Intel's implementation includes functions to read from and write to MSRs, but it also provides higher-level functions to specifically read certain performance values. The jPCM library exposes Intel's higher-level functions to the Java interface, and JPCMStatSource uses those functions to collect performance information from the CPU. This information can be obtained by the platform and used to inform the decision on the system configuration.

Intel's PCM is very dependent on hardware features, but it does not require a lot of support from the OS. This is similar to the constraints and requirements of RAPL. However, the information provided by this module gives a detailed view of the processor's performance and thus forms a solid basis of information. These performance values add a low-level perspective on the system's performance, which can be used to estimate the system's state in more detail.


4.5.2 Energy and Power Information

Information about energy and power consumption is another important source of information for the platform. It is used to put data about performance in relation to actual energy and power consumption. This allows the platform to connect levels of energy consumption to performance levels. With this correlation, configurations can be selected which satisfy performance requirements while using as little energy as possible to achieve that performance.

Information-collection Methods Energy and power information is collected with two different methods. Information about the energy and power consumption of the processor is gathered with Intel's RAPL interface. These energy values can be further divided into individual values for different CPU domains. The other type of energy information is gathered by an external measurement device, which provides information about the total energy and power consumption of a machine.

Reading Energy Information from Registers RAPL energy values are accessed through the RAPL class, which is also utilized by the PowerCapper to set CPU power limits. Energy values can be collected from the CPU with two different methods. The first method retrieves the current energy values and returns a set of values, one per domain except for the DRAM domain. The unit used for these energy values is the joule. These values reflect only the processor's current energy count. In order to get energy-consumption information for a certain period of time, the second method for retrieving energy values can be used. This method requires setting a start point and an end point for a measurement. The startMeasurement method is used to mark the beginning of a measurement, and its end is marked with endMeasurement. Then, the data about energy consumption between start and end is returned by the jRAPL library. After that, this information is passed to the platform by the RAPL class.

Energy values are collected with RAPL by reading the MSRs. The partition of the MSR for energy values is shown in Figure 4.10. The register is 64 bits long, and the first 32 bits contain the energy value. The remaining bits are reserved. Because the energy value is a 32-bit value, it can wrap around depending on the rate of energy consumption. For this reason, it is recommended to retrieve RAPL energy values with the measurement method in order to avoid misinterpreting the energy count, as the processor's counter resets after exceeding the maximum value.
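When raw counter values are read directly, a single wraparound can still be compensated for when computing the difference between two readings. The following sketch illustrates this; it is not part of the jRAPL library.

```java
// Computes the difference between two readings of a 32-bit RAPL energy
// counter, accounting for at most one wraparound between the readings.
public class RaplCounter {
    private static final long WRAP = 1L << 32; // counter range of 32 bits

    // Readings are unsigned 32-bit values stored in a long.
    public static long delta(long before, long after) {
        return after >= before ? after - before : after + WRAP - before;
    }
}
```

If more than one wraparound can occur between readings, the delta is ambiguous, which is why the measurement method with sufficiently short windows is preferred.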

External Energy Measurement Device The other source of energy information reflects the energy consumption of the whole system. This also includes peripheral devices like hard-disk drives or add-on cards, which cannot be captured with RAPL. An external measurement device is used to capture this information. It uses an MCP39F501 IC [60] produced by Microchip Technology. The purpose of this IC is to measure the input power of AC/DC power supplies. The IC is soldered onto a sample board, which is illustrated in Figure 4.11.

Figure 4.10 – Machine-specific register for RAPL energy status (used by all domains): bits 0-31 contain the total energy consumed, bits 32-63 are reserved (based on [57])


Figure 4.11 – Power measuring device using the MCP39F501 IC

This board has two kettle plugs at the bottom side and a USB port at the top. One kettle plug is used to connect the device to the power outlet, and the other connects the device to the power supply of a computer. The system communicates with the measuring device through its USB port. The measurement device is abstracted as a device file. Energy measurements are initiated by writing to that file, and the results can be received by reading from the device file. This process is wrapped in a native library, which takes care of the measurement process and provides a higher-level abstraction for the platform in Java.

This library is implemented in C/C++ and is called jMCP. It offers methods to get current power values, as well as methods for performing a measurement. The MCP39F501 only provides power values; energy values are derived from them. When a measurement is started, the power is measured periodically. The measured power values are then averaged, and this average is multiplied by the measurement time, which results in the energy consumption during the measurement time. Thus, the data provided by the measurement device contains a power value in watts, an energy value in joules, and the measurement period. Access to this library in Java is provided by the MixedSignalEnergyMeasurement class. It offers the same interface as the RAPL class. However, instead of returning energy values for different domains, the MixedSignalEnergyMeasurement class returns an energy value, a wattage value, and the measurement time.
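The derivation of energy from periodic power samples is simple arithmetic: the samples are averaged and the average is multiplied by the measurement time. A sketch of this computation (the class is illustrative, not part of jMCP):

```java
// Derives energy (joules) from periodic power samples (watts), as done
// for the MCP39F501 readings: energy = average power * measurement time.
public class PowerToEnergy {
    public static double energyJoules(double[] powerSamplesWatts, double seconds) {
        double sum = 0.0;
        for (double p : powerSamplesWatts) sum += p;
        double averageWatts = sum / powerSamplesWatts.length;
        return averageWatts * seconds; // W * s = J
    }
}
```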

Measuring the energy and power consumption of the processor is dependent on hardware support, but it yields accurate information about the processor's energy and power consumption. Furthermore, this information can even be separated by processor subcomponent. In addition, energy and power values are also available from the external measurement device. This method provides information on a system-wide level and is independent of support by hardware or software. Both methods combined give a detailed view of a system's energy consumption during the execution of an application and create a solid basis for monitoring and regulating energy consumption.


4.5.3 System and Process Information

Information about CPU usage is another helpful piece of information for assessing a system's performance and state. For the ReApper platform, two types of load information are of interest. The system load shows the amount of used processing resources and how many of the available resources are left unused. The other type is the process-load information, which displays how many resources are used by the process that runs the platform and the application. Together, both pieces of information provide indicators of the nature of the limitations experienced by the application. This can be used to determine whether stagnating performance is caused by insufficient resources, the application itself, or other applications.

The Java Runtime Environment provides an interface for load information. This interface is called OperatingSystemMXBean and can be accessed by the platform to collect current load values [61]. This bean provides the system's CPU load with the getSystemCpuLoad method and the JVM process' CPU load with the getProcessCpuLoad method. The system CPU load is returned as a value between 0.0 and 1.0, which reflects the whole system's recent CPU usage from 0 to 100 percent. The process CPU load is returned in the same representation, but instead of the system's CPU usage, it reflects the JVM process' load.
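Reading both load values takes only a few lines when the platform-specific com.sun.management variant of the bean is used, which declares the two methods named above:

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class LoadReader {
    public static void main(String[] args) {
        // The platform bean can be cast to the com.sun.management variant,
        // which exposes getSystemCpuLoad and getProcessCpuLoad.
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        // Both methods return a fraction in [0.0, 1.0],
        // or a negative value if the load is not yet available.
        double systemLoad = os.getSystemCpuLoad();
        double processLoad = os.getProcessCpuLoad();

        System.out.printf("system: %.2f, process: %.2f%n", systemLoad, processLoad);
    }
}
```

Note that the com.sun.management package is JDK-specific rather than part of the Java SE standard, which is a mild portability caveat.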

Because access to this information is abstracted by the Java Runtime Environment, it does not depend on any hardware or software features. This information adds another perspective on the system's performance and resources. Moreover, load information provides another piece of context, which can be used to estimate the system state and thus helps with decisions on configuration.

4.5.4 Application-level Information

The information at the application level is fundamental for the ReApper platform, as it serves as orientation when configuration is performed. It is used to estimate the workload currently applied to the application. This workload determines the selection of a set of configurations which are able to handle it. However, there is no default form or unit for measuring an application's performance. This measure is entirely dependent on the application and requires modeling by the application developer. Nonetheless, there are general examples of performance measures which are shared by different classes of applications. One such performance measure is throughput: the number or amount of work performed by the application during a certain period of time. Concrete examples of throughput are requests per minute or second, the number of work packages or steps performed per minute or second, and the number of concurrently served clients.

In addition to performance values, there are also other metrics which define a service's quality. These properties are called quality-of-service (QoS) properties. They can also be factored into a configuration, as they can significantly alter the user experience or the overall quality of an application. Examples of QoS properties are latency and error rate. Latency is the time an application takes to process a request and reply to the user. This property can severely affect a user's experience with a service or application such as computer games, making it a major topic of interest for application developers and researchers [62, 63]. The error rate measures how many requests or packets are dropped or lost. It can be seen as an intensified form of latency, as a dropped or lost request corresponds to a request with infinitely high latency. Consequences of a high error rate are timeouts and, in more extreme cases, abandonment of the application or service.


The ReApper platform cannot gather this information by itself and relies on the application developer to provide it. In order to make this information available to the platform, an information source needs to be implemented by the application developer. This information source should provide access to the current application performance represented as a numerical value. Additional information about QoS can be provided in an additional method. During application initialization, this information source should be registered with the platform in order to make it aware of the application information. The platform uses the provided information as orientation to achieve performance or energy goals. Additionally, QoS properties can be used to constrain the configuration in order to maintain a certain level of quality.
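As a hypothetical illustration (the interface, method names, and metrics are all assumptions, since the platform's actual registration API is not spelled out here), an application-provided information source might look like this:

```java
// Hypothetical application-provided information source. The interface is an
// assumption; only the idea follows the text: the application reports a
// numerical performance value plus optional QoS values.
interface ApplicationStatSource {
    double currentPerformance();            // e.g. requests per second
    java.util.Map<String, Double> qos();    // e.g. latency, error rate
}

public class WebShopStats implements ApplicationStatSource {
    private volatile long requestsInLastSecond;

    // Called by the application whenever a request is served.
    public void countRequest() { requestsInLastSecond++; }

    @Override
    public double currentPerformance() { return requestsInLastSecond; }

    @Override
    public java.util.Map<String, Double> qos() {
        // Illustrative fixed values; a real source would measure these.
        return java.util.Map.of("latencyMs", 12.0, "errorRate", 0.01);
    }
}
```

During initialization the application would register such a source with the platform, which then treats its performance value as the quantity to keep above the configured goal.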

Because of the wide range of different applications, there are many different measures for an application's performance. This requires additional work by the application developer, but the platform cannot work without a measure of application performance, and for that reason the platform has to be intrusive on the application code and design here. However, configuration and information monitoring on one system is very limited, especially as many applications nowadays require scaling and distribution on a massive scale. A method to configure distributed systems is presented in the next section. Combined with the concepts and methods of the configurator and gatherer, it can be exploited to facilitate energy saving in a distributed actor application.

4.6 Distributed Actor Platform on Heterogeneous Hardware

Distribution can be used to raise the limits on performance by adding machines and dividing processing and computing among those additional machines. It serves as another option for configuration on a level beyond that of individual machines, which has been described in the previous sections. Utilizing distribution also allows dynamic scaling of applications. When high workloads are applied to the application, additional machines are used to handle the workload. These added machines are removed again when the workload decreases. This allows the application to provide enough resources when the requirements increase and release them again when they are no longer needed.

This process is similar to matching a machine's resources to a certain workload with the help of configuration on different system levels within a single machine. However, scaling with distribution is, contrary to the previous approach on a single machine, a more complex process, as its distributed nature adds new issues and aspects. Distribution requires communication over a network, and data cannot be shared as easily as in applications without distribution. Furthermore, communication adds additional sources of errors. This results in issues like message and processing delays or the loss of messages. Apart from errors, communication over a network is a more expensive operation compared to executing an application purely locally. However, these costs are acceptable, as they allow the processing of much larger workloads, which could never be handled by a single machine. Thus, many developers pay the costs caused by distribution and provide services which are used by a large number of people regularly. This makes distribution a fundamental tool for applications, despite its associated costs.

Fortunately, distribution and the actor model fit together very well. Distribution is easily incorporated into the actor model, because the actor model restricts interaction to communication between actors. This means that actor applications are naturally designed for communication, and no changes are required to support distribution. Consequently, distribution only needs to be added to the actor runtime environment, which handles the communication process. Distribution is also built into Akka, as it offers remote actors, which can be deployed on other machines. These remote actors


behave like actors on the local machine, and communication between local and remote actors is handled by the Akka runtime.

The ReApper platform exploits these distribution capabilities to allocate additional resources on remote machines in order to handle higher workloads. Additional actor systems are started on remote machines and used to extend the execution environment of the application. This process requires configuration and information monitoring on two different levels. The local level handles configuration and the collection of information on each machine, while the global level manages a global configuration and keeps a global application state. On the local level, resources of a single machine like power limits, threads, processor cores, and thread pinning are managed, and local information like performance data, load information, and energy information is collected. The global level handles the distribution of the application and uses the information collected by each individual machine to derive a global application state. Thus, goals regarding performance or energy are pursued as a distributed system, which uses the configuration of distribution as the tool to achieve those goals. This is done by dividing the workload into smaller chunks, which are distributed to other machines and handled there. The configuration in this case consists of assigning these chunks to different machines. Ideally, only as many additional machines as required to handle the workload are used. In addition to matching the workload to the number of machines, utilizing heterogeneous hardware improves the energy efficiency even more, as it allows more precise matching of resources and workload. Actors can be placed on certain types of hardware which are especially energy efficient at certain workload levels. However, a fine-grained actor placement or migration mechanism is required for that.

A method for creating actor systems and actors on other machines is used to handle growing workloads, and support for this is provided by Akka's remoting module. When workloads shrink again, the number of actors should be decreased, and actors should either be removed or consolidated on fewer machines. This would release resources and in turn lower energy consumption. However, apart from remoting, there is no support for the controlled movement of actors. Akka's clustering module offers automatic distribution of actors on a cluster of machines, but it does not allow resizing the cluster during runtime, nor does it allow specifying the machine which should execute a certain actor. This made it unusable for the ReApper platform, which resulted in the implementation of a custom mechanism for actor migration providing manual actor movement. The implementation uses Akka's remoting module and changes several mechanics of that module to enable actor migration. In addition, some inspiration was drawn from Akka's clustering module and its actor-transfer mechanism to handle the transfer of actor states.

4.6.1 Actor Migration

The implementation can be divided into three different concerns. The first concern is the creation of actor references with migration support. This allows actors to be moved from one actor system to another without breaking existing actor references. The second concern addresses the requirements which actors have to fulfill in order to be migratable. The last concern is the actual migration process of an actor from one machine to another.

Figure 4.12 shows an overview of the components involved in actor migration. The MigratingActorRefProvider replaces the default actor-reference provider in the actor system. It extends the RemoteActorRefProvider and uses its superclass's methods to create default actor references. The default type of actor reference (InternalActorRef or RemoteActorRef) is replaced with the MigratingActorRef. These customized references contain a real actor reference and serve as proxies for default actor references. They contain Akka's LocalActorRefs or RemoteActorRefs, which are used to pass messages to a real actor, as MigratingActorRefs do not refer to a real actor


Figure 4.12 – Overview of custom components for actor migration in Akka (the MigratingActorRefProvider extends Akka's RemoteActorRefProvider and creates MigratingActorRefs, which serve as the default reference type and act as proxies containing a default actor reference, a LocalActorRef or RemoteActorRef; this default reference points to the actual actor instance, which holds state and behavior, is located locally or remotely, and implements the MigratableActor interface providing methods for state transfer and the handling of migration system messages)

directly. Thus, the default actor reference types provided by Akka are used for communication, but they are wrapped in a MigratingActorRef, which adds migration capabilities to actor references. Furthermore, this allows migration without invalidating references, as the proxy reference is always valid. The real references inside those proxies can change when actors are migrated between machines. MigratingActorRefs are type-compatible with regular actor references, because they implement the InternalActorRef interface. This allows the replacement of default actor references without requiring large code changes. The references contained within a MigratingActorRef point to actual actors, which are implemented almost like regular actors. They can contain a state and have a behavior which determines how messages are handled. However, some changes are required in order to allow the migration of such an actor. The actor needs to implement an additional interface, and its state has to be serializable. The MigratableActor interface adds methods used for state transfer during actor migration. This is needed when an actor is migrated between different machines and its state is supposed to be migrated as well. In order to support this, methods for setting and getting the actor state are required. They are used to extract, transfer, and reapply the state to the migrated actor. Furthermore, the application developer has to ensure that the actor's state can be extracted as a Serializable, which is expected as a parameter during the actor's state transfer in the processMigratingMessage method. This same Serializable is also used to restore the migrated actor's state with the setState method.
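The state-transfer contract can be sketched without Akka. The interface below is a simplified stand-in for MigratableActor; only the pairing of a state getter with the setState method and the use of Serializable follow the description, the remaining names are assumptions.

```java
import java.io.Serializable;

// Simplified stand-in for the MigratableActor contract described above:
// the state must be extractable as a Serializable and reapplicable on the
// target machine. Names other than setState are assumptions.
interface Migratable {
    Serializable getState();
    void setState(Serializable state);
}

public class CounterActor implements Migratable {
    private int count; // the actor's mutable state

    public void increment() { count++; }

    @Override
    public Serializable getState() { return count; } // autoboxed Integer

    @Override
    public void setState(Serializable state) { count = (Integer) state; }
}
```

Migration would then extract the state on the source machine, transfer the Serializable over the network, and call setState on the freshly created instance on the target machine.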

Migrating-actor References The default actor reference implementation of Akka does not allow actors to be moved from one machine to another. The ReApper platform replaces this default implementation with a customized one, which is called MigratingActorRef. Regular actor references are bound to a location when they are created. This requires creation of a new reference when the underlying actor is moved to another machine. Moreover, all existing actor references of that actor are invalidated by that process. The custom type of actor reference used by ReApper allows migration by serving as a proxy and redirecting messages to a regular actor reference. The MigratingActorRef contains a reference to the actual actor and forwards messages to that actor. When its underlying actor is migrated to another machine, the reference contained in the MigratingActorRef is replaced with a newly created actor reference to the actor on the other host machine. This keeps the MigratingActorRef valid, and messages can be sent to that reference without interruption, while the contained actor can be migrated without having any effect outside of the MigratingActorRef. In order to guarantee that messages are not lost during the migration process, the customized actor reference buffers any received messages until the actor's migration is completed. These messages are delivered

to the new actor as soon as the actor's transfer and creation is complete, but before messages are redirected to the new actor.
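The buffering-proxy idea described above can be sketched in a few lines of plain Java. This is a deliberately simplified stand-in, not ReApper's actual implementation: the names ProxyRef, tell, beginMigration, and completeMigration are invented for illustration, and a Consumer takes the place of a real actor reference. The sketch only shows the core mechanism: the proxy stays valid while its target is swapped, and messages received during migration are buffered and replayed in order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical, simplified sketch of the proxy idea behind MigratingActorRef.
class ProxyRef {
    private Consumer<String> target;            // stands in for the real actor reference
    private final Queue<String> buffer = new ArrayDeque<>();
    private boolean locked = false;             // set while a migration is in progress

    ProxyRef(Consumer<String> initialTarget) { this.target = initialTarget; }

    // Messages arriving during migration are buffered instead of being lost.
    void tell(String msg) {
        if (locked) buffer.add(msg); else target.accept(msg);
    }

    void beginMigration() { locked = true; }

    // Swap in the new target, then replay buffered messages in arrival order.
    void completeMigration(Consumer<String> newTarget) {
        target = newTarget;
        while (!buffer.isEmpty()) target.accept(buffer.poll());
        locked = false;
    }
}
```

Callers keep using the same ProxyRef before, during, and after the swap, which is the property that lets ReApper migrate an actor without invalidating any references held by other actors.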

MigratingActorRefs are created with the MigratingActorRefProvider, which replaces Akka's default actor-reference provider. The MigratingActorRef contains all parameters which are required to recreate the actor. This includes the Props object, the actor name, as well as the original actor path. This build information is given to the MigratingActorRefProvider when the actor is migrated to another machine. With this information, the actor-reference provider is able to rebuild the same actor on another machine. The MigratingActorRef then redirects its received messages to this new actor. However, the MigratingActorRef only handles references; the actor implementation is also customized in order to support actor migration.

Actors with Migration Capabilities Generally, adding migration capabilities to an actor does not require radical changes to its implementation. The actor can keep most of its behavior and state without large changes to those aspects, except for implementing an additional interface called MigratableActor. However, there are two conditions which have to be fulfilled in order to add migration capabilities.

The first condition is that the actor's state has to be serializable, as the state is sent to the actor's new location during its migration process. This transferred state is then applied to the recreated actor on the new host machine before the actor can be used again. Consequently, the size of the actor's state is also an important factor to consider, because the transfer time depends on the state's size. The longer this transfer takes, the later the actor is able to process new messages, and the more messages are buffered, which further delays processing of new messages. However, this step can be skipped if the actor does not contain any state or does not need to keep its state after migrating.

The second condition is the handling of additional system messages. These messages are called SnapshotRequest and SnapshotReplay. They are used to extract an actor's state and to reapply that state to the moved actor. The method to handle these messages is part of the MigratableActor interface and is available through implementing that interface. This method is called processMigratingMessage and has to be added to the actor's behavior by the developer. When a SnapshotRequest is received, the actor's state is extracted by the processMigratingMessage method and returned to the actor-reference provider. The application developer provides the actor's state as a parameter for the processMigratingMessage method. After receiving the actor state, the provider forwards it to the newly created replacement actor, which applies the transferred state to itself. Reapplying an actor state is also handled by the processMigratingMessage method. It uses the setState method to apply a previous actor state to the actor. This is triggered by receiving a SnapshotReplay message. The setState method has to be implemented by the application developer, as there is no way to generically set an actor state. Handling of these additional system messages cannot be omitted. If a state transfer is not needed, a null value can be given to the processMigratingMessage method and the setState method can be implemented as an empty method. This also prevents the actor-reference provider from sending a SnapshotReplay message to the replacement actor.
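The two conditions can be illustrated with a minimal, Akka-free sketch. The interface below mirrors the getState/setState idea from the text, but the interface name Migratable and the CounterActor example are invented for illustration; the real MigratableActor interface additionally involves processMigratingMessage and the snapshot system messages.

```java
import java.io.Serializable;

// Hedged sketch of an interface in the spirit of ReApper's MigratableActor.
interface Migratable {
    Serializable getState();          // state extracted on a snapshot request
    void setState(Serializable s);    // state reapplied on a snapshot replay
}

// Example actor-like object whose state survives a simulated migration.
class CounterActor implements Migratable {
    private int count = 0;

    void handle(String msg) { if (msg.equals("inc")) count++; }

    int current() { return count; }

    @Override public Serializable getState() { return Integer.valueOf(count); }

    @Override public void setState(Serializable s) {
        // A stateless actor could return null from getState and leave
        // setState empty, matching the null-state shortcut described above.
        if (s != null) count = (Integer) s;
    }
}
```

Transferring the state then amounts to calling getState on the old instance, shipping the Serializable to the new host, and calling setState on the recreated instance.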

The employed logic for state transfer is borrowed from Akka's clustering and persistence modules. These modules provide a similar mechanism for state transfer, which can be used to either persist an actor or to recreate it on another cluster node. However, Akka's implementation is more sophisticated, because it uses a transaction log of messages in addition to a simple

state transfer. This allows recovery to any previous state, while ReApper's state transfer works on a snapshot basis. A custom approach was chosen, as Akka's persistence module captured too much information and was slower. Furthermore, all received messages had to be reapplied to an actor before it was able to reach its most recent state, which increases the migration time.

Migration Process The custom reference and the requirements for actors provide the basis for actor migration. However, the actual process of migration is handled by the MigratingActorRefProvider. This reference provider creates MigratingActorRefs and manages all those created references as well. Actor migration is requested from the custom actor-reference provider using its migrateActor method, which takes an actor's logical path and a destination system as parameters. Before an actor system can be designated as a destination, it has to be registered first. This registration is performed with the registerSystem method, which takes either an address string or an address object as its parameter. An additional identifier is used to differentiate between remote systems. This identifier is also given to the migrateActor method as a parameter to designate a destination for an actor's migration.
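A hypothetical usage of this API surface might look as follows. The MigrationRegistry class, the system identifier, and the address string are all invented stand-ins; only the method names registerSystem and migrateActor, and the rule that a target system must be registered before a migration can name it, follow the text.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the registration/migration API surface; the internals
// are stand-ins and do not perform a real migration.
class MigrationRegistry {
    private final Map<String, String> systems = new HashMap<>(); // identifier -> address

    void registerSystem(String id, String address) { systems.put(id, address); }

    // A migration request is only valid for a previously registered target.
    String migrateActor(String actorPath, String targetId) {
        String address = systems.get(targetId);
        if (address == null)
            throw new IllegalArgumentException("unknown target system: " + targetId);
        return actorPath + " -> " + address; // stand-in for starting the real migration
    }
}
```

Requesting a migration to an unregistered identifier fails the validity check, mirroring the first step of the migration process described below.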

The process of migrating an actor is depicted in Figure 4.13. It shows the migration of an actor from a local actor system to a remote actor system. There are five components which interact during an actor's migration. The MigratingActorRef serves as a proxy and redirects any message it receives to an actor instance. There are two different types of actor instances in Figure 4.13: the local actor instance is the original instance referenced by the MigratingActorRef, whereas the remote actor instance is the instance which is used after the migration is finished. The local actor instance is part of the local actor system, and thus messages from local actors are forwarded to a local actor reference by the MigratingActorRef without requiring serialization. At the end of the migration process, messages are forwarded to a remote actor reference, which points to the remote actor instance. This remote actor instance is hosted on a remote actor system, and messages sent to the MigratingActorRef require serialization after the actor has finished migrating.

Figure 4.13 – Actor migration process
[Sequence diagram between the MigratingActorRef, the local actor instance, the local actor system, the remote actor system, and the remote actor instance: checks (e.g., target system exists); create actor on target system; create actor; confirm creation; lock message queue; buffer messages; request snapshot; send snapshot; replay snapshot; terminate actor; confirm termination; switch target; deliver buffered messages.]

The process' first step is performing various checks, which ensure that the migration can be performed. This includes checking the validity of the designated target system. A target system is valid when it has been registered before the migration process is started. After passing the checks, a new actor is created on the remote system. This results in a remote actor instance which has the same type and implementation as the original instance on the local system. A remote actor reference to the newly created remote actor is then returned to the local actor system. After that, the MigratingActorRef's message queue is locked in order to prevent message processing by the local actor instance. All messages sent to the MigratingActorRef while the message queue is locked are buffered and kept until the end of the migration process. During this lock, a snapshot is requested from the local actor instance. This snapshot is then sent to the remote actor instance, which uses it to update its state to the state of the local actor. When this snapshot transfer is complete, the local actor is terminated by the RemoteActorRefProvider with a PoisonPill. Snapshot messages and the poison pill are not buffered by the MigratingActorRef and are directly delivered to the local actor instance. The local actor system uses Akka's actor-monitoring features (explained in Section 4.1.3) to get a notification about the local actor's termination and waits until it receives that notification. Then, after the notification is received, the MigratingActorRef's target is switched to the remote actor instance. At this point, all messages sent to the MigratingActorRef are redirected to the remote actor instance. The last step of the actor's migration is to replay all buffered messages to the remote actor instance and to unlock the MigratingActorRef's message queue.
After the switch of the target is finished, all received messages are directly forwarded to the remote actor instance and the actor returns to its regular operation mode.
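As a compact re-enactment, the steps above can be traced with plain objects; no Akka is involved, every name in this sketch is illustrative, and the log entries simply mirror the steps of Figure 4.13 in order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, compressed re-enactment of the migration steps from Figure 4.13.
class MigrationSteps {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        Queue<String> buffer = new ArrayDeque<>();
        int localState = 42;                        // state of the local instance

        log.add("checks passed");                   // 1. target system is registered
        log.add("remote actor created");            // 2. create the replacement actor
        log.add("queue locked");                    // 3. start buffering messages
        buffer.add("msg-during-migration");         //    (a message arrives meanwhile)
        int snapshot = localState;                  // 4. snapshot request extracts the state
        int remoteState = snapshot;                 // 5. snapshot replay applies the state
        log.add("local actor terminated");          // 6. PoisonPill, await the notification
        log.add("target switched");                 // 7. proxy now points to the remote actor
        while (!buffer.isEmpty())                   // 8. replay buffered messages in order
            log.add("delivered " + buffer.poll());
        if (remoteState == 42) log.add("state intact");
        return log;
    }
}
```

The ordering matters: the target is only switched after the termination notification, and buffered messages are only delivered after the switch, which is what keeps the migration non-disruptive.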

Differences in Behavior and Lifecycle Actor migration changes the behavior of actors with migration capabilities compared to the behavior they exhibit using unmodified Akka. Additional message delays can occur on top of the usual delay caused by Akka's remoting, because messages are buffered during the migration process. This can make actors respond more slowly and cause additional timeouts if that is not taken into consideration. Furthermore, actors capable of migration also have to process more messages, as system messages sent by the migration mechanism need to be processed by the actor. These messages can affect the average processing time of actors significantly, because actions caused by those messages, such as state transfer and reapplying of the actor's state, can take a large amount of time if the migrated actor's state is particularly large. In addition to those extra messages, actors are also terminated and recreated more frequently as they are moved between different machines. This can cause more frequent cleanup of actor resources, such as shutting down database or network connections.

These terminations also alter the actor lifecycle, as shutting down an actor reference in Akka is not equivalent to shutting down a MigratingActorRef. MigratingActorRefs continue their lifecycle even when their proxy target is shut down and recreated on another machine, which results in a new lifecycle for the target but not for the MigratingActorRef itself. However, when the proxy target is terminated without initiating a migration process, then further usage of the wrapping MigratingActorRef is deemed unsafe. This behavior is the same as the one displayed by regular actor references created by Akka.

Energy Hints and Marking Actor Groups The platform requires information from the application developer about which actors and actor groups can be migrated. This enables the specified actors to be moved, as the platform is not able to determine these actors by itself. In addition to that, the developer can provide hints about the effects of migrating certain actors on the application's energy consumption. This can be useful to decide the order of migration. There are different options to provide this information to the platform: either it is included in the application.conf file, which is parsed by Akka during its initialization, or it is put into a descriptive document like an XML or JSON file. The information provided by the developer should include the actor classes' names in order to indicate the classes of actors which can be moved and yield energy savings upon migration. In addition to the actor classes, their concrete actor paths are required as well, because they are used to identify the actor references which are then moved by the platform. ReApper cannot automatically analyze this type of information yet; for the applications evaluated in Section 5.1, it is determined manually. However, including this information is still useful, as it enables manual migration and can be used later on when a component to analyze this information is developed.
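As a sketch of the descriptive-document option, such a file could look like the following JSON fragment. All keys, class names, and paths are invented for illustration; the text above only prescribes that actor class names, concrete actor paths, and energy hints are included.

```json
{
  "migratable-actors": [
    {
      "class": "com.example.app.ImageWorker",
      "paths": ["/user/image-worker-1", "/user/image-worker-2"],
      "energy-hint": "high-savings"
    },
    {
      "class": "com.example.app.StatsCollector",
      "paths": ["/user/stats"],
      "energy-hint": "low-savings"
    }
  ]
}
```

The same information could equally be expressed as entries in application.conf, since Akka's HOCON configuration is parsed during initialization anyway.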

The ReApper platform implements a custom migration mechanism, as Akka does not contain a mechanism which fulfills the criteria for migration in this work. The custom migration implementation uses proxy references which contain a regular actor reference to allow migration without invalidating references. These proxy references work by redirecting their messages to proxy targets and perform migration by replacing their contained references when an actor is moved to another machine. The migration process uses state transfer and message buffering to ensure that actor migration is performed in a non-disruptive manner. Additional requirements are added to the implementation of actors in order to support the state transfer used during the migration process.

4.6.2 Exploiting Heterogeneous Hardware

Scaling and distribution are particularly useful for higher levels of workload, as a single machine is not able to handle those levels on its own. Thus, applications are distributed over multiple machines, and these machines execute the application together. However, there are also workload levels which exceed a single machine's capacity only by a small amount. Adding another machine of the same type can enable the application to handle that level of workload, but the static energy consumption is at least doubled. The gap between the additionally provided resources and the actually required resources can be significant, and energy consumption is increased unnecessarily. However, this effect can be toned down by making that gap and the increase in energy consumption as small as possible. A different type of hardware can be used in this scenario: one that provides enough resources to handle the slight increase in workload without using too much energy. Despite their low energy consumption, continuously adding such small machines would not be beneficial to the application's efficiency, because the overhead in coordination and communication would use up too many resources. Higher workload levels would need a different type of hardware, which is able to handle them without consuming as much energy as two machines of the same type. These scenarios are ideal for heterogeneous hardware, whose components can have different peaks of efficiency. Their differing properties make them suitable for different scenarios, and each type of hardware is most efficient in its own workload range. Furthermore, these different types of hardware can be configured to further lower their energy consumption by using the methods provided by the configurator in Section 4.4.

However, heterogeneous hardware also adds restrictions to the platform. Some machines provide more configuration options, while other machines excel with their properties despite providing fewer configuration methods. Furthermore, not all devices have access to the same information sources, and

thus some devices provide less diverse information. This means that not all information-monitoring methods of the platform can be used by all device classes, and many configuration methods are also not available depending on the used hardware.

The difference between the energy consumption of the used device classes constitutes the potential energy savings. This difference can be quantified by executing an application with the same level of workload on different device classes. In general, devices of the low-range class perform most efficiently at low workload levels compared to devices of other classes at the same workload levels. The use of low-range devices is mostly restricted by their maximum achievable performance, but low-range devices possess a large potential for energy savings at those levels. Mid-range class devices handle workload levels above that. The mid-range class usually performs more efficiently than the high-range class at medium levels of workload. Naturally, workloads above medium levels are executed most efficiently by the high-range class, as the other device classes cannot handle these levels of workload. The difference between low-range and mid-range devices can be substantial. However, as their performance is low and their workload range is rather small, the potential energy savings of low-range devices also appear to be small in absolute numbers. The difference between the mid-range and high-range classes can be much larger. This can result in a larger potential for power savings, as a wide range of workload levels is processed more efficiently by the mid-range class.

Practically, the actual savings in power consumption are not as large as the potential savings, because inactive devices are not turned off entirely. This is done in order to keep response times during the migration process low. Consequently, the maximum potential power savings are decreased by the standby power consumption of inactive devices. Fortunately, the amount of power used by the devices in standby is very low, thus the actual power savings are only a little less than the maximum potential savings. This allows energy to be saved by migrating the application, or parts of it, according to the applied workload levels, thereby exploiting the properties of heterogeneous hardware components.
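The relation between potential and actual savings can be written down as simple arithmetic: the net saving is the power difference between the two devices at the given workload, reduced by the standby draw of the now-inactive machine. The class name and the wattage values in the usage example are illustrative placeholders, not measurements from this work.

```java
// Back-of-the-envelope sketch: net power savings from migrating a workload to
// a smaller device while the larger device stays in standby.
class SavingsEstimate {
    // potential = big machine's draw minus small machine's draw at the same workload;
    // actual    = potential minus the standby power of the idle big machine.
    static double netSavings(double bigLoadW, double smallLoadW, double bigStandbyW) {
        return (bigLoadW - smallLoadW) - bigStandbyW;
    }
}
```

For instance, replacing a hypothetical 30 W machine with a 5 W machine while the larger machine draws 2 W in standby yields a net saving of 25 W − 2 W = 23 W.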

4.7 Conclusion

The ReApper middleware platform presented in this work uses information and configuration on different levels of a machine to improve the energy efficiency of actor applications and to lower the machine's energy consumption. Different levels of the system provide different tools and options for configuration as well as different sources and types of information. This information is used to match a machine's resource usage to the amount that is required to handle the workloads experienced by the application executed on the platform. This lowers energy consumption without having a negative impact on the application's performance.

ReApper's implementation uses Akka, although some of its modules are customized and some of its concepts are exploited in order to add configuration and information-collection functionalities. The configuration options are extended by adding actor migration to the platform. Actor migration is used to further improve energy efficiency, as different hardware components work more energy-efficiently at different workload levels. Furthermore, this allows processing of higher levels of workload, as it also adds distribution, which enables scaling for actor applications on the ReApper platform.

Additionally, heterogeneous hardware provides flexibility in additional resources and energy consumption, which allows applications to be better matched to the resources they require at low additional energy costs. This flexibility is provided by using different device types, each of which has its specific advantages and is used in different scenarios.

The next chapter evaluates the performance and benefits of the ReApper platform and introduces examples of actor applications which are used for ReApper's evaluation.

5 Analysis

This chapter begins with the evaluation of this work's ReApper platform, which measures the impact of ReApper on the energy consumption and performance of sample applications. After that, other works with relevance to this work are introduced and compared to it. The last section discusses limitations and further improvements for the ReApper platform and this work.

5.1 Evaluation

This section presents this work's evaluation and its results. First, the environment and setup used for the evaluation are described. The hardware which was used to execute the applications is introduced first. These hardware components are divided into different classes, which are described in Section 2.4.2. Then, the applications used for the evaluation are illustrated. Two different applications were implemented in order to measure the impact of the ReApper platform on the performance and power consumption of applications. The last section contains the evaluation's results.

5.1.1 Environment and Setup

In Section 2.4.2, the fundamentals of ReApper's utilization of heterogeneous hardware were introduced. However, only a general picture of the different device classes was given. This section shows the hardware used for this work and describes practical limitations and properties using concrete examples. Furthermore, the applications used to evaluate the ReApper platform are also introduced by describing their structure and implementation.

Hardware Setup

Table 5.1 shows the three machines used in this work, where each machine is representative of a class of devices. The Odroid-C2 is a single-board computer equipped with an ARM quad-core processor and is the representative of the low-range device class. The representative of the mid-range class is a NUC device from Intel, which uses a Core U-series processor. The last device is a workstation computer from Fujitsu with an Intel Xeon processor representing the high-range class, although devices in this class can be more powerful than the presented example.

Low-range Class Other devices besides the Odroid-C2 were also taken into consideration, as many single-board computers have been developed since the introduction of the Raspberry Pi, which made this type of device widely known. There were devices which could be operated at

Name               Idle     Load     Processor         C   T   RAM     Storage
ODROID-C2          3.5 W    5 W      ARM Cortex-A53    4   4   2 GB    64 GB eMMC
Intel NUC6i5SYH    11.2 W   22.8 W   Core i5-6260U     2   4   16 GB   128 GB SSD
Fujitsu W550       15 W     75 W     Xeon E3-1275v5    4   8   16 GB   256 GB SSD

Table 5.1 – Heterogeneous hardware used in this work (C := cores; T := threads)

very low power levels, but their computing performance was unacceptable. The range of workloads which could be handled by those devices was very small, and only the lowest workloads were processable. Additionally, an Ethernet port was required in order to enable communication between different devices, and the device should be able to provide gigabit connectivity in order to avoid network limitations. A viable alternative were machines with Intel's Atom processors, but their energy consumption was too high despite their acceptable computing performance. In the end, the Odroid-C2 was chosen, because it had a gigabit Ethernet connection and provided an acceptable level of performance, while also using little power.

The Odroid-C2, which serves as this work's low-range class device, provides comparably low computing performance. Even though it is equipped with a quad-core processor, the performance is far below that of the devices of the higher classes. This is owed to the ARM architecture, which provides weak computing power but is able to do so at a very low level of energy consumption. Furthermore, it has only 2 GB of memory, which limits the machine's performance even more. However, it uses far less power: its power consumption is merely 3.5 W at idle and only goes up to 5 W when it is under load. Consequently, the small difference between idle and load power consumption limits the benefits of reconfiguration at runtime.

Almost all configuration methods provided by the configurator can be used with this type of device. The only exception is power limiting, as this requires support by the processor. Dynamic voltage and frequency scaling can be used as a replacement, but the granularity is not as fine as the one provided by Intel's RAPL. Moreover, because of its low performance, configuring this type of device can reduce the performance to levels which restrict its effectiveness to very low levels of workload. This device is more limited in terms of information-monitoring methods, because most information provided by the hardware, such as energy information, is not available. Other information sources like load information or application metrics do not depend on hardware features and are thus available on this device.

Despite their restrictions on configuration options and information sources, these devices are extremely useful for handling workloads which slightly exceed the other classes' capabilities and for workloads during an application's off-time. They can bridge those gaps with relatively low energy costs, as they add a maximum of 5 W to the power consumption, in contrast to the other devices, which need at least 10 W additionally.

Mid-range Class The devices in this class use more energy than the previously described single-board computers and less than conventional workstations or servers. Candidates for this class were stronger and bigger ARM machines as well as laptop or notebook hardware. Most available ARM hardware is either designed to be used in mobile phones or tablets, and was thus already covered by the low-range class, or it is designed for usage in servers, which would far exceed the power requirements of this class. However, processors for laptops fitted this work's requirements for the mid-range class better. Especially the models designed for low energy

consumption had an acceptable level of performance, while only using a moderate amount of energy. This resulted in choosing Intel's NUC6i5SYH, which is part of the NUC form factor. These devices utilize low-energy x86 processors and can thus provide performance on par with laptops and smaller computers without reaching the energy consumption levels of desktop computers. A gigabit Ethernet connection and many other common parts are also offered by the NUC6i5SYH, as standard consumer hardware is used. Moreover, since these devices use standard x86 hardware, all configuration and information-collection methods can be utilized, as development was mainly done on this kind of hardware.

The Intel NUC6i5SYH contains a Core i5-6260U dual-core processor, which features simultaneous multithreading and is able to process four threads at the same time. Furthermore, it is equipped with 16 GB of RAM and can thus provide enough memory for most applications. This enables it to process respectable levels of workload, comparable to lower-end and mid-range desktop machines. The power consumption of this NUC device is between 11 W in idle mode and 23 W under load. The NUC's level of power consumption is situated above the Odroid-C2 and below desktop or workstation machines like the W550 used for the high-range class. The difference between its idle and load energy consumption offers some room for configuration.

In contrast to the Odroid-C2, all configuration options are available on the NUC. Intel's RAPL interface provides a tool for power limiting, and the other configuration options are also usable, because they do not depend on hardware features. RAPL in particular allows fine-grained control of the processor's power consumption. However, only a few power limits are viable, as the range between idle and load power consumption only comes to around 11 W. Similarly, all information-monitoring methods are available on the NUC device, and they can be used to monitor the system's status in great detail.

The NUC6i5SYH is used for many workload levels, as it can handle a large range of workloads. Its energy efficiency is high compared to the high-range class, but it does not reach the levels displayed by the Odroid-C2. This type of device can handle workloads at medium and high levels before its capacity is exceeded. These levels of workload are also handled more efficiently by the NUC compared to the workstation computer, which uses 4 W to 10 W more power than the NUC at the same level of workload.

High-range Class The high-range class contains computing hardware which is able to handle workload levels beyond the levels handled by the mid-range class. The devices considered for this class range from powerful desktop machines to workstations and server hardware found in data centers. Usually, these types of machines consume a considerable amount of power and energy at their highest performance, but can also reach very low consumption levels in their idle states. The requirement for this class is to provide more computing power than the mid-range class, while having as little static power consumption as possible. Of course, a gigabit Ethernet connection was also required in order to connect the machine to the other machines used in this work. The Fujitsu Celsius W550 was chosen, as it features a powerful processor which is part of the Skylake processor generation, like the processor used in the NUC. This avoids effects on performance and power consumption caused by differences in processor architecture: the processors of the NUC and the workstation computer have the same architecture and should be able to reach similar efficiency when executing the same pieces of code.

The Fujitsu Celsius W550 is a standard tower computer. It contains an Intel Xeon E3-1275v5 quad-core processor. This processor features simultaneous multithreading, which allows

processing of eight threads at the same time. Furthermore, the machine is equipped with16 GB of RAM, which satisfies most applications’ memory requirements. Moreover, the highestlevel of workload which can still be handled by this machine marks the highest level ofworkload which is of interest for the evaluation of this work. Thus, all levels of workloadsrelevant to this work can be handled by this machine. The power consumption of the W550is between 15 W when it is idle and 75 W under full load. The idle power consumption ofthe W550 is only a little bit higher than the NUC’s power consumption, but the upper limitof the power consumption is well above that of the NUC. The wide range between powerconsumption during idle and load allows configuration for different levels of workload.

The W550 provides the same configuration options as the mid-range NUC device. However, due to the wider range between the idle and load power consumption of the W550, there are many viable power limits which can be used to restrict the machine's power consumption. The workstation machine also supports all information-monitoring capabilities provided by ReApper.
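Such processor power limits correspond to RAPL constraints, which can be applied on Linux through the powercap sysfs interface. The following is a minimal sketch of that mechanism, not ReApper's actual implementation; the `intel-rapl:0` path layout and the use of constraint 0 (the long-term limit) are assumptions about a typical single-socket Intel system, and writing the files requires root privileges.

```python
from pathlib import Path

RAPL = Path("/sys/class/powercap/intel-rapl")

def set_power_cap(domain: Path, watts: float):
    """Write a long-term power limit (constraint 0) in microwatts; needs root."""
    (domain / "constraint_0_power_limit_uw").write_text(str(int(watts * 1e6)))

def cap_package_and_pp0(watts: float):
    """Cap both RAPL domains used in this evaluation (assumed default paths)."""
    pkg = RAPL / "intel-rapl:0"        # PKG domain of socket 0
    pp0 = pkg / "intel-rapl:0:0"       # first subdomain, typically the cores (PP0)
    set_power_cap(pkg, watts)
    set_power_cap(pp0, watts)
```

For example, `cap_package_and_pp0(20)` would correspond to the 20 W cap used on the workstation in the measurements below.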

The Fujitsu Celsius W550 can be used for all workload levels, but it is usually not as energy efficient as the mid-range or the low-range device at lower workload levels. The performance provided by the W550 is two to three times higher than the performance provided by the NUC device used here, but the W550's power consumption can also be three times higher than that of the NUC. Consequently, the difference between the W550 and the Odroid-C2 is even larger than its difference to the NUC.

Measurement Setup

The measurement setup contains one machine of each device type introduced in the previous section, for a total of three different machines. All three machines are connected to a gigabit network, which allows them to communicate with each other. The power measurement device provides three power plugs. The machines are plugged into the measurement device's power plugs when their power consumption is measured. A machine is attached exclusively to the measurement device when the power consumption of a single machine is evaluated. In the other cases, where the effects of ReApper's migration mechanism on the power consumption are evaluated, all three devices are plugged into the measurement device. The power measurement device is connected to an additional machine, in order to separate control of the power measurement device from the measured machines. This machine is solely responsible for collecting the power information produced by the measurement device. Another separate machine is used to start the applications and trigger measurements. This additional machine is also responsible for creating the requests and the workload for the applications used in this work's evaluation. Consequently, five different machines are used in the measurement setup. Three of those devices (the Odroid-C2, the Intel NUC6i5SYH, and the Fujitsu Celsius W550) execute evaluation applications, while the other machines are used to control the power measurements and the evaluation process.

The evaluation is divided into two main scenarios. The first scenario is used to measure the effects of ReApper's configuration methods on a machine's power consumption during the execution of different applications. These measurements are focused on the effects of power limiting of the processor on a machine's performance and power consumption. Furthermore, these measurements are limited to the NUC device and the workstation (Fujitsu Celsius W550), as the Odroid's potential for power saving is limited: ReApper's configuration methods have a significant impact on the Odroid's performance, while the Odroid's power consumption is only lowered by a small amount. The results of this scenario are presented as a diagram showing the machines' power consumption for each application at different power caps and workload levels.

[Figure 5.1 – Steps of the migration process in detail (states: powered off, suspend, wake up, initialize, transfer, operate, shutdown actors; phases: immigrate, emigrate)]

The second scenario evaluates a sample run of the applications. Different workloads are applied to the applications during such a run. These different workloads are called workload profiles. The sample runs are mainly used to measure the effects of utilizing heterogeneous hardware and actor migration on the power consumption of the execution of applications. The results are presented in two diagrams for each application. One of these diagrams shows the course of the machines' power consumption, while the other shows the course of the applied workload.

The values of the graphs are based on five measurements. The measurements of the effects of power caps on the applications were performed by setting the power cap and measuring the power consumption at different workload levels. The duration of each measurement was 30 s. Each measurement was averaged before the average of all five measurements was calculated. The measurements of the sample runs for each application were performed in a similar way. However, instead of manually configuring the machines' resources, the configuration and migration were performed according to profiles which were created before the sample runs. These profiles are based on the experience drawn from the measurements of the power caps' effects on the applications. Each sample run was also performed and measured five times. Another difference to the power-cap measurements is the duration of each measurement: different workloads were applied for 60 s at a time, thus also increasing the measurement duration to 60 s. The averages were also calculated differently. Instead of averaging the power consumption over the duration of a measurement, averages over consecutive 10 s windows were calculated for the whole sample run. Again, each set of five measurements was averaged before it was plotted.
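The averaging scheme described above can be sketched as follows; the function names and the data layout (one list of power samples per repetition) are illustrative, not taken from the thesis' measurement tooling.

```python
from statistics import mean

def average_run(samples):
    """Average one measurement run (a list of power samples in W, e.g. 30 s long)."""
    return mean(samples)

def average_across_runs(runs):
    """Average the per-run averages of the five repeated measurements."""
    return mean(average_run(r) for r in runs)

def windowed_average(samples, sample_rate_hz, window_s=10):
    """For sample runs: average the power trace over consecutive 10 s windows."""
    size = int(sample_rate_hz * window_s)
    return [mean(samples[i:i + size]) for i in range(0, len(samples), size)]
```

For a power-cap measurement, `average_across_runs` yields one point per (cap, workload) pair; for a sample run, `windowed_average` is applied to each repetition before the five window series are averaged point-wise.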

Evaluation Applications

Two different applications were implemented for this work's evaluation. The first application is a key–value store, which provides common storage operations similar to key–value stores used in productive environments. The second application is a data-stream–processing application, which extracts data from a data-stream source and processes the extracted data using a map-reduce approach.

Moreover, the migration process is extended with mechanisms for suspending and waking machines. This process is illustrated in Figure 5.1. Unused machines are started in a suspended state or they are powered off. When the decision is made to migrate an actor to an unused machine, that machine is woken up. The first step is to initialize actors of the type which is migrated to that machine. The process of migration to a machine is also called immigration. Then their state is transferred to the machine. The actors are fully operational after completing their state transfer, and the hosting machine is in the operate state. Machines which host actors start in the operate state. When actors are migrated to another machine, the emigration process is started. This happens concurrently to the immigration process on the other machine. After the initialization has finished on the target machine, the state transfer is performed on the current host machine. The next step of the emigration process is to shut down the actors which have migrated to another machine. Then the former host machine is either suspended or powered off in order to lower or entirely remove its power consumption from the measurement.
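The sequence of states on the source and target machines can be sketched as a simple state machine. This is an illustrative abstraction of Figure 5.1; the class and method names are hypothetical and not part of ReApper's API, and the two phases, which run concurrently in the real system, are shown sequentially for clarity.

```python
class Machine:
    """Holds a machine's name and its current migration-related state."""
    def __init__(self, name, state="suspended"):
        self.name = name
        # one of: suspended | powered_off | initialize | transfer | operate | shutdown
        self.state = state

def migrate(source, target, actor_state):
    # Immigration on the target machine.
    if target.state in ("suspended", "powered_off"):
        target.state = "initialize"   # wake up, then create actors of the migrated type
    target.state = "transfer"         # receive the actors' state
    received = dict(actor_state)
    target.state = "operate"          # actors are fully operational

    # Emigration on the source machine (concurrent with immigration in reality).
    source.state = "transfer"         # send the actors' state
    source.state = "shutdown"         # stop the migrated actors
    source.state = "suspended"        # or powered off, removing idle consumption
    return received
```

A call such as `migrate(Machine("odroid", "operate"), Machine("nuc"), state)` mirrors the Odroid-to-NUC migration used later in the sample runs.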

[Figure 5.2 – Key–value store structure. The figure shows the key–value store actor group (Connector, Checksum Generator, Compressor, Executor) together with the client and the storage/memory backend, including the request path (request, checksum, compression, execute, access) and the reply path (data/confirmation, decompression, checksum verification, answer).]

Key–value Store The key–value store for this work's evaluation is implemented with actors in order to execute it on the ReApper platform. The structure of the key–value store is illustrated in Figure 5.2. An actor group of the key–value store contains four different actor types. The connector provides the interface for clients. This actor accepts requests and sends them to the other actors of the key–value store for processing. The first processing actor is the checksum generator. This actor calculates the checksum of the data which was sent to the key–value store by the client. Then the checksum is added to the request and sent to the compressor. The compressor applies compression to the data and replaces the request data with the compressed data. This modified request is then sent to the executor, which accesses an underlying storage, database, or the machine's memory to execute the request. This process results in either data or a confirmation, which is sent back to the compressor for decompression. The reply is then sent to the checksum generator in order to verify the checksum. After that, the reply is handed over to the connector, which sends the reply to the client. Although several different operations were implemented, only the SET operation is used in this work's evaluation in order to guarantee similar workloads, which kept the measurement process simple.
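The pipeline above can be condensed into a small sketch. It collapses the four actor types into plain functions for a SET request; it is not ReApper/Akka code, and the in-memory `store` dictionary stands in for the executor's storage backend.

```python
import zlib

# Illustrative stand-in for the executor's storage backend.
store = {}

def handle_set(key, data: bytes):
    """Process one SET request through the four pipeline stages."""
    checksum = zlib.crc32(data)          # checksum generator
    compressed = zlib.compress(data)     # compressor
    store[key] = (checksum, compressed)  # executor: access the in-memory store
    # Reply path: decompress and verify the checksum before answering the client.
    restored = zlib.decompress(compressed)
    assert zlib.crc32(restored) == checksum
    return "OK"                          # connector sends the confirmation back
```

The checksum and compression stages are what make the workload CPU-bound, which is why the SET operation alone already exercises the processor-power-capping behavior measured below.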

[Figure 5.3 – Data-stream processing structure. The figure shows the data-stream–processing actor group (Data Consumer, Mapper, Reducer) fetching data from an external data source, with the fetch, send, and process paths between the actors.]

Data-stream Processing The data-stream–processing application is also implemented as an actor program. It uses the map-reduce approach to process string data and extract information from that data. The application's structure is presented in Figure 5.3. An actor group of the data-stream application contains three different types of actors. The data consumer is used to fetch data tuples from an external data-stream source. It actively connects to a data source in order to fetch the data tuples, whereas the key–value store waits passively for client requests. The data from the data-stream source is then sent to a mapper actor. The mapper applies a map function to the data. After several tuples have been processed, the results are sent to the reducer. The results of the mapper actor are not sent immediately in order to avoid flooding the reducers with many small requests. The reducer applies a reduce function to the mapper results and combines them. The processing of the data-stream application ends with the reducer.
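The consumer-mapper-reducer flow, including the batching used to avoid flooding the reducer, can be sketched as follows. This is a generic word-count-style illustration; the map and reduce functions of the actual evaluation application and the batch size are not specified in the text, so the ones below are assumptions.

```python
from collections import Counter

BATCH_SIZE = 4  # assumed: mappers emit results only in batches of this size

def mapper(tuples):
    """Map each string tuple to (word, 1) pairs; emit only full batches."""
    batch = []
    for line in tuples:
        batch.extend((word, 1) for word in line.split())
        if len(batch) >= BATCH_SIZE:
            yield batch       # send a full batch to the reducer
            batch = []
    if batch:
        yield batch           # flush the remainder at end of stream

def reducer(batches):
    """Combine the mappers' partial results into one count per word."""
    totals = Counter()
    for batch in batches:
        for word, count in batch:
            totals[word] += count
    return totals
```

Batching trades a little latency for fewer, larger messages between the mapper and reducer actors, which matches the flooding-avoidance rationale given above.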

5.1.2 Key–value Store

The results for the evaluation of the key–value store are divided into two parts. The first section describes the effects of power caps on the power consumption and the performance of the key–value store. This is further divided into diagrams for the different machines. The second section analyses a sample run of the key–value store with different workloads. Apart from the analysis of the power consumption and workload, a short discussion of latency is also part of this section. This is exclusive to the key–value store.

Power Caps The effects of power caps are evaluated in two ways. The first section contains an evaluation of the power caps' effects on the machine's power consumption. The second section presents an analysis of the effects caused by power caps on the application's latency.

– Power Consumption Figure 5.4 presents the power consumption at different workload levels with different power caps for the key–value store. Power caps affect not only the power consumption, but they also restrict the maximum processable workload for the machine. The power caps were applied to the PP0 and PKG domains. Each power cap restricts the power consumption of the processor to a certain value. For instance, the 40 W power cap of the workstation limits the total system power consumption to around 60 W, while the 20 W power cap restricts the total power consumption to around 50 W. Only the 5 W power cap of the workstation is slightly inaccurate, as it limits the system power consumption to around 20 W, while it should have restricted it to 25 W. A similar behavior is also displayed by the NUC.

[Figure 5.4 – Power consumption of the key–value store with power caps on different devices. (a) Workstation power consumption with different power caps; (b) NUC power consumption with different power caps. Both plots show power consumption [W] over throughput [kOps/s], for the caps no limit, 40 W, 30 W, 20 W, 10 W, and 5 W on the workstation and no limit, 10 W, 8 W, 6 W, 4 W, and 2 W on the NUC.]


Figure 5.4a shows the results of power capping on the power consumption of the workstation machine. The performance is at its highest when no power cap is set, but the power consumption is also at its highest without a power cap. The power caps at 40 W, 30 W, and 20 W each lower the power consumption by exactly 10 W compared to the next higher power cap. However, power caps lower than 20 W do not affect the power consumption as much as the higher power caps do. They only limit the machine's performance without saving a lot of power compared to the power cap at 20 W. The workstation's highest potential power savings are achieved with power caps of 20 W and above. However, the power cap at 5 W also provides a considerable reduction in power consumption, thus being a useful example for a lower power cap. Consequently, significant power savings can be achieved with this method, although not all power caps are viable or useful for power saving.

Figure 5.4b illustrates the power consumption of the Intel NUC device. The range of power caps for the NUC is much smaller than the range of the workstation. Only power caps below 10 W lower the NUC's power consumption, in contrast to the range of 40 W provided by the workstation. Consequently, this leaves only a small range of power caps which can be applied to the NUC. Higher power caps are more effective at saving power than lower power caps. This behavior is also seen with the workstation's power caps. The biggest savings are achieved with the 8 W limit, although this only results in a reduction of 2 W or 3 W compared to the next higher power cap. The lower power caps have an even smaller effect on the power consumption, and their effects on the NUC's power consumption are almost negligible. However, the performance is severely restricted by these power caps. Nonetheless, the NUC's base power consumption is still lower than the workstation's base power consumption. The difference in power consumption between the NUC and the workstation at similar workload levels ranges from 3 W up to almost 10 W.

– Latency Figure 5.5 shows the latency of the key–value store's execution on the different devices with different power caps from the client's view. This figure is further divided into a diagram of the workstation's latency and a diagram of the NUC's latency.

[Figure 5.5 – Latency of the key–value store with power caps on different devices. (a) Workstation latency with different power caps; (b) NUC latency with different power caps. Both plots show latency [ms] over throughput [kOps/s], for the same power caps as in Figure 5.4.]


Figure 5.5a shows the latency of the workstation with different power caps. The latency of the workstation with power caps of 20 W and higher is mostly stable until around 200,000 operations per second. The latency with those power caps is 0.35 ms on average. However, the latency with power caps of 10 W and below exhibits a substantial overall increase compared to the latency of the key–value store with the other power caps. This results in an average latency of 0.5 ms with the power cap of 10 W and an average latency of 0.8 ms with the power cap of 5 W. In general, the latency starts to spike as soon as the device's processing-capacity limit is reached. As power caps lower the workstation's processing-capacity limit, lower power caps cause the latency to spike earlier. The latency with the power caps of 30 W and 20 W and the latency without a power cap spike up to 6 ms, while the latencies with lower power caps spike at lower values between 1.5 ms and 4 ms.

Figure 5.5b illustrates the latency of the NUC with different power caps. The general behavior of the NUC's latency is similar to the behavior shown by the workstation; the latency starts to spike as soon as the processing-capacity limit is reached. The major difference to the workstation's latency is the maximum latency of 14 ms with the power cap of 2 W. Yet, the other latency spikes are also smaller (between 1.5 ms and 4.5 ms) than the latency spikes of the workstation. However, the average latency starts to deteriorate with lower power caps. The average latency is between 0.35 ms and 0.4 ms without a power cap and with the power caps of 10 W and 8 W. The power caps of 6 W and 4 W cause the average latency to rise to 0.5 ms and 0.75 ms respectively. The power cap of 2 W causes the latency to reach a level which reduces the service quality of the key–value store by a large amount. In this region, power caps are rather disadvantageous because of their massive effect on the latency.

Consequently, the latency is usually not affected by power caps as long as the processing-capacity limit is not reached. The average latency is not affected much when higher power caps are used and the workload levels do not exceed the processing-capacity limit. However, when the processing-capacity limit is reached, the latency rises to unbearable levels. Power caps only move that limit to lower maximum workload values. This means that latency is not affected too much as long as the power caps are raised before the processing-capacity limit is reached.

Sample Run Figure 5.6 contains the results for a sample run of the key–value store under different workload scenarios. Figure 5.6a shows the power consumption during the course of the sample run, while Figure 5.6b illustrates the throughput from the client's point of view. Figure 5.7 presents the latency of the key–value store during the course of the sample run. The graph of ReApper with migration represents the execution of the key–value store on all three devices, as the application is moved between the machines at different workload levels. The graphs of ReApper without migration and of default Akka result from the execution of the key–value store on the workstation. Only the power consumption of the workstation was measured for these graphs, while the power consumption of all three devices was measured for the remaining graphs.

[Figure 5.6 – Key–value store sample run with different workloads. (a) Power consumption [W] over time [s]; (b) workload profile, throughput [kOps/s] over time [s]. Both plots mark the phases in which the application runs on the Odroid, the NUC, the workstation, and the NUC again, and show the graphs for default Akka on the workstation, ReApper without migration on the workstation, ReApper with migration, and, in (a), the potential of ReApper with migration and poweroff.]

– Workload Profile The workload profile imitates the usage of a service over the course of a day. It starts at very low levels, which are experienced during the night or in the early morning. Then, the workload increases until it reaches its peak during the day at 220,000 requests per second. After its peak, the workload slowly decreases until it reaches low levels again. The workload level is mostly the same between ReApper with migration and ReApper without migration. The workload profile of default Akka serves as reference for the other graphs. The workload profile of ReApper without migration mostly overlaps with the profile of default Akka. Nonetheless, changes to the workload are slightly delayed in the graph of ReApper without migration, as some time is required to adjust the resources to the applied workload. The workload profile of ReApper with migration experiences small drops during migration phases, and it takes around 10 s to reach the targeted levels again. Additionally, ReApper with migration also requires some time to adjust the resources of the machine to the applied workload level, which further delays reaching the targeted workload levels. However, the workload drop during migration only happens when the application is moved from a smaller machine to a more powerful machine. This occurs two times, at 240 s and at 600 s, where the application is moved from the Odroid to the NUC and then from the NUC to the workstation. The throughput is only slightly affected when the application is migrated from the workstation to the NUC at 1020 s.

– Power Consumption The power consumption rises with the applied workload levels and decreases as the workload decreases. The highest power consumption occurs between 720 s and 780 s at 60 W to 67 W. The graph of default Akka's power consumption is higher than the other graphs at almost all workload levels. The graph of ReApper without migration is significantly lower than default Akka and follows the trend set by the power consumption graph of default Akka. The graph of ReApper with migration is below the power consumption of the other graphs at the lowest and lower workload levels. The execution on ReApper with migration has significantly lower power consumption than the execution on the workstation in the time span between 10 s and 240 s. The power consumption of ReApper with migration spikes at 240 s, as the application migrates from the Odroid to the NUC. This results in higher power consumption than the execution on the workstation until 360 s, where ReApper with migration can achieve lower power consumption again. This lasts until 600 s, where the migration from the NUC to the workstation takes place. The time span between 600 s and 1020 s shows higher power consumption than the execution on the workstation, as the idle and suspended power consumption of the Odroid and the NUC adds to the workstation's power consumption when the key–value store is executed with migration enabled. Overall, power savings are achieved at low workload levels, but the medium and higher workload levels cause an increase in power consumption compared to the execution on the workstation.

[Figure 5.7 – Latency of key–value store sample run with different workloads. The plot shows latency [ms] over time [s] for default Akka on the workstation, ReApper without migration on the workstation, and ReApper with migration, and marks the Odroid, NUC, workstation, and NUC phases.]

When the idle and suspend power consumption is removed, these workload levels can be processed at the same efficiency as the execution on the workstation. The machines cannot be turned off completely with the current implementation of ReApper, as it requires low machine response times to use migration in this evaluation, which cannot be achieved with powered-off machines. However, a modified implementation could drop this requirement. Many applications do not require low machine response times, as workload levels can be predicted and the machines can be powered on in time before a migration takes place. The potential course of such a graph is illustrated in Figure 5.6a with a green dashed line. This also results in larger power savings compared to the execution with default Akka and the ReApper platform without migration, but these savings are limited to lower and medium workload levels. At low workload levels (10 s until 240 s), power consumption can be reduced by up to 10 % compared to the other graphs. The savings are smaller at medium workload levels (270 s until 610 s) and reach around 4 % to 7 % compared to default Akka. Between 780 s and 950 s, power savings of up to 14 % are reached again. From 1040 s until the end of the graph, the power consumption is reduced by 5 % to 10 % again.
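Powering a machine on ahead of a predicted migration could, for example, be done with a standard Wake-on-LAN magic packet. This helper is not part of ReApper; it is a sketch assuming that WoL is enabled in the target machine's firmware and network interface.

```python
import socket

def magic_packet(mac: str) -> bytes:
    """Build a WoL magic packet: 6 x 0xFF followed by the MAC repeated 16 times."""
    payload = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    return b"\xff" * 6 + payload * 16

def wake_machine(mac: str, broadcast: str = "255.255.255.255", port: int = 9):
    """Broadcast the magic packet via UDP to wake a suspended or powered-off host."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))
```

A scheduler that predicts the workload could call `wake_machine` early enough that the target is in the operate-ready state before the migration starts, avoiding the low-response-time requirement.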

– Latency Figure 5.7 shows the latency from the view of the client over the course of the sample run. The graph of default Akka serves as reference again. The latency of the execution with default Akka is mostly uniform over the course of the sample run and averages at 0.5 ms. The average latency of the execution with ReApper without migration rises at several points during the sample run for several minutes (e.g., 240 s – 360 s, 1020 s – 1440 s). This is caused by ReApper's reconfiguration, as power limits are set at certain workload thresholds, which cause the latency to rise. Furthermore, the latency of the execution with ReApper without migration is also generally higher than the latency of the execution with default Akka. This is caused by the power limits set by ReApper. The latency of the application's execution with migration-enabled ReApper is on average significantly higher. Especially in the earlier phases of the sample run, when the application is executed on the Odroid, the latency is particularly high compared to the other graphs. The cause for this is the Odroid's low processing-capacity limit, which results in a high average latency even at low workload levels. Another major difference to the other graphs is the occurrence of latency spikes during migration phases, which cause the latency to spike at over 20 ms during the migration from the Odroid to the NUC and at 4 ms during the migration from the NUC to the workstation. However, the migration from the workstation to the NUC does not cause a spike, but the average latency is higher after the migration. ReApper with migration has a large influence on the key–value store's latency, and ReApper without migration affects the latency as well. The impact of migration-enabled ReApper is especially large when the application is executed on the low-range device. Furthermore, migration phases cause spikes which last for the duration of the migration, but the latency returns to normal levels afterwards. ReApper without migration also causes the average latency to rise to higher levels, as reconfiguration is performed. In conclusion, ReApper causes the overall latency to rise, but it can keep up with default Akka's latency at certain workload levels. Nonetheless, the latency takes a hit when ReApper is used.

5.1.3 Data-stream Processing

The evaluation of the data-stream–processing application executed on the ReApper platform is structured like the evaluation of the key–value store. Initially, the effects of power capping on the data-stream–processing application are described. This description is divided into different graphs, one for each type of evaluated machine. Then, a sample run of the data-stream–processing application is presented. This shows the power consumption of the execution when different workload levels are applied. The latency of the data-stream–processing application was not analyzed.

Power Caps Figure 5.8 shows the power consumption of the data-stream–processing application with different power caps on the evaluation machines. Power caps have a varying degree of effectiveness on the power consumption, as some power caps are more effective than others. Moreover, the different machines offer different ranges for power capping, which further restricts their effects on a machine's power consumption. However, power caps also restrict the performance of the data-stream–processing application, as they limit the performance provided by the processor.

Figure 5.8a shows the effects of power caps on the power consumption of the workstation. The maximum workload was restricted to 320,000 tuples per second, as the network interface became a limiting factor which prevented the workstation from processing higher workload levels. In addition to setting no power cap, the power caps of 40 W and 30 W can be used to process the highest levels of workload. As in the evaluation of the effects of power capping on the key–value store, the power caps of 40 W, 30 W, and 20 W provide the highest power savings, as they are able to reduce the power consumption by around 10 W at the same level of workload compared to the next higher power cap. The lower power caps of 10 W and 5 W are not as effective as the higher caps, but they are still able to reduce the power consumption by 2 W to 4 W. However, the 5 W limit restricts the performance severely while providing little savings.


[Figure 5.8 – Power consumption of the data-stream–processing application with power caps on different devices. (a) Workstation power consumption with different power caps; (b) NUC power consumption with different power caps. Both plots show power consumption [W] over tuples [kTuples/s], for the caps no limit, 40 W, 30 W, 20 W, 10 W, and 5 W on the workstation and no limit, 10 W, 8 W, 6 W, 4 W, and 2 W on the NUC.]

This confirms the observations made in the evaluation of power capping the key–value store. Power capping the workstation machine can provide significant power savings, but not all power caps are useful, as the effectiveness of lower power caps diminishes compared to higher power caps.

Figure 5.8b shows the effects of power capping on the power consumption of the NUC device. Unlike the workstation, the NUC is not limited by its network interface; it is rather restricted by its available computing resources, which limit the highest workload to around 200,000 tuples per second. In general, power caps on the NUC are only slightly effective. Notable power caps are 10 W, 8 W, and 6 W. They lower the NUC's power consumption by 1 W to 3 W. Power caps lower than 6 W severely reduce the NUC's performance without improving the NUC's power consumption compared to higher power caps at the same workload levels. As with setting power caps on the NUC for the key–value store, power caps slightly improve the NUC's energy proportionality, but their effects are limited by the small range for power caps and the NUC's inherently low power consumption.

Sample Run Figure 5.9 contains the results of a sample run of the data-stream–processing application with different workloads. Figure 5.9a shows the power consumption during the course of the sample run. Figure 5.9b illustrates the applied workload from the data source's point of view. The red graph shows the execution of the application with default Akka on the workstation; only the power consumption of the workstation was measured for this graph. It serves as reference for the other graphs. The orange line is the result of the measurement of the application's execution with ReApper without migration on the workstation. Therefore, only the workstation's power consumption was measured for this graph as well. The blue line represents the execution of the application with migration-enabled ReApper on all devices used for the evaluation.

– Workload Profile Again, the workload profile imitates the usage levels of a service during the course of a day. The beginning of the workload profile is characterized by low usage,

[Figure 5.9 – Data-stream processing sample run with different workloads. (a) Power consumption over time in W for default Akka on the workstation, ReApper without migration on the workstation, ReApper with migration, and ReApper with migration and power-off (potential); the phases mark which device (Odroid, NUC, workstation) hosts the application. (b) Workload profile over time in kTuples/s.]

as usage is typically low at night or in the early morning. After that, the workload rises to higher levels until it reaches its peak of 240,000 tuples per second during the day, the time of highest usage. Then, the workload decreases to lower levels again, as usually experienced at night. The execution with default Akka adjusts to the applied workload the fastest, while the execution on the ReApper platform takes around 10 s before the applied workload is reached. The execution on ReApper without migration adjusts to the applied workload very fast in most cases. The workload profile of the application's execution on the ReApper platform with migration enabled also adjusts to the applied workload quickly. However, there is a small drop in tuples per second during the migration from the workstation to the NUC at 900 s. This contrasts with the workload profile of the key–value store's execution on the migration-enabled ReApper platform, which experiences throughput drops during migration phases from weaker machines to more powerful machines.

– Power Consumption In general, the power consumption during the application's execution rises when the workload increases and falls when the workload decreases again. The execution on migration-enabled ReApper is more efficient than the execution on the other platforms for low workload levels between 10 s and 240 s. The application is then moved from the Odroid to the NUC. The power consumption rises to levels above those of ReApper without migration in the following time frame until 720 s, at which point the application is moved from the NUC to the workstation. In this period, the migration-enabled execution is particularly inefficient compared to the execution with ReApper without migration. This is caused by the power consumption of the unused machines, which add their idle and suspended power consumption to the total power consumption measured during that time. Then, at 900 s, the application is migrated from the workstation back to the NUC. The power consumption of the migration-enabled execution of the application drops to levels similar to the execution on the ReApper


platform without migration. This trend continues until the end of the measurement. The power consumption of the application's execution with default Akka is consistently higher than the execution with ReApper, the exceptions being the spikes in power consumption caused by the migration of the application. This allows the ReApper platform to save up to 10 W compared to the execution with Akka. Consequently, power savings can be achieved by ReApper during the whole sample run. However, the migration of the application does not provide power savings, as the idle and suspended power consumption of the inactive machines diminishes the power savings achieved by executing the application on more energy-efficient hardware.

The situation changes drastically when the suspended and idle power consumption of inactive machines is removed from the measurement by powering those machines off. This is illustrated by the green dashed graph in Figure 5.9a. The current implementation of the migration mechanism prevents machines from being powered off, as this increases their response time. Low response times are needed by the current implementation in order to be able to migrate applications in a timely manner for this evaluation. However, this becomes less important when the course of the workload level can be predicted from previous experience. This information can then be used to power on machines in time, before the workload levels are reached that require the migration of the application. In addition to that, real executions usually last longer than the executions in the measurements for this evaluation, which provides additional time for the unused machines to be powered off. Therefore, the graph of the power consumption is constantly below the other graphs in this situation, and high power savings can thus be achieved by exploiting heterogeneous hardware. The power savings are especially large at the beginning of the measurement (10 s until 240 s), where they can reach up to 75 %. In the following segment (260 s until 600 s), the power consumption can be reduced by 16 % to 25 % compared to ReApper without migration and by 25 % to 40 % compared to default Akka. These levels of power savings are retained throughout the remaining graph compared to default Akka, except for some drops due to the migration of the application. However, between 600 s and 900 s the power consumption is equal to the power consumption of the execution with ReApper without migration. Nonetheless, after 900 s power savings of 10 % to 28 % are achieved again.
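The idea of powering machines on in time can be sketched as a small scheduling rule: given a predicted workload curve and a machine's boot time, the machine is powered on just before the workload first crosses the migration threshold. This is a hypothetical illustration of the concept, not part of ReApper; all names and numbers are made up.

```java
import java.util.List;

// Hypothetical sketch of workload-prediction-driven power-on: find the
// latest second at which a machine can be powered on so that it has
// finished booting before the predicted workload first crosses the
// migration threshold.
public final class PowerOnPlanner {

    /**
     * predictedTuplesPerSec holds one predicted workload sample per second.
     * Returns the second at which to power the machine on, or -1 if the
     * predicted workload never crosses the threshold.
     */
    public static int powerOnTime(List<Integer> predictedTuplesPerSec,
                                  int migrationThreshold,
                                  int bootTimeSec) {
        for (int t = 0; t < predictedTuplesPerSec.size(); t++) {
            if (predictedTuplesPerSec.get(t) >= migrationThreshold) {
                // Power on bootTimeSec before the crossing, but not
                // before the start of the prediction horizon.
                return Math.max(0, t - bootTimeSec);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Workload predicted to cross 200 kTuples/s at t = 300 s; a machine
        // that needs 60 s to boot is then powered on at t = 240 s.
        List<Integer> curve = new java.util.ArrayList<>();
        for (int t = 0; t < 600; t++) curve.add(t < 300 ? 100_000 : 240_000);
        System.out.println(powerOnTime(curve, 200_000, 60)); // prints 240
    }
}
```

A real planner would additionally hold machines on until the workload is predicted to stay below the threshold for long enough to amortize a power cycle.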

5.2 Discussion

There are some points and issues which were influenced by this work's implementation. Compromises were made in order to keep the scope of this work manageable. Moreover, some of the methods introduced and described in this work are not suitable for all kinds of scenarios, applications, and environments. These issues and points are discussed in this section.

The ReApper platform focuses most of its power- and energy-saving efforts on the processor's power consumption. This resulted in the implementation of methods which are used to limit the computing resources' energy consumption. However, this approach is not as effective for applications which are not demanding on the processor. These types of applications are mainly bounded by I/O devices, such as storage devices or network interfaces. Because of their vastly different nature, other complex issues and requirements need to be considered in order to lower the power consumption of these applications. These issues include controlling and targeting the correct devices of a machine, complex relationships and dependencies between data, and data locality.


Another concern is finding optimal configurations for different applications. The exact thresholds for reconfiguration and migration heavily depend on the applications themselves. The process of finding these thresholds can require a large amount of time, as the workload varies and the employed hardware components determine the number of available configurations and thresholds. This process can be performed either manually or automatically with the help of tools. However, no power savings are achieved and the application's performance might decrease while configurations are still being determined, as the available resources are not used optimally during that time.
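The core of such an automated search can be reduced to a simple selection rule: from offline measurements of which throughput each power cap sustains, pick the lowest cap that still meets a given workload level. The sketch below illustrates this idea; the measurement numbers are invented for the example and are not results from this evaluation.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of threshold selection: given measured pairs of
// (power cap in W -> sustainable throughput in tuples/s), choose the
// smallest cap that still sustains the current workload.
public final class CapSelector {

    /**
     * Returns the smallest sufficient cap, or -1 if even the highest
     * measured cap cannot sustain the workload (i.e., migration to a
     * stronger machine would be needed instead).
     */
    public static int selectCap(TreeMap<Integer, Integer> capToThroughput,
                                int workload) {
        // TreeMap iterates its keys (caps) in ascending order.
        for (Map.Entry<Integer, Integer> e : capToThroughput.entrySet()) {
            if (e.getValue() >= workload) {
                return e.getKey();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> measured = new TreeMap<>();
        measured.put(6, 80_000);   // hypothetical: 6 W sustains 80 kTuples/s
        measured.put(8, 150_000);
        measured.put(10, 200_000);
        System.out.println(selectCap(measured, 120_000)); // prints 8
    }
}
```

Filling such a table is exactly the time-consuming calibration step described above; the selection itself is cheap once the measurements exist.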

Some aspects of the implementation of ReApper are also worth discussing. Oftentimes, actors are migrated back to a former host machine after they have been migrated to another machine. Currently, migration causes the actors to be terminated on the former host machine. This results in additional costs when the actors are initialized during their migration back to a former host machine, in addition to the costs of the cleanup operations performed when the actors are terminated. These costs can be avoided by suspending actors instead of terminating them. As inactive actors do not use computation resources, the only costs caused by suspended actors are memory costs. Fortunately, the memory footprint of a suspended actor is small, because only its behavior needs to be saved. The state of an actor can be removed, as a migration back to the same machine would overwrite any saved state. This would reduce the cost of migration, which would then consist only of the state transfers; the costs for initializing the actors and their resources would be avoided. However, these savings can only be achieved with machines which have already hosted that actor before; migration to a new machine would still incur the full costs of the migration process. This mechanism was not implemented in order to keep the migration mechanism's complexity at a manageable level, but it could be used to speed up the migration of actors.
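The cost structure described above can be modeled in a few lines: a host that has already initialized an actor keeps its suspended behavior, so migrating back only requires a state transfer instead of a full re-initialization. This is a hypothetical simulation of the proposed (unimplemented) mechanism, not ReApper code; all names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of suspend-instead-of-terminate migration: track on
// which hosts each actor's behavior is already initialized and count how
// many full initializations a migration sequence incurs.
public final class MigrationCache {

    /** Hosts on which each actor's behavior is already initialized. */
    private final Map<String, Set<String>> initializedOn = new HashMap<>();
    private int initializations = 0;

    /** Migrates an actor to a host; returns true if a full init was needed. */
    public boolean migrate(String actor, String targetHost) {
        Set<String> hosts =
            initializedOn.computeIfAbsent(actor, a -> new HashSet<>());
        boolean fullInit = hosts.add(targetHost);
        if (fullInit) {
            initializations++; // first visit: behavior must be initialized
        }
        // In both cases the actor's state is transferred; on a former
        // host the suspended behavior is merely resumed.
        return fullInit;
    }

    public int initializationCount() {
        return initializations;
    }
}
```

For the common pattern workstation → NUC → workstation, this caching would cut the sequence from three initializations down to two, with only the small memory cost of the suspended behavior on the workstation.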

The high-range device used for the evaluation is not a typical representative of the high-range class, as the used workstation is considerably more energy-efficient than typical servers seen in data centers. Servers usually have a higher power consumption than a workstation machine [64], although they are also equipped with significantly more powerful hardware, which enables these servers to provide more performance as well. An example of a typical server is provided by Kliazovich et al. in [65]. Their server has a total power consumption of around 300 W: 130 W is caused by the processor, while 170 W is used for other peripheral devices. Another example is provided by Anagnostopoulou et al. in [43]. The power consumption of their evaluated server is between 200 W and 450 W; however, no specific breakdown of the power consumption is mentioned. Using a more typical server instead of a very energy-efficient workstation would significantly increase the differences between the device classes – in particular between the mid-range and high-range classes. Consequently, this would also increase the potential power savings from exploiting the different classes' properties and power consumption levels.

5.3 Related Work

This section describes two fields of work which are related to this work's approach. The first section presents work which lowers the energy consumption of applications statically by employing energy modeling and profiling. The second section shows how migration and scaling are used in other works, where other goals and requirements drive the motivation for migration and scaling. A third related field of work is described in Section 2.3.2, which covers reconfiguration as a tool for energy saving proposed by related work.


5.3.1 Energy Modeling and Profiling

One approach to decrease energy consumption is to optimize the code while retaining all functionality [66, 67]. This process is closely related to code profiling and energy modeling, where the code is analyzed before it is executed. This analysis, combined with an energy model for the execution of instructions, provides the information to decide which of several implementation options consumes less energy. However, other properties like latency behavior or execution time can be affected, as different implementations are used. Aside from that, changing the implementation is invasive and could result in large code changes. Profiling and modeling are rather static and mostly tailored to a predefined workload, while real workloads often vary greatly during an application's runtime. This work tries to adapt to those changes in a dynamic way and to adjust required resources depending on actual demand. Furthermore, the presented implementation tries to be as non-invasive as possible, as only minor code changes are required. Nonetheless, profiling provides useful information to lower an application's baseline energy consumption and can be used orthogonally to this work's presented approach.

5.3.2 Migration and Scaling

Scaling by means of actor migration was also proposed by Imai et al. in [68]. This approach is closely related to this work's implementation. However, their focus was to develop a general-purpose migration mechanism for a cloud environment. Their goal is to reach a higher average load of VMs while avoiding overload of a single VM. Even though this goal is related to this work's goal, energy consumption was not a concern in their work. Furthermore, they used SALSA [13] as the basis of their work. This work diverges by using Akka, which sees use in production systems and provides higher performance than SALSA. By combining scaling through actor migration with machine configuration, this work adds an energy aspect to scaling, which can ultimately be used to lower the energy consumption of applications. Newell et al. present a mechanism to migrate Orleans actors in order to improve latency and performance [25, 69]. They automatically place frequently communicating actors on one machine to decrease the communication overhead and thereby improve overall application performance. A similar mechanism is also implemented by Imai et al. [68]. Such an automated migration mechanism could be extended with energy awareness so that not only latency is reduced and performance increased, but energy consumption is considered as well. However, such a mechanism is beyond the scope of this work. Nonetheless, it would make energy-aware actor placement even more accessible.


6 Conclusion

This work presents the ReApper platform, which provides an energy-aware execution environment. Configuration and information monitoring are used in order to adjust the resource usage of applications to their workloads. The ReApper platform allows energy-unaware applications to become energy-aware, although some modifications are required in order to acquire this property. The actor model serves as the programming paradigm for applications, as it is used to enhance the platform's configuration with a migration mechanism for applications. This migration mechanism allows the platform to provide additional configuration methods. Moreover, it is used to exploit the different properties of heterogeneous hardware, as actors are moved to more energy-efficient hardware at certain workload levels. Thus, this work's ReApper platform utilizes migration for energy purposes instead of performance or scaling, which are the common purposes for migration mechanisms.

Configuring the resources of a machine can lead to massive power savings compared to the execution without resource configuration. This enables improvements in energy efficiency and better energy proportionality. However, adding migration has mixed results. Large power savings can be made at lower workload levels, but power savings at medium and higher workload levels prove to be more difficult to achieve. Nonetheless, there is still potential for power savings in using heterogeneous hardware. Improvements to the implementation and powering off machines when they are not used can unlock this potential, which would in turn lower the overall power consumption even further. This is particularly interesting when machines have large differences in their properties and features, such as static and dynamic power consumption, as seen with the low-range class compared to the other two classes. Consequently, finding other devices that fit into this work's proposed device classes would probably enable higher power savings without losing performance or impairing other qualities.

Additionally, determining the need for migration could also be based on workload predictions and past workload trends instead of solely relying on current workload levels. This would remove the need for hot-standby resources, which would also lower the power costs caused by those unused resources. Furthermore, more configuration methods and information sources could allow more fine-grained control of a machine's resources and improve the analysis and estimation of a machine's status. Apart from configuration and information, using other factors than workload levels as the application's main metric would shift the optimization efforts to other dimensions of the configuration, which could allow more aggressive power savings. Finally, focusing on resources other than the processor would make the approach proposed by this work more useful for applications which are not bound by the CPU.


List of Acronyms

JVM Java Virtual Machine

CPU Central processing unit

TCP Transmission Control Protocol

RAPL Running Average Power Limit

PCM Performance counter monitor

OSS Open source software

API Application programming interface

PID Process identifier

URL Uniform resource locator

PSU Power supply unit

MSR Model-specific register

APM Application Power Management

TDP Thermal design power

VM Virtual machine

QoS Quality of service

DVFS Dynamic voltage and frequency scaling

NUC Next Unit of Computing

IPC Instructions per clock

IC Integrated circuit

OS Operating system


List of Figures

2.1 Schematic of an actor's behavior
2.2 Message delivery in an actor system
2.3 Structure of Akka components
2.4 Exemplary actor hierarchy
2.5 Akka actor paths, references, and addresses
2.6 Migration on VM and application level

3.1 Architecture of configuration and information monitoring

4.1 Actor lifecycle
4.2 Akka remote path
4.3 Structure of ReApper Akka
4.4 Overview of the distributed ReApper platform
4.5 Configurator architecture
4.6 Executor initialization
4.7 RAPL domains
4.8 MSR for RAPL power limiting
4.9 Gatherer architecture
4.10 MSR for RAPL energy status
4.11 Power measuring device
4.12 Components for actor migration
4.13 Actor migration process

5.1 Migration process in detail
5.2 Key–value store structure
5.3 Data-stream processing structure
5.4 Key–value store power consumption with power caps
5.5 Key–value store latency with power caps
5.6 Key–value store sample run with different workloads
5.7 Key–value store sample run latency
5.8 Power consumption of the data-stream–processing application with power caps
5.9 Data-stream processing sample run with different workloads


List of Tables

2.1 Device classes

4.1 Explanation of MSRs

5.1 Heterogeneous hardware


References

[1] Arman Shehabi et al. “United States Data Center Energy Usage Report” (2016).

[2] James Hamilton. “Cooperative Expendable Micro-slice Servers (CEMS): Low Cost, Low Power Servers for Internet-scale Services.” In Proceedings of the Conference on Innovative Data Systems Research. 2009.

[3] Rajkumar Buyya, Anton Beloglazov, and Jemal Abawajy. “Energy-efficient Management of Data Center Resources for Cloud Computing: A Vision, Architectural Elements, and Open Challenges.” In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications. 2010.

[4] Lizhe Wang and Samee Ullah Khan. “Review of Performance Metrics for Green Data Centers: A Taxonomy Study.” The Journal of Supercomputing 63 (2013).

[5] Philipp Haller and Stephen Tu. The Scala Actors API. URL: http://docs.scala-lang.org/overviews/core/actors.html (visited on 12/30/2016).

[6] Edward A. Lee. “The Problem with Threads” (2006).

[7] Herb Sutter and James Larus. “Software and the Concurrency Revolution” (2005).

[8] Carl Hewitt. “Viewing Control Structures as Patterns of Passing Messages.” Artificial Intelligence (1977).

[9] Carl Hewitt. “Actor Model of Computation: Scalable Robust Information Systems” (2010).

[10] Gul A. Agha. Actors: A Model of Concurrent Computation in Distributed Systems. Tech. rep. 1985.

[11] Gul A. Agha et al. “A Foundation for Actor Computation.” Journal of Functional Programming (1997).

[12] Gul A. Agha, Chris Houck, and Rajendra Panwar. “Distributed Execution of Actor Programs.” In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing. 1991.

[13] Carlos A. Varela and Gul A. Agha. “Programming Dynamically Reconfigurable Open Systems with SALSA.” ACM SIGPLAN Notices (2001).

[14] Rajesh K. Karmani, Amin Shali, and Gul A. Agha. “Actor Frameworks for the JVM Platform: A Comparative Analysis.” In Proceedings of the 7th International Conference on Principles and Practice of Programming in Java. ACM. 2009.

[15] Sergey Bykov et al. Orleans. Virtual Actors. URL: https://www.microsoft.com/en-us/research/project/orleans-virtual-actors/.

[16] Lightbend Inc. Akka Toolkit. 2011. URL: http://akka.io/ (visited on 12/30/2016).


[17] Jonas Bonér. Akka Actor Kernel. RESTful, Distributed, Persistent, Transactional Actors. 2009. URL: https://github.com/akka/akka/tree/v0.5 (visited on 12/30/2016).

[18] Jonas Bonér. Introducing Akka. Simpler Scalability, Fault-Tolerance, Concurrency and Remoting through Actors. 2010. URL: http://jonasboner.com/introducing-akka/ (visited on 12/30/2016).

[19] Philipp Haller. “On the Integration of the Actor Model in Mainstream Technologies: The Scala Perspective.” In Proceedings of the 2nd Edition on Programming Systems, Languages and Applications based on Actors, Agents, and Decentralized Control Abstractions. 2012.

[20] Lightbend Inc. Akka Documentation. URL: http://doc.akka.io/docs/akka/current/ (visited on 12/30/2016).

[21] Lightbend Inc. What is an Actor? URL: http://doc.akka.io/docs/akka/current/general/actors.html (visited on 12/30/2016).

[22] Lightbend Inc. Actors. Actor Lifecycle. URL: http://doc.akka.io/docs/akka/current/java/untyped-actors.html (visited on 12/30/2016).

[23] Lightbend Inc. Actor References, Paths, and Addresses. URL: http://doc.akka.io/docs/akka/current/general/addressing.html (visited on 12/30/2016).

[24] Lightbend Inc. Remoting. Lifecycle and Failure Recovery Model. URL: http://doc.akka.io/docs/akka/current/java/remoting.html (visited on 12/30/2016).

[25] Sergey Bykov et al. “Orleans: Cloud Computing for Everyone.” In Proceedings of the 2nd ACM Symposium on Cloud Computing. 2011.

[26] Vert.x. URL: http://vertx.io/ (visited on 02/10/2017).

[27] Reactor. URL: http://projectreactor.io/ (visited on 02/10/2017).

[28] Lightbend Inc. Play Framework. URL: https://www.playframework.com/ (visited on 02/10/2017).

[29] The Apache Software Foundation. Apache Spark. URL: http://spark.apache.org/ (visited on 02/10/2017).

[30] Urs Hölzle and Luiz André Barroso. “The Datacenter as a Computer” (2009).

[31] Luiz André Barroso and Urs Hölzle. “The Case for Energy-proportional Computing.” Computer 40 (2007).

[32] Alon Naveh et al. “Power and Thermal Management in the Intel® Core Duo Processor.” Intel Technology Journal 10 (2006).

[33] Ed Grochowski and Murali Annavaram. “Energy per Instruction Trends in Intel Microprocessors” (2006).

[34] Per Hammarlund et al. “Haswell: The Fourth-generation Intel Core Processor.” IEEE Micro (2014).

[35] Anant Deval, Avinash Ananthakrishnan, and Craig Forbell. “Power Management on 14 nm Intel® Core-M Processor.” In Proceedings of the 18th IEEE Symposium in Low-Power and High-Speed Chips. 2015.

[36] Daniel Hackenberg et al. “Power Measurement Techniques on Standard Compute Nodes: A Quantitative Comparison.” In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 2013.


[37] Daniel Hackenberg et al. “An Energy Efficiency Feature Survey of the Intel Haswell Processor.” In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshop. 2015.

[38] Advanced Micro Devices, Inc. AMD FX Processors Unleashed. A Guide to Performance Tuning with AMD OverDrive and the New AMD FX Processors. URL: https://www.amd.com/Documents/AMD_FX_Performance_Tuning_Guide.pdf (visited on 02/10/2017).

[39] Karthick Rajamani et al. “Application-aware Power Management.” In Proceedings of the IEEE International Symposium on Workload Characterization. 2006.

[40] David Lo et al. “Towards Energy Proportionality for Large-scale Latency-critical Workloads.” ACM SIGARCH Computer Architecture News. 2014.

[41] Peter E. Bailey et al. “Adaptive Configuration Selection for Power-constrained Heterogeneous Systems.” In Proceedings of the 43rd International Conference on Parallel Processing. 2014.

[42] Kenan Liu, Gustavo Pinto, and Yu David Liu. “Data-oriented Characterization of Application-level Energy Optimization.” In Proceedings of the International Conference on Fundamental Approaches to Software Engineering. 2015.

[43] Vlasia Anagnostopoulou, Martin Dimitrov, and Kshitij A. Doshi. “SLA-guided Energy Savings for Enterprise Servers.” In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 2012.

[44] David Lo et al. “Heracles: Improving Resource Efficiency at Scale.” ACM SIGARCH Computer Architecture News. 2015.

[45] Akshat Verma, Puneet Ahuja, and Anindya Neogi. “Power-aware Dynamic Placement of HPC Applications.” In Proceedings of the 22nd Annual International Conference on Supercomputing. 2008.

[46] Akshat Verma, Puneet Ahuja, and Anindya Neogi. “pMapper: Power and Migration Cost Aware Application Placement in Virtualized Systems.” In Proceedings of the ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing. 2008.

[47] Andreas Merkel and Frank Bellosa. “Balancing Power Consumption in Multiprocessor Systems.” ACM SIGOPS Operating Systems Review. 2006.

[48] Constantine P. Sapuntzakis et al. “Optimizing the Migration of Virtual Computers.” ACM SIGOPS Operating Systems Review (2002).

[49] Kyong Hoon Kim, Anton Beloglazov, and Rajkumar Buyya. “Power-aware Provisioning of Cloud Resources for Real-time Services.” In Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science. 2009.

[50] Vivek Shrivastava et al. “Application-aware Virtual Machine Migration in Data Centers.” In Proceedings of the IEEE INFOCOM. 2011.

[51] Timothy Wood et al. “Black-box and Gray-box Strategies for Virtual Machine Migration.” In Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation. 2007.

[52] Etienne Le Sueur and Gernot Heiser. “Dynamic Voltage and Frequency Scaling: The Laws of Diminishing Returns.” In Proceedings of the International Conference on Power-aware Computing and Systems. 2010.


[53] Oracle Corporation. ForkJoinPool API Documentation. URL: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ForkJoinPool.html (visited on 02/08/2017).

[54] Oracle Corporation. ThreadPoolExecutor API Documentation. URL: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html (visited on 02/08/2017).

[55] Ashok Raj. CPU Hotplug Support in Linux(tm) Kernel. URL: https://www.kernel.org/doc/Documentation/cpu-hotplug.txt (visited on 02/06/2017).

[56] Martin Dimitrov. Intel Power Governor. 2012. URL: https://software.intel.com/en-us/articles/intel-power-governor (visited on 01/16/2017).

[57] Intel Corporation. “Intel 64 and IA-32 Architectures Developer’s Manual: Vol. 3B” (2016).

[58] Oracle Corporation. Java Native Interface. URL: https://docs.oracle.com/javase/8/docs/technotes/guides/jni/ (visited on 02/08/2017).

[59] Thomas Wilhalm, Roman Dementiev, and Patrick Fay. Intel Performance Counter Monitor. A Better Way to Measure CPU Utilization. 2010. URL: http://www.intel.com/software/pcm (visited on 02/11/2017).

[60] Microchip Technology Inc. MCP39F501. URL: http://www.microchip.com/wwwproducts/en/MCP39F501 (visited on 02/11/2017).

[61] Oracle Corporation. Operating System Bean API Documentation. URL: https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html (visited on 02/11/2017).

[62] Sergei Gorlatch, Tim Humernbrum, and Frank Glinka. “Improving QoS in Real-time Internet Applications: From Best-effort to Software-defined Networks.” In Proceedings of the International Conference on Computing, Networking and Communications. 2014.

[63] Sergei Gorlatch and Tim Humernbrum. “Enabling High-level QoS Metrics for Interactive Online Applications using SDN.” In Proceedings of the International Conference on Computing, Networking and Communications. 2015.

[64] Scott Barielle. Calculating TCO for Energy. URL: http://www.ibmsystemsmag.com/mainframe/Business-Strategy/ROI/energy_estimating/?page=1 (visited on 03/26/2017).

[65] Dzmitry Kliazovich et al. “GreenCloud: A Packet-level Simulator of Energy-aware Cloud Computing Data Centers.” In Proceedings of the Global Telecommunications Conference. 2010.

[66] Jeonghwan Choi et al. “Profiling, Prediction, and Capping of Power Consumption in Consolidated Environments.” In Proceedings of the IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems. 2008.

[67] Ricardo Koller, Akshat Verma, and Anindya Neogi. “WattApp: An Application Aware Power Meter for Shared Data Centers.” In Proceedings of the 7th International Conference on Autonomic Computing. 2010.

[68] Shigeru Imai, Thomas Chestna, and Carlos A. Varela. “Elastic Scalable Cloud Computing using Application-level Migration.” In Proceedings of the IEEE 5th International Conference on Utility and Cloud Computing. 2012.

[69] Andrew Newell et al. “Optimizing Distributed Actor Systems for Dynamic Interactive Services.” In Proceedings of the 11th European Conference on Computer Systems. 2016.
