A Deep Reinforcement Learning
Approach for Data Migration in
Multi-access Edge Computing
26-28 November
Santa Fe, Argentina
Dario Bruneo
University of Messina, Italy
Smart services in Smart environments
• IoT diffusion
• Cloud-based approach
– What about latency?
Multi-access Edge Computing (MEC)
• ETSI standard
• placing nodes with computation capabilities (MEC servers) close to the network edge
• MEC Vs. Fog Computing
– explicit interaction with network elements
– (network) information gathering
A 5G MEC-enabled LTE scenario
Challenges
• Resource allocation
• Application migration (App Containerization)
• Proactive Vs. Reactive approaches
AI-based techniques: Machine Learning
Reinforcement Learning (RL)
• Learning through a trial-and-error process
• Well suited to sequential decision-making problems
• Markov Decision Process (MDP) formalism
• Q-Learning (model free approach)
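The trial-and-error loop behind Q-learning can be sketched in a few lines. This is a minimal, illustrative tabular version; the grid-world states, the reward value, and the hyperparameters below are arbitrary assumptions, not taken from the slides:

```python
import random

# Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = ["up", "down", "left", "right"]
Q = {}  # maps (state, action) -> estimated return; model-free, no transition model

def q(s, a):
    return Q.get((s, a), 0.0)

def choose_action(s):
    # epsilon-greedy: explore with probability EPSILON, otherwise exploit
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(s, a))

def update(s, a, r, s_next):
    # one trial-and-error step: move the estimate toward the observed return
    best_next = max(q(s_next, a2) for a2 in ACTIONS)
    Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * best_next - q(s, a))

# one fictitious transition: taking "up" in state (3, 1) yields reward -0.04
update((3, 1), "up", -0.04, (3, 2))
```

Because the table is empty at the start, this first update simply stores a fraction of the observed reward; repeated episodes propagate value backwards through the state space.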
[Slide background: Russell & Norvig, Artificial Intelligence: A Modern Approach (Prentice Hall), Ch. 17 "Making Complex Decisions", Figure 17.2 — (a) an optimal policy for the stochastic environment with R(s) = −0.04 in the nonterminal states; (b) optimal policies for four different ranges of R(s).]
Transition model of the uncertain system: probability 0.8 of moving in the intended direction, 0.1 for each perpendicular direction.
Deep RL
• Q-learning does not converge when the number of states is too large (e.g., 10^20)
• Deep RL introduced by DeepMind
• Using a Deep Neural Network (DNN) to predict the Q-values for a given state
[Diagram: DNN mapping the system state (input) to Q-values for each action (output)]
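The idea of replacing the Q-table with a network — state in, one Q-value per action out — can be mocked up with a toy forward pass. The layer size and the random weights here are illustrative stand-ins (a real DNN would be trained); only the input dimension 21 and the 9 actions match the scenario described later in the deck:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: 21-element state vector in, 9 Q-values out.
# Sizes match the MEC scenario (21-tuple state, 3 servers x 3 apps = 9 actions);
# the weights are random placeholders, not a trained model.
STATE_DIM, HIDDEN, N_ACTIONS = 21, 32, 9
W1 = rng.normal(size=(STATE_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.normal(size=(HIDDEN, N_ACTIONS)) * 0.1
b2 = np.zeros(N_ACTIONS)

def q_values(state):
    """Forward pass: predict the Q-values of all actions in one shot."""
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2

state = rng.normal(size=STATE_DIM)
best_action = int(np.argmax(q_values(state)))  # greedy action index
```

The point of the architecture is that one network evaluation replaces a lookup over an intractably large table.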
Applying Deep RL to MEC LTE scenarios
machine learning frameworks like TensorFlow, Theano, and others. The main feature of this library is its simplicity, which allows building complex neural network topologies with just a few lines of code in a scikit-learn fashion, while keeping the power of the neural network engine that runs underneath. Using Keras, we built a feedforward fully connected deep neural network composed of n hidden layers between the input layer, whose dimension is given by the cardinality ‖T‖ of the state tuple, and the output layer, whose dimension is given by the cardinality ‖Z‖ of the action set. In order to integrate the two systems, which run in Python and C++ environments respectively, we implemented a mechanism that lets them communicate using text files. With reference to Fig. 2, the deep RL engine waits for the generation of the files containing the current state of the system; once it receives the data, it generates as output a text file which contains the action to execute on the simulator. On the OMNeT++ side, we used an asynchronous timer which periodically checks for the availability of the action file; as soon as the file is available, the simulator reads the action code and changes the server destination address for the UEs that are running the application indicated in the action, thus emulating the data migration of the application from one server to another. After the action execution, the RL agent observes the reward, obtained as a combination of several performance indexes provided by the OMNeT++ simulator, checking whether the action performed has increased it or not. The reward in this sense is used by the agent as feedback which helps it understand whether the executed action is a valid choice in that specific system state.
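The Python half of this file-based handshake can be sketched as follows. File names, the state encoding (whitespace-separated numbers), and the timeout logic are illustrative assumptions, not details from the paper:

```python
import os
import time

# File-based IPC between the deep RL engine (Python) and OMNeT++ (C++).
# Names and formats below are assumptions for illustration.
STATE_FILE, ACTION_FILE = "state.txt", "action.txt"

def wait_for_state(directory, poll=0.1, timeout=5.0):
    """Block until the simulator has written the current-state file."""
    path = os.path.join(directory, STATE_FILE)
    deadline = time.time() + timeout
    while not os.path.exists(path):
        if time.time() > deadline:
            raise TimeoutError("no state file from the simulator")
        time.sleep(poll)
    with open(path) as f:
        state = [float(x) for x in f.read().split()]
    os.remove(path)  # consume the file so the next step starts clean
    return state

def write_action(directory, action_code):
    """Publish the chosen action; the simulator's timer polls for this file."""
    tmp = os.path.join(directory, ACTION_FILE + ".tmp")
    with open(tmp, "w") as f:
        f.write(str(action_code))
    # atomic rename so the polling side never reads a half-written file
    os.replace(tmp, os.path.join(directory, ACTION_FILE))
```

Writing to a temporary file and renaming it avoids the race where the asynchronous timer on the OMNeT++ side fires while the action file is only partially written.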
5. RESULTS
In this section, we present a preliminary scenario that we built to test the feasibility of the system, in which we only consider the presence of MEC servers, without the possibility to use the Cloud. Fig. 4 shows the structure of the network, composed of three eNBs, a set of devices with K = 9, a set of MEC servers with N = 3, a set of applications with M = 3, and a set of actions with Z = ‖N‖ · ‖M‖, where each action corresponds to the migration of an app taken from the Apps set to a server taken from the MEC set. The data rate provided by the cables which connect the eNBs is equal to 10 Gbps, except for the ones that connect the routers to the PGW, where the data rate is 3 Mbps to emulate traffic congestion, thus creating a realistic scenario in which we can test the performance of our algorithm. On the OMNeT++ side, it is possible to set several parameters for the simulation by using a configuration file called omnetpp.ini; since the number of parameters to set is very large, we summarize them in a table.
Table 1 shows the main parameters we set for the simulation. We consider a total of nine users who follow a random mobility pattern, moving at a speed of 1.5 mps, which is a fairly good approximation of human walking speed. With respect to the applications, we consider three constant bit rate (CBR) applications, each of which can be run by only one MEC server at a time. As already said at the beginning of this section, our goal is to test the feasibility of the technique
Configuration Parameters
Number of users: 9
User mobility: RandomWayPointMobility
User speed: 1.5 mps
Number of applications: 3
Application type: UDP ConstantBitRate
Packet size: 1500 B
Simulation Time: 420 seconds
Table 1 – omnetpp.ini configuration.
we are proposing; for this reason, we decided to realize a scenario with a manageable number of users and applications, in order to keep the training time of the DNN low. In such a context, the state which defines the network can be expressed as follows:
Figure 4 – OMNeT++/SimuLTE simulation scenario.
s = < UE_eNB1, UE_eNB2, UE_eNB3,
      eNB1_app1, eNB1_app2, eNB1_app3,
      eNB2_app1, eNB2_app2, eNB2_app3,
      eNB3_app1, eNB3_app2, eNB3_app3,
      Mec1_app1, Mec1_app2, Mec1_app3,
      Mec2_app1, Mec2_app2, Mec2_app3,
      Mec3_app1, Mec3_app2, Mec3_app3 >    (10)
where:
• UE_eNBj represents the number of devices connected to the j-th eNB;
• eNBj_appk represents the number of devices which are running the k-th application in the j-th eNB;
• MECi_appk is a boolean flag that indicates if the i-th MEC server is running the k-th application.
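The 21 components of Eq. (10) — 3 per-eNB user counts, 9 per-eNB app counts, and 9 MEC boolean flags — can be flattened into the vector fed to the DNN. The helper below and its argument layout are an illustrative sketch, not the paper's code:

```python
# Flatten the state of Eq. (10): 3 per-eNB user counts, 3x3 per-eNB app
# counts, and 3x3 boolean MEC flags -> a 21-element vector.
N_ENB, N_MEC, N_APP = 3, 3, 3

def build_state(ue_per_enb, enb_app_counts, mec_app_flags):
    """ue_per_enb: [UE_eNB1..UE_eNB3]; enb_app_counts[j][k]: devices running
    app k+1 on eNB j+1; mec_app_flags[i][k]: 1 if MEC i+1 runs app k+1."""
    state = list(ue_per_enb)
    for j in range(N_ENB):
        state.extend(enb_app_counts[j])
    for i in range(N_MEC):
        state.extend(mec_app_flags[i])
    return state

# example: 9 users split 4/3/2 across the eNBs, each app on one MEC server
s = build_state([4, 3, 2],
                [[2, 1, 1], [1, 1, 1], [1, 0, 1]],
                [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
```

The resulting vector has exactly ‖T‖ = 21 entries, matching the input dimension of the network described in Section 4.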
With respect to the reward, we first defined as a QoS performance index the percentage of received data corresponding to the i-th application app_i as:
D_appi = ReceivedTHR / (Σ SentTHR · packetSize)    (11)
where the sum is extended to all the UEs that run the i-th application. In particular, we evaluated the average of the
State: user position and app distribution
Actions: app migration
Reward: combination of network performance indexes
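The reward summarized above can be sketched from the per-app QoS index of Eq. (11) — the fraction of sent data actually received. How the per-app indexes are combined is not spelled out beyond "combination of performance indexes", so the plain average below is an assumption:

```python
def received_fraction(received_bytes, sent_packets, packet_size=1500):
    """QoS index D_app (Eq. 11): received data over the data sent by all
    UEs running the app; sent traffic is counted in packets of packet_size."""
    sent_bytes = sum(sent_packets) * packet_size
    return received_bytes / sent_bytes if sent_bytes else 0.0

def reward(per_app_stats):
    """Combine the per-application indexes into one scalar reward.
    Plain averaging is an assumption; the paper only says 'combination'."""
    d = [received_fraction(r, s) for r, s in per_app_stats]
    return sum(d) / len(d)

# three apps: (received bytes, packets sent per UE running the app)
r = reward([(3_000_000, [1000, 1000]),   # app1: all sent data received
            (1_500_000, [1000, 1000]),   # app2: half received
            (750_000,   [1000])])        # app3: half received
```

With these made-up numbers the reward is (1 + 0.5 + 0.5) / 3 = 2/3; an action that relieves the congested 3 Mbps links would push the index toward 1.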
N: set of MEC/Cloud servers
M: set of Applications
D: percentage of received data
The proposed algorithm
1. Init
2. Action selection
3. Action execution
4. DNN training
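The loop above can be sketched as a minimal training skeleton. Every component here — the q_net and env objects and their methods — is an illustrative stub standing in for the real DNN, simulator, and training logic:

```python
import random

def run_episode(n_steps, q_net, env, epsilon=0.1):
    """Init -> (action selection -> action execution -> DNN training) loop."""
    state = env.reset()                        # 1. Init
    for _ in range(n_steps):
        qs = q_net.predict(state)              # 2. Action selection:
        if random.random() < epsilon:          #    epsilon-greedy over the
            action = random.randrange(len(qs)) #    predicted Q-values
        else:
            action = max(range(len(qs)), key=qs.__getitem__)
        next_state, reward = env.step(action)  # 3. Action execution
        q_net.train(state, action, reward, next_state)  # 4. DNN training
        state = next_state
```

In the actual system, env.step would hand the action to OMNeT++ through the text-file mechanism and read back the resulting state and reward.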
Creating a Deep RL environment for MEC
MEC-LTE environment
• OMNeT++ – INET
– SimuLTE MEC extension
Deep RL engine
• Keras on top of TensorFlow
– build complex neural network topologies with just a few lines of code
– keeping the power of the neural network engine that runs underneath
OMNeT++ (C++) and Keras (Python) have been integrated by implementing a mechanism to let them communicate using text files
Experimental results
• 3 eNBs
• 9 UE
• 3 MEC Servers
• 3 Applications (CBR)
• Random walk (walking speed)
• Training for 25,000 simulation seconds
[Plot: percentage of received data D vs. simulation time (sec), comparing "No policy" and "Deep RL"]
Comparison with a «static» policy where no App migration is performed:
The Deep RL algorithm is able to promptly react to wrong actions
The Deep RL algorithm outperforms the «static» policy
Conclusions and Future Work
• We presented a machine learning approach to address the dynamics of the network environment in a 5G MEC-enabled LTE scenario
• We designed a Deep RL algorithm and tested it in a realistic simulated scenario, demonstrating the feasibility of the technique
• Future work will be devoted to:
– better integration between OMNeT++/SimuLTE and Keras/TensorFlow
– analysis of more complex scenarios
– comparison with other solutions
Thank you