
A reinforcement learning algorithm for building collaboration in multi-agent systems

Mehmet E. Aydin*
Department of Computer Science and Creative Technologies
University of the West of England
Frenchay Campus, Bristol, UK
[email protected]

Ryan Fellows
Department of Computer Science and Creative Technologies
University of the West of England
Frenchay Campus, Bristol, UK
[email protected]

Abstract— This paper presents a proof-of-concept study demonstrating the viability of building collaboration among multiple agents through a standard Q learning algorithm embedded in particle swarm optimisation. Collaboration is formulated to be achieved among the agents via a form of competition, where the agents are expected to balance their actions in such a way that none of them drifts away from the team and none intervenes in any fellow neighbour's territory. Particles are devised with a Q learning algorithm for self-training, learning how to act as members of a swarm and how to produce collaborative/collective behaviours. The produced results support the algorithmic structures, suggesting that a substantive collaboration can be built via the proposed learning algorithm.

I. INTRODUCTION

Cutting-edge technologies facilitate the daily life of individuals and societies with more opportunities to overcome challenging issues, continuously introducing new smart gadgets day in, day out. These technologies introduce changes by making use of smart sensors most of the time, which play a crucial role in our daily life as they are literally everywhere now. The Internet of Things (IoT) is one of the key technologies for organising smart sensors in order to furnish living environments with more and more services, such as smart homes and cities, highly efficient engineering products, and crews/swarms of robots. A particular example is a swarm of unmanned aerial vehicles (UAVs) teamed up to collect information from disaster areas to predict/discover and help identify the impact of damage and the level of human suffering. Information collection plays a very crucial role in disaster management, where decisions are required to be made in a timely manner and based on correct and up-to-date information. Swarms of UAVs can be devised for this purpose and are expected to remain interconnected all the time to deliver their duties collaboratively [3]. Obviously, this is a typical implementation area of IoT, where smart sensors and tiny devices, drones (UAVs) in this case, require efficient and robust settings and configuration. However, an efficiently exploring swarm is not easy to design and run due to various practical issues such as energy limitations. This paper introduces a novel learning algorithm to train individual devices to become smartly behaving and collaborating entities.

*Corresponding author


Multi-agent systems (MAS) are an up-to-date artificial intelligence paradigm which attracts much attention for modelling intelligent solutions in a rather distributed form. It involves formulating limited-capacity items as proactive and smart entities, which autonomously act and accumulate experience to exploit later in fulfilling duties more and more efficiently. In this way, a more comprehensive and collective intelligence can be achieved. This paradigm has proved successful many times across a wide problem-solving horizon [4], [19], [6], [24], which suggests that developing IoT models using the MAS paradigm can produce substantial benefit and efficiency. However, building collaboration among multiple agents remains challenging, since MAS studies have not reached a sufficient level of maturity due to the difficulty in the nature of the problem. The remaining parts of this paper introduce a novel learning algorithm implemented for multiple-agent models, where collaboration is to be constructed among the participating agents via the introduced algorithm. It is a reinforcement learning algorithm, which best fits real-time learning cases and dynamically changing environments. The individual agents are expected to learn from past experiences how to stay interconnected and remain a crew that collectively fulfils its duties without wasting resources. The latter purpose forces the individual agents to compete for higher rewards throughout the entire process, which makes the study further important, since collaboration has to be achieved while competing. Previously, a competition-based collective learning algorithm has been attempted with learning classifier systems for modelling social behaviours [17]. Although there are many other studies on collective learning of multiple agents with Q learning [14], [28], the proposed algorithm implements a competition-based collective learning algorithm extending Q learning with the notion of individuals and their positions in the particle swarm optimisation (PSO) algorithm, which ends up as Q learning embedded in PSO.

The rest of the paper is structured as follows: the background and literature review are presented in Section II, the proposed reinforcement learning algorithm is introduced in Section III, the implementation of the algorithm for scanning fields is elaborated in Section IV,


experimental results and discussions are detailed in Section V, and finally conclusions are drawn in Section VI.

II. BACKGROUND

A. Swarm Intelligence

Swarm intelligence refers to artificial intelligence (AI) systems in which intelligent behaviour can emerge as the self-organised outcome of a collection of simple entities such as agents, organisms or individuals. Simple organisms that live in colonies, such as ants, bees and bird flocks, have long fascinated many people with their collective intelligence and emergent behaviours, manifested in many of the activities they perform. A population of such simple entities can interact with each other as well as with their environment without following any set of instructions, and thereby compose a swarm intelligence system [21].

Swarm intelligence approaches aim to reveal the collective behaviour of social insects in performing specific duties; they model the behaviour of those social insects and use these models as a basis upon which varieties of artificial entities can be developed. In this way, problems can be solved by models that exploit the problem-solving capabilities of social insects. The motivation is to model the simple behaviours of individuals and their local interactions with the environment and neighbouring individuals, in order to obtain more complex behaviours that can be used to solve complex problems, mostly optimisation problems [11], [31].

B. Reinforcement Learning

Reinforcement learning (RL) is a class of learning in which unsupervised learning rules work alongside a reinforcement mechanism to reward an agent based on its action selection in response to stimuli from its environment. It can also be regarded as semi-supervised learning, since the agent receives a reinforcement signal, either immediate or delayed, fed back from the environment. Let Λ be an agent working in environment E, which stimulates Λ with its state s ∈ S, where S is the finite set of states of E. The agent Λ evaluates this perceived state and decides to select an action a ∈ A, where A is the finite set of actions the agent can take. Meanwhile, the reinforcement mechanism, which may also be called the reward function, assesses the action a taken by Λ in response to state s and produces a reward r that is fed back to Λ. Here, the ultimate aim of the agent Λ is to maximize its accumulated reward by the end of the learning period/process, as in the following expression:

max R = ∑_{i=1}^{∞} ri (1)

where ∞ is practically replaced with a finite number, such as I, the total number of learning iterations. Although an agent is theoretically expected to function forever, it usually works for a predefined time period as a matter of practicality.
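For concreteness, the expression above corresponds to the usual agent-environment interaction loop, sketched below in Python. The env and agent objects, with their reset/step and select_action/observe methods, are hypothetical interfaces of our own, not anything defined in the paper.

def run_episode(env, agent, num_iterations=100):
    """Accumulate the reward R of Eq. (1) over a finite horizon I = num_iterations."""
    total_reward = 0.0
    state = env.reset()                      # initial state s in S
    for _ in range(num_iterations):          # I stands in for the infinite sum
        action = agent.select_action(state)  # choose a in A
        state, reward = env.step(action)     # environment returns the new state and reward r
        agent.observe(reward, state)         # reinforcement signal fed back to the agent
        total_reward += reward               # R = sum of r_i
    return total_reward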

There are various reinforcement learning methods with various properties. Among these, Q learning [36], [33], TD learning [7], [32], and learning classifier systems [8], [9] are well-known reinforcement learning approaches.

C. Collaboration in multi agent systems

Multi-agent systems (MAS) are well-known and relatively mature distributed collective intelligence approaches with which a set of proactive agents act individually to solve problems in collaboration [2]. The main theme is to team up intelligent autonomous entities to solve problems in harmony, composing a certain level of coordination that helps individual agents act proactively and efficiently, contributing and collaborating in the problem-solving process while demonstrating individual intelligence capacity [4]. It is useful to note that the main properties of MAS (i.e. autonomy, responsiveness, redundancy, and a distributed approach) facilitate success in MAS applications, which has resulted in a good record of implementations within many research fields including production planning, scheduling and control [27], engineering design, and process planning [2].

The concept of metaheuristic agents has recently been introduced to describe a particular implementation of multi-agent systems devised to tackle hard optimisation problems. The idea is to build up teams of individual agents equipped with metaheuristic problem solvers, aiming to solve hard and large-scale problems with distributed and collaborative intelligent search skills. In the literature, a few multi-agent systems implementing metaheuristics have been introduced and reviewed with respect to their performance [6], [16], while metaheuristic approaches are, by and large, used as standalone applications.

Researchers are aware that solving complex and large problems with distributed approaches remains a challenging issue, since there is no commonly used, productive method for organising distributed intelligence (agents in this case) for high efficiency [23], [28], [34]. In fact, the performance of multi-agent systems, including metaheuristic teams, significantly depends on the quality of collaboration [6]. Swarm intelligence-based agent collaboration is suggested in [4], while the persistence of this challenging issue is reflected in a number of recent studies including [15] and [12], where [15] introduces auction-based consensus among the agents while [12] studies theoretical bases of agent collaboration through mathematical foundations.

III. MODELLING WITH SWARMS OF LEARNING AGENTS

A. Q learning

Q learning is a reinforcement learning algorithm developed based on temporal differences handled with asynchronous dynamic programming. It provides rewards for agents with the capability of learning to act optimally in Markovian domains by experiencing the consequences of actions, without requiring them to build a map of the respective domain [35]. The main idea behind Q learning is to use a single data structure called the utility function, Q(x, a), that is, the utility of performing action a in state x [36]. Throughout the whole learning process, the algorithm updates the value of Q(x, a) using an (x, a, r, y) tuple per step, where r represents the reinforcement signal (payoff) of the


environment and y represents the new state, which is obtained as the consequence of executing action a in state x. Both x and y are elements of the set of states (S) and a is an element of the set of actions (A). Q(x, a) is defined as:

Q : S × A −→ ℝ (2)

and determined as:

Q(x, a) = E(r + γe(y)|x, a) (3)

where γ is a discount constant within the interval [0,1], chosen according to the domain, and e(y) is the expected value of y, defined as:

e(y) = max{Q(y, a)} for ∀a ∈ A (4)

The learning procedure first initialises the Q values to 0 for each action. It then repeats the following procedure: the action with the maximum Q value is selected and activated, and the corresponding Q value of that action is then updated using the following equation (updating rule):

Qt+1(x, a) = Qt(x, a) + β(r + γe(y)−Qt(x, a)) (5)

where Qt(x, a) and Qt+1(x, a) are the old and the new Q values of action a in state x, respectively, and β is the learning coefficient, varying in the [0,1] interval. This iterative process ends when an acceptable level of learning is achieved or a stopping criterion is satisfied. For more information see Sutton and Barto [30].
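As an illustration of the procedure above, the following is a minimal sketch of a tabular Q learner in Python. The state/action encodings, the epsilon-greedy exploration step and the parameter values are simplifying assumptions of ours, not the authors' implementation.

import random
from collections import defaultdict

class QLearner:
    """Tabular Q learning following Eqs. (4)-(5)."""

    def __init__(self, actions, beta=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # Q(x, a), initialised to 0 for every state-action pair
        self.actions = actions        # finite action set A
        self.beta = beta              # learning coefficient in [0, 1]
        self.gamma = gamma            # discount constant in [0, 1]
        self.epsilon = epsilon        # exploration rate (an added assumption)

    def select_action(self, x):
        # Mostly exploit the action with the maximum Q value, occasionally explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(x, a)])

    def update(self, x, a, r, y):
        # e(y) = max_a' Q(y, a')                                             (Eq. 4)
        e_y = max(self.q[(y, a2)] for a2 in self.actions)
        # Q_{t+1}(x, a) = Q_t(x, a) + beta * (r + gamma * e(y) - Q_t(x, a))  (Eq. 5)
        self.q[(x, a)] += self.beta * (r + self.gamma * e_y - self.q[(x, a)])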

B. Particle swarm optimisation (PSO)

PSO is a population-based optimisation technique inspired by the social behaviour of bird flocking and fish schooling. PSO's inventors implemented such natural processes to solve optimisation problems, in which each single solution, called a particle, joins the other individuals to make up a swarm (population) for exploring the search space. Each particle has a fitness value calculated by a fitness function, and a velocity of moving towards the optimum. All particles search across the problem space following the particle nearest to the optimum. PSO starts with an initial population of solutions, which is updated iteration by iteration. A basic PSO algorithm builds each particle based on, mainly, two key vectors: the position vector, xi(t) = {xi,1(t), ..., xi,n(t)}, and the velocity vector, vi(t) = {vi,1(t), ..., vi,n(t)}, where xi,k(t) is the position value of the ith particle with respect to the kth dimension (k = 1, 2, 3, ..., n) at iteration t, and vi,k(t) is the velocity value of the ith particle with respect to the kth dimension at iteration t. The initial values, xi(0) and vi(0), are given by

xi,k(0) = xmin + (xmax − xmin) × r1, (6)
vi,k(0) = vmin + (vmax − vmin) × r2, (7)

where xmin, xmax, vmin and vmax are the lower and upper limits of the ranges of position and velocity values, respectively, and r1 and r2 are uniform random numbers within [0, 1]. Since both vectors are continuous, the original PSO algorithm can straightforwardly be used for continuous optimisation problems. However, if the problem is combinatorial, a discrete version of PSO needs to be implemented. Once a solution is obtained, the quality of that solution is measured with a cost function denoted by fi, where fi : xi(t) −→ ℝ.

For each particle in the swarm, a personal best, yi(t) = {yi,1(t), ..., yi,n(t)}, is defined, where yi,k(t) denotes the position of the ith personal best with respect to the kth dimension at iteration t. The personal bests are equal to the corresponding initial position vectors at the beginning. Then, in every generation, they are updated based on solution quality. Regarding the objective function, fi, the fitness value of the personal best of the ith particle, yi(t), is denoted by fyi(t) and updated whenever fyi(t + 1) ≺ fyi(t), where t stands for iteration and ≺ corresponds to the logical operator, which becomes < or > for minimization or maximization problems, respectively.

On the other hand, a global best, which is the best particle within the whole swarm, is defined and selected among the personal bests, y(t), and denoted by g(t) = {g1(t), ..., gn(t)}. The fitness of the global best, fg(t), can be obtained using:

fg(t) = opti∈N{fyi (t)} (8)

where opt becomes min or max depending on the type of optimization. Afterwards, the velocity of each particle is updated based on its personal best, yi(t), and the global best, g(t), using the following updating rule:

vi(t + 1) = δwt∆vi(t) (9)
∆vi = c1r1(yi(t) − xi(t)) + c2r2(g(t) − xi(t)) (10)

where w is the inertia weight used to control the impact of the previous velocities on the current one, decremented by the decrement factor β via wt+1 = wt × β, and δ is the constriction factor, which keeps the effects of the randomised weight within a certain range. In addition, r1 and r2 are random numbers in [0,1], and c1 and c2 are the learning factors, also called the social and cognitive parameters. The next step is to update the positions with:

xi(t+ 1) = xi(t) + vi(t+ 1) (11)

for continuous problem domains. On the other hand, since discrete problems cannot be solved in the same way as continuous problems, various discrete PSO algorithms have been proposed. Among these, Kennedy and Eberhart [13] have proposed the most used one, which mainly creates a binary position vector based on velocities as follows:

xi(t + 1) = 1 / e^(vi(t+1)) (12)

After the position values are updated for all particles, the corresponding solutions and their fitness values are calculated so as to start a new iteration if the predetermined stopping criterion is not satisfied. For further information, see [20] and [31].
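For reference, the following is a compact Python sketch of the loop described by Eqs. (6)-(11), written for a minimisation problem. The fitness function, bounds and coefficient values are placeholder assumptions of ours, and Eq. (9) is followed as printed, without a memory term for the previous velocity.

import numpy as np

def pso(fitness, n_particles=20, n_dims=2, iters=100,
        x_bounds=(-10.0, 10.0), v_bounds=(-1.0, 1.0),
        c1=2.0, c2=2.0, w=0.9, beta=0.99, delta=1.0):
    rng = np.random.default_rng()
    x = rng.uniform(*x_bounds, size=(n_particles, n_dims))   # Eq. (6)
    v = rng.uniform(*v_bounds, size=(n_particles, n_dims))   # Eq. (7); replaced each iteration below
    y = x.copy()                                              # personal bests y_i(t)
    fy = np.array([fitness(p) for p in y])
    g = y[np.argmin(fy)]                                      # global best g(t), Eq. (8)
    for _ in range(iters):
        r1 = rng.random((n_particles, n_dims))
        r2 = rng.random((n_particles, n_dims))
        dv = c1 * r1 * (y - x) + c2 * r2 * (g - x)            # Eq. (10)
        v = delta * w * dv                                    # Eq. (9) as printed
        w *= beta                                             # inertia weight decremented each iteration
        x = x + v                                             # Eq. (11)
        fx = np.array([fitness(p) for p in x])
        improved = fx < fy                                    # '<' because this sketch minimises
        y[improved], fy[improved] = x[improved], fx[improved]
        g = y[np.argmin(fy)]
    return g, fy.min()

Calling, for example, pso(lambda p: float(np.sum(p**2))) would search for the minimum of a simple sphere function under these assumptions.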


C. Swarms of Learning Agents

PSO is one of the best-known swarm intelligence algorithms used to develop collective behaviours and intelligence, inspired by bird flocks. Although it has a good record of success, learning capability remains an important aspect to be developed further for improved intelligence. There are a few studies investigating the hybridisation of reinforcement learning algorithms, especially the Q learning algorithm, implemented for particular applications [10], [22], [26]. Likewise, the Q learning algorithm has been implemented in various studies to develop coordination of multi-agent systems [18]. However, PSO has not been integrated with Q learning in order to drive each particle within the swarm towards learning for collaboration.

For the purpose of training the particles of the swarm to behave in harmony within their neighbourhoods, we propose the use of the Q learning algorithm in building intelligent search behaviour for each individual. A Q learning algorithm is embedded in PSO in such a way that the position vector, xi, is updated subject to a well-designed implementation of Q learning, adaptively controlling the behaviour of the individuals towards collective behaviours to which all members of the swarm collectively and intelligently contribute. Hence, we revised PSO, first, by ignoring the use of the velocity vector, vi, so as to save time and energy, relying on the fact that the position vector, xi, inherently contains vi and does not necessitate its use [29], [5]. Secondly, the update rule of the position vectors, xi, (Eq. 11) is revised as follows:

xi(t + 1) = xi(t) + f(Q, xi, a) (13)
f(Q, xi, a) = {x̂i | max[Q(xi, a)] for ∀a ∈ A} (14)

where x̂i is a particular position vector obtained from f(Q, xi, a), in which action a is taken since it returns the highest utility value, Q. The main aim of each individual/particle is to learn from the experience gained once it receives the reward produced by the reinforcement mechanism, which credits rightly taken actions and punishes wrongly taken ones. This learning property, incrementally developed by each particle, leads to a well-designed collective behaviour.
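The sketch below illustrates how Eqs. (13)-(14) replace the velocity-based update. It assumes a hypothetical particle object exposing a Q table (particle.q) and a position (particle.position), and a displacements mapping from actions to position changes; these names are illustrative choices of ours rather than the authors' code.

def learned_position_update(particle, state, displacements):
    """x_i(t+1) = x_i(t) + f(Q, x_i, a), Eq. (13), with f picking the max-utility action (Eq. 14)."""
    # Choose the action with the highest utility Q(x, a) in the current state.
    best_action = max(displacements, key=lambda a: particle.q[(state, a)])
    # Move the particle by the displacement produced by that action; no velocity vector is used.
    particle.position = particle.position + displacements[best_action]
    return best_action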

D. Reinforcement Mechanism

As clearly indicated before, the reinforcement mechanism plays the crucial role in furnishing particles with learning capabilities. It remains an independent monitoring mechanism that assesses the actions taken by the particles and supplies them with reinforcing payoff grades. It is usually implemented in a reward function, which is defined as follows:

R : S ×A −→ R (15)

The reward function considers the situation in which a particular state, x, is acted on with action a, and whether or not the action taken is the correct one. A reward, r, is produced as the assessment level for the situation. Thus, an efficient reward function should be developed based on the problem domain.

Fig. 1: A typical urban landscape

IV. SCANNING A DISASTROUS AREA WITH A SWARM OF LEARNING AGENTS

This problem case is adopted to illustrate the implementation of collective intelligence achieved using the multi-agent learning algorithm proposed in this study, which is built up by embedding Q learning within the particle swarm optimisation algorithm. Fig. 1 illustrates a simple scenario in which a typical piece of land combining rural and urban areas is to be scanned by a swarm of learning agents. Suppose that such an area, subjected to some disaster, is required to be scanned for information collection purposes. A flock of artificial birds (a swarm of UAVs) is deployed; each is identified as a particle and furnished with a list of actions to take while moving around the area in collaboration with other peer particles. Each particle is enabled to learn via the Q learning implemented for this purpose and is trained how to remain connected with the rest of the swarm. The logic implemented to identify whether a particle is collaborating or not is demonstrated in Fig. 2.

Two possible cases are illustrated in Fig. 2. As indicated, teams of particles (flocks of birds) can remain interconnected for collaboration if each is sufficiently close to another peer particle, which is measured with a Euclidean distance calculated in a circle-centric way.

Fig. 2: Connecting individuals via distance


A particle is considered connected if it remains within a circle of a particular radius, but is out of connection if it falls outside the circle of that radius. In addition, the particles are expected not to approach each other beyond a certain distance; otherwise they are also counted as not well-collaborating, since they overlap and waste resources. The main idea behind this algorithm is to train the particles neither to fall apart nor to overlap. That is the main objective to achieve.

A. Embedding Q learning within each particle

Since the swarm intelligence framework preferred in this study is PSO, each individual forming the swarm is identified as a particle, as in particle swarm optimisation. Let M be the size of the swarm, where M particles are created to form the swarm; each has a 2-dimensional position vector, xi = {x1,i, x2,i | i = 1, ...,M}, because the defined area is 2-dimensional and each particle simply moves forward and/or backward, vertically and/or horizontally. For simplification purposes, each particle is allowed to move by selecting one of the predefined actions, where each action is defined as a step of which the particle can choose only the size. Using the same notation as Q learning, the size of the set of actions is A, which includes forward and backward short, middle and long steps. Hence, a particle can move forward and backward by selecting one of these six actions. Let ∆ = {δj | j = 1, ..., A} be the set of steps, including both forward and backward ones, which a particle is able to take as part of the action it wants to perform. Once an action is decided and taken, the position of the particle changes by:

f(Q, xi, a) = πδj (16)

where π is a probability calculated based on the positions and possible moves of neighbouring particles. Substituting equation (16) into equation (14), the new position of the particle under consideration is determined. Here, the neighbourhood is considered to be the other peer particles that have connectivity with the one under consideration, which is determined based on the distance between them. Let Ni ⊆ M be the set of neighbouring peer particles (agents) of the ith particle, defined as:

Ni = {xk| ε > d(xi,xk)} ∀k ∈M (17)

where d(xi, xk) is calculated as a Euclidean distance and ε is the maximum distance (the threshold) between two peer particles set up to remain connected. Once a particle has moved as a result of the action taken, the reinforcement mechanism (in other words, the reward function) assesses the decision made for this action, considering the previous state of the particle before the transition and the resulting positions of the neighbouring peer particles.

r =
    100,                                if ∑_{k=1}^{Ni} d(xi, xk) = Niε
    Niε − ∑_{k=1}^{Ni} d(xi, xk),        if Niε > ∑_{k=1}^{Ni} d(xi, xk)
    −100,                               if ∑_{k=1}^{Ni} d(xi, xk) ≤ 0        (18)

Here ε is also the maximum sensing distance of each particle, within which particles are allowed to be apart and still connected. The reward is mainly calculated based on the total distance from the particle to its neighbouring particles. If no neighbouring particle is found, which means the particle has lost connection, then it is punished with a negative reward of −100. If there is still connection but the total distance is less than Niε, then the reward is as calculated in the second case of equation (18). If the total distance from its position to all other neighbouring particles equals Niε, then that deserves the full reward, which is 100.
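A sketch of the neighbourhood test (Eq. 17) and the piecewise reward (Eq. 18) in Python is given below; treating the no-neighbour case as a total distance of zero, and hence the −100 branch, is our reading of the last case. The positions argument is assumed to be a list of 2-D coordinate tuples.

import math

def neighbours(i, positions, epsilon):
    """N_i = { x_k : epsilon > d(x_i, x_k) }, with d the Euclidean distance (Eq. 17)."""
    return [k for k, xk in enumerate(positions)
            if k != i and math.dist(positions[i], xk) < epsilon]

def reward(i, positions, epsilon):
    """Piecewise reward of Eq. (18), based on the total distance to neighbouring particles."""
    ns = neighbours(i, positions, epsilon)
    total = sum(math.dist(positions[i], positions[k]) for k in ns)
    target = len(ns) * epsilon            # N_i * epsilon, the ideal total distance
    if not ns or total <= 0:              # connection lost: punish with -100
        return -100.0
    if total == target:                   # exactly the ideal spread: full reward of 100
        return 100.0
    return target - total                 # second case: partial credit, N_i*epsilon minus the total distance

In a learning loop, this reward would be fed back to each particle's Q update after every move, crediting or punishing the action just taken.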

V. EXPERIMENTAL RESULTS

This section presents experimental results to demonstrate, as a proof of concept, that the Q learning algorithm works to help particles (agents) self-train towards building collaboration and behaving as swarm members. The aim is also to review and analyse how the whole study turned out, judging whether the final implementation adhered to the expectations set in advance. The algorithm has been implemented for a number of swarm sizes using an agent-based simulation tool called NetLogo [37].

For a sound evaluation, an agile approach was adopted to run the study through iterations in which the work was incremented bit by bit. Following this approach, various elements were considered, ranging from the algorithm itself to the methods, techniques and tools used, analysing what each component did well and what could be done better. One way to gain insight into the level of success of each aspect of the study is to imagine starting the same project afresh whilst retaining all current knowledge, contemplating which elements would be kept, which would be changed, and whether these changes could lead to an improved implementation.

A. Approximation and Evaluation

Throughout the project, the initial iteration was to find a way to embed Q learning into PSO, which has been achieved as explained in the previous sections. The ultimate aim is to show that both algorithms work hand in hand to achieve a swarm of learning agents that collaborate for collective behaviour/intelligence. This aim is not black and white, but has many grey areas. This is because the algorithm is not judged merely on, for example, speed of convergence, for which improvement could very easily be verified. Rather, the algorithm is subject to in-depth observation as to whether the particles are behaving correctly, which in itself has intricacies that require close inspection.


Fig. 3: Initialised particles' positions for both algorithms, multi-agent Q learning and particle swarm optimisation: (a) initial positions for M-QL; (b) initial positions for PSO.

In increment two, the goal was to get each particle to be essentially "reactive" to the others in a real-world environment. So if one particle moved, the others, which are also moving simultaneously, would need to take their fellow particles' movements as well as their own into consideration and react accordingly, so that they are always within proximity of their neighbours. This proximity prevents a particle from invading its neighbours' space whilst also not allowing it to drift too far out of the radius; if it does either of these, it is punished, whereas if it stays the "perfect" distance away, it receives the maximum reward.

This approach in theory allows collaborative learning to occur as particles gain knowledge of the correct expected behaviour. This raises the question of whether it was successful, and the results suggest it very much was. Each participating particle was actively reacting to the movement of its peers and, with the reinforcement mechanism, learning with each iteration which actions would be best to take. In this iteration, two swarms were created: one working with embedded Q learning (denoted M-QL henceforth) and the other running a standard PSO.

The experimentation starts with the initial swarms as seen in Fig. 3; the swarms are then incremented through iterations as presented in Fig. 4, where the behaviours of both swarms, the learning swarm and the PSO swarm, are shown after 10, 50 and 500 iterations, respectively.

Fig. 3 and Fig. 4 illustrate the stark difference made when the particles move with a reinforced incentive. Whereas PSO is essentially designed solely to move particles iteratively towards their best value, the addition of the Q learning proximity measure prevents such erratic movement. Of course, over 500 iterations some movement is going to occur, as particles will rarely be hitting their "perfect" +100 reward movements, but as is visible in Fig. 4e, each particle is connected to at least one other at a feasible proximity. In this instance, which does not always happen, the clusters have merged together, indirectly forming one large network. This is fine and can be expected to happen on occasion: through individual incremental movement through the environment, particles are going to move into the consideration radius of other particles, subsequently inheriting them into their

Fig. 4: A set of comparative results to demonstrate the behaviours of the learning algorithm versus PSO: (a) after 10 iterations with M-QL; (b) after 10 iterations with PSO; (c) after 50 iterations with M-QL; (d) after 50 iterations with PSO; (e) after 500 iterations with M-QL; (f) after 500 iterations with PSO.

peer-particle Q-mechanic. This is something that, in a real-world situation, might need to be prevented if clusters are required to remain in their native cluster; however, in this simple environment, with no mechanism to prevent it, it is acceptable.

As can be observed from Fig. 4, the particles of the swarm learning with M-QL demonstrate connectivity among themselves by keeping a connecting distance from one another, while the swarm running PSO approximates to a particular value, where all particles come to nearly overlapping positions. In fact, the behaviour of the particles in Fig. 4a, 4c and 4e clearly indicates that the individual particles keep their distance, neither falling far apart nor remaining too close to one another; as the number of iterations increases, the distances become more fitting, as Fig. 4a shows some particles still too close to each other,


but Fig. 4e indicates better positioning. On the other hand, Fig. 4b, 4d and 4f demonstrate how particles approximate to a targeted value without maintaining any distance among one another. More iterations help individual particles get closer and take increasingly overlapping positions.

The results of increment two confirm that particles are at least capable of learning, both in the individual sense and in the subsequent group sense. Although this is a fundamentally basic example of learning, it acts as a basis which can be built on in various ways. From running various parameter configurations in earlier experimentation, it was observed that the particles chose the correct action to take in relation to the proximity; this showed they had learned which action would benefit them the most, and also showed that the components of state and action were working correctly. However, as anticipated, the workings of the increment are not perfect: whilst through the first phase of iterations the clusters seem to keep a good proximity, with particles clearly getting negative rewards for interfering with their counterparts, problems emerge later in the episode.

It is observed that, as an episode gets into roughly the last 40% of its iterations on average, particles visibly begin to drift out of the neighbourhood, and once a particle loses connection there is no mechanism to bring it back into a neighbourhood radius; only luck can allow this to happen. Of course, if this incident happened in a real-world environment, the results could be extremely costly. This issue is therefore left to future research, through potentially reinforced path-finding or efficient search algorithms. If a particle could find its way back into a neighbourhood, the algorithm could be much more efficient and realistically deployable in a real-world domain.

B. Individuals’ learning behaviour

The learning performance of individual particles was another focus of this research. For observing individual learning performance, three particles were kept under observation over 100 iterations. Due to the limitations of NetLogo, each simulation in this regard was physically observed from start to end; for each iteration, particles were individually judged on whether they had made a good or a bad decision, where a good decision means taking the correct action and receiving a positive reward, while a bad decision means taking a wrong action and receiving a negative reward (penalty).

For quantification, a good decision is dictated by a particle moving in such a way that it neither gets too close to a fellow particle nor drifts outside of the radius. This brings into question the issue of synchronisation. The synchronisation problem occurs when two particles move at the same time, which can cause them to "choose" to move closer to each other, or further away, simultaneously, thus causing a bad decision.

It must also be noted that when a particle moves out of a radius, it has no real method of finding its neighbourhood again, and it therefore becomes a flat line of bad decisions on the graph.

Fig. 5: Learning behaviours of the three particles: (a) Particle 0; (b) Particle 1; (c) Particle 2.

The results in these graphs show that, for the most part, the correct decision is usually made, which shows the hybridised algorithm does work. As mentioned before, an issue occurs when two particles move at the same time, because they simply cannot predict what their fellow particles will do, which causes proximity problems. This could be rectified simply by having certain particles in the topology move in "turns". If one particle is not moving on a turn, the other particles could successfully move closer to or further away from it without any conflict.

As can be seen in the graph for particle 2 (Fig. 5c), it started to make bad decisions for around 10 iterations, and this continued afterwards. This was because it drifted out of the radius of its topology and had no method of getting back into it. This is another problem that we believe could be simply rectified if a technique were implemented to allow the lost particle to re-find its topology or find another one. Even a random search method would give decent results if the landscape was


well populated. We also changed the reward value from 5 to 0.5, but this made little to no difference in the graphs, as the subsequent discount factor and learning rate do not get enough of a chance to have a real impact on the results. Changing these parameters would be much more effective over longer runs, such as 500-1000 iterations; however, the output configuration of NetLogo makes any analysis beyond observation difficult.

VI. CONCLUSIONS

This paper presents a proof-of-concept study demonstrating the viability of building collaboration among multiple agents through a standard Q learning algorithm embedded in particle swarm optimisation. A number of particles furnished with Q learning have been subjected to self-training to act as members of a swarm and to produce collaborative/collective behaviours. Following the introduction of the algorithmic foundations and structures, an experimental study was conducted to demonstrate that the formulated algorithm produces results supporting the intended behaviours. The results were produced under very simplistic assumptions; further enhancements require more extensive theoretical and experimental studies.

REFERENCES

[1] N. Alechina and B. Logan: Computationally grounded account of belief and awareness for AI agents, in Proceedings of The Multi-Agent Logics, Languages, and Organisations Federated Workshops (MALLOW 2010), Lyon, France, August 30 - September 2, 2010, volume CEUR-WS 627.
[2] M.B. Ayhan, M.E. Aydin, and E. Oztemel: A multi-agent based approach for change management in manufacturing enterprises. Journal of Intelligent Manufacturing, 26(5), 2015, pp. 975-988.
[3] M.E. Aydin, N. Bessis, E. Asimakopoulou, F. Xhafa, and J. Wu: Scanning Environments with Swarms of Learning Birds: A Computational Intelligence Approach for Managing Disasters. In IEEE International Conference on Advanced Information Networking and Applications (AINA), 2011, pp. 332-339.
[4] M.E. Aydin: Coordinating metaheuristic agents with swarm intelligence. Journal of Intelligent Manufacturing, 23(4), 2012, pp. 991-999.
[5] M.E. Aydin, R. Kwan, C. Leung, and J. Zhang: Multiuser scheduling in HSDPA with particle swarm optimization, in Applications of Evolutionary Computing, Proceedings, Giacobini, ed., vol. 5484 of Lecture Notes in Computer Science, 2009, pp. 71-80.
[6] M. Aydin: Metaheuristic agent teams for job shop scheduling problems. In Holonic and Multi-Agent Systems for Manufacturing, 2007, pp. 185-194.
[7] J. Bradtke, A.G. Barto, and P. Kaelbling: Linear least-squares algorithms for temporal difference learning, in Machine Learning, 1996, pp. 22-33.
[8] L. Bull: Two Simple Learning Classifier Systems, in Foundations of Learning Classifier Systems, L. Bull and T. Kovacs, eds., no. 183 in Studies in Fuzziness and Soft Computing, Springer-Verlag, 2005, pp. 63-90.
[9] L. Bull and T. Kovacs, eds.: Foundations of Learning Classifier Systems, vol. 183 of Studies in Fuzziness and Soft Computing, Springer, 2005.
[10] C. Claus and C. Boutilier: The dynamics of reinforcement learning in cooperative multiagent systems, in Proceedings of the National Conference on Artificial Intelligence (AAAI-98), 1998, pp. 746-752.
[11] A. Colorni, M. Dorigo, V. Maniezzo, and M. Trubian: Ant system for job-shop scheduling. Belgian Journal of Operations Research, Statistics and Computer Science (JORBEL), 34(1), 1994, pp. 39-53.
[12] X. Dong: Consensus Control of Swarm Systems, in Formation and Containment Control for High-order Linear Swarm Systems, Springer Berlin Heidelberg, 2016, pp. 33-51.
[13] R. Eberhart and J. Kennedy: A new optimizer using particle swarm theory, in Proc. of the 6th Int. Symposium on Micro-Machine and Human Science, 1995, pp. 39-43.
[14] J. Foerster, Y.M. Assael, N. de Freitas, and S. Whiteson: Learning to communicate with deep multi-agent reinforcement learning, in Advances in Neural Information Processing Systems, 2016, pp. 2137-2145.
[15] M. Gath: Optimizing Transport Logistics Processes with Multiagent Planning and Control. PhD Thesis, 2015, published by Springer Fachmedien Wiesbaden, 2016.
[16] M. Hammami and K. Ghediera: COSATS, X-COSATS: Two multi-agent systems cooperating simulated annealing, tabu search and X-over operator for the K-Graph Partitioning problem, Lecture Notes in Computer Science 3684, 2005, pp. 647-653.
[17] L.M. Hercog: Better manufacturing process organization using multi-agent self-organization and co-evolutionary classifier systems: The multibar problem. Applied Soft Computing, 13(3), 2013, pp. 1407-1418.
[18] H. Iima and Y. Kuroe: Swarm reinforcement learning algorithm based on particle swarm optimization whose personal bests have lifespans, in Neural Information Processing, C. Leung, M. Lee, and J. Chan, eds., vol. 5864 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2009, pp. 169-178.
[19] A. Kazemi, M.F. Zarandi, and S.M. Husseini: A multi-agent system to solve the production-distribution planning problem for a supply chain: a genetic algorithm approach. The International Journal of Advanced Manufacturing Technology, 44(1-2), 2009, pp. 180-193.
[20] J. Kennedy and R.C. Eberhart: A discrete binary version of the particle swarm algorithm, in 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, Orlando, FL, 1997, pp. 4104-4108.
[21] J. Kennedy, R. Eberhart, and Y. Shi: Swarm Intelligence, Morgan Kaufmann, San Mateo, CA, USA, 2001.
[22] J.R. Kok and N. Vlassis: Sparse cooperative Q-learning, in Proceedings of the International Conference on Machine Learning, ACM, 2004, pp. 481-488.
[23] M. Kolp, P. Giorgini, and J. Mylopoulos: Multi-agent architectures as organizational structures, Autonomous Agents and Multi-Agent Systems, 13, 2006, pp. 3-25.
[24] A. Kouider and B. Bouzouia: Multi-agent job shop scheduling system based on co-operative approach of idle time minimisation. International Journal of Production Research, 50(2), 2012, pp. 409-424.
[25] P. Kouvaros and A. Lomuscio: Parameterised verification for multi-agent systems, Artificial Intelligence, 234, 2016, pp. 152-189.
[26] Y. Meng: Recent Advances in Multi-Robot Systems, I-Tech Education and Publishing, 2008, ch. Q-Learning Adjusted Bio-Inspired Multi-Robot Coordination, pp. 139-152.
[27] S. Mohebbi and R. Shafaei: E-Supply network coordination: The design of intelligent agents for buyer-supplier dynamic negotiations. Journal of Intelligent Manufacturing, 23, 2012, pp. 375-391.
[28] L. Panait and S. Luke: Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3), 2005, pp. 387-434.
[29] R. Poli, J. Kennedy, and T. Blackwell: Particle swarm optimization. Swarm Intelligence, 1, 2007, pp. 33-57.
[30] R.S. Sutton and A.G. Barto: Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, USA, 1998.
[31] M. Tasgetiren, Y. Liang, M. Sevkli, and G. Gencyilmaz: Particle swarm optimization algorithm for makespan and total flow-time minimization in permutation flow-shop sequencing problem. European Journal of Operational Research, 177(3), 2007, pp. 1930-1947.
[32] G. Tesauro: Practical issues in temporal difference learning, in Machine Learning, 1992, pp. 257-277.
[33] J.N. Tsitsiklis and R. Sutton: Asynchronous stochastic approximation and Q-learning, in Machine Learning, 1994, pp. 185-202.
[34] J. Vazquez-Salceda, V. Dignum, and F. Dignum: Organizing multi-agent systems, Autonomous Agents and Multi-Agent Systems, 11, 2005, pp. 307-360.
[35] C. Watkins: Learning from delayed rewards, PhD thesis, Cambridge University, 1989.
[36] C. Watkins and P. Dayan: Technical note: Q-learning. Machine Learning, 8, 1992, pp. 279-292.
[37] U. Wilensky and W. Rand: An introduction to agent-based modeling: Modeling natural, social and engineered complex systems with NetLogo, MIT Press, Cambridge, 2015.