
Lego Mindstorms Robots as a Platform for Teaching Reinforcement Learning

Peter Vamplew
School of Computing

University of Tasmania
Sandy Bay, Tasmania, Australia

[email protected]

ABSTRACT

This paper examines the suitability of Lego Mindstorms robotic kits as a platform for teaching the concepts of reinforcement learning. The reinforcement learning algorithm Sarsa was implemented on board an autonomous Mindstorms robot, and applied to two learning tasks. The reasons behind the differing results obtained on these two tasks are discussed, and several issues related to the suitability of Mindstorms as a platform for teaching or experimentation with reinforcement learning are identified.

1. INTRODUCTION

Over recent years the increasing availability of low-cost programmable robotics systems has led to a growth in the use of robots within an educational context. At primary and secondary school level, events such as RoboCupJunior have been seen as a means of developing technical literacy amongst students by providing them with an exciting, highly motivating learning environment [1]. It has also been suggested that these motivational benefits apply at the level of undergraduate teaching, with [2, 3] amongst others promoting the use of robots in introductory computer science and artificial intelligence courses. The research described in this paper was motivated by a proposal to incorporate robotics within the teaching of an intelligent agents course within the School of Computing at the University of Tasmania.

Reinforcement learning would appear to be an aspect of artificial intelligence for which this teaching methodology is particularly appropriate. This form of machine learning is gathering an increasing amount of attention amongst artificial intelligence researchers, and as such should be considered for inclusion in the curriculum of undergraduate artificial intelligence units. The reinforcement learning paradigm involves an autonomous agent which learns, through interaction with its environment, the optimal outputs to produce in response to observed inputs in order to maximise the overall reinforcement received. This approach is well suited to tasks where a goal can be specified for the system, but where expert knowledge about the means of achieving that goal does not exist.

An autonomous robot is a natural example of a situation to which reinforcement learning is applicable. The robot gathers input about the environment via its sensors, and can interact with the environment via its manipulators.

The task desired of the robot is defined in terms of a reinforcement signal, and the robot aims to choose its actions in such a way as to maximise that reinforcement. Control systems for autonomous robots have been one of the main applications of reinforcement learning (for example [4] and [5]), and so this would appear to be an extremely appropriate example to use in teaching this style of machine learning. In particular the use of a robot situated within a real, physical location emphasises the role of the environment within reinforcement learning much more clearly than would be the case within a simulation or more abstract application. However [3] states that "learning techniques such as neural networks and reinforcement learning are too complex to implement directly on the LEGO hardware". This paper aims to show that by carefully selecting the task and robot configuration, it is in fact possible to successfully implement and demonstrate reinforcement learning using a low-cost robotics system.

2. LEGO MINDSTORMS AND LEGOS

Clearly one of the main restrictions on the adoption of educational robotics is the cost involved in providing enough robotics hardware for students to have sufficient hands-on experience. The Mindstorms robots manufactured by Lego provide a possible platform for use in such a course. These robots are relatively low-cost, programmable, and due to being constructed from Lego blocks have a flexible physical structure which can be adapted to a wide range of tasks.

Unfortunately the standard software bundled with a Mindstorms robot has several major restrictions from the point of view of implementing machine learning algorithms. Most important amongst these limitations are the extremely small number of variables available, and the lack of support for floating point arithmetic. However alternative programming environments providing enhanced functionality, such as Not Quite C [6] and LegOS [7], have been developed by Mindstorms enthusiasts. For the purposes of this research LegOS was selected as the implementation environment. LegOS programs are written in C or C++, and cross-compiled under a Linux environment before being downloaded to the RCX controller on the Mindstorms robot. It should be noted that during development of the software for this project numerous discrepancies were found between the LegOS libraries and the corresponding documentation. Whilst this is perhaps to be expected given LegOS's 'alternative' status and on-going development (and is in fact alluded to in the documentation), it would nevertheless prove troublesome were students asked to develop code in this environment.

3. THE LEARNING TASKS AND ALGORITHM

Two different tasks were chosen to test the LegOS implementation of reinforcement learning. These tasks had differing characteristics, and required different physical configurations of the robot. In both cases known heuristic algorithms were available to allow comparison with the behaviour achieved by the learning system.

As stated earlier, the task facing a reinforcement learning system is to find an appropriate mapping from input to output values so as to maximise the reinforcement received by the learning agent. In this case the inputs to the system are measurements of the environment made by the robot's sensors, whilst the outputs are used to control the robot's effectors. With regards to the Mindstorms robots the available inputs are light and contact sensors, whilst the outputs are generally used to drive the robot's motors. Using only the components from a single Robotics Invention System kit, the robot is restricted to three inputs and two outputs. (There are in fact three outputs, but only two motors.)

The most flexible control of the robot would be obtained by mapping each output directly to a control value for a single motor. However most current reinforcement learning algorithms have difficulties in dealing with outputs of a continuous nature (although progress has been made in this area; see for example [8]). The alternative is to define a small number of discrete actions which are available to be selected by the control system. For any given state information, the control system will estimate the value of executing each of these actions in that situation and select the action with the highest expected return.

The software used in this research implements the Sarsa [9] algorithm, which is based on the concept of temporal difference learning [10]. Sarsa learns through experience the value associated with selecting a particular action in a particular state. After an action is selected and executed, its estimated value is updated on the basis of any immediate reward and the estimated value of the action which will be selected for the subsequent state. The Sarsa algorithm is given below; Q(s,a) is the estimated value of taking action a from state s, r is the immediate reward received after taking an action, α is a learning-rate parameter, and γ is a discounting factor.

Initialise Q(s,a) with random values
For each learning episode
    Observe initial state s
    Select an action a to perform
    Repeat
        Execute a, observe r and the new state s'
        Select action a' from s'
        Q(s,a) = Q(s,a) + α(r + γQ(s',a') - Q(s,a))
        s = s'; a = a'
    Until end of episode

The action selection process is based on the current Q values. In order to balance the need to make exploratory actions to learn about the problem space with the need to exploit what has already been learnt about the environment, an ε-greedy strategy is used in which the action is selected greedily with probability 1-ε, and randomly with probability ε.
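
As a concrete illustration, the sketch below (in C, the language used for LegOS programs) shows one way of implementing ε-greedy selection over three discrete actions. Here q_value() is a placeholder for the learned value estimate described in the following paragraph, the 5% exploration rate matches the setting reported in Section 4, and the use of the standard library rand() is an assumption made for the example rather than a detail of the actual implementation.

#include <stdlib.h>

#define NUM_ACTIONS     3
#define EPSILON_PERCENT 5    /* proportion of exploratory actions (5%) */

/* Placeholder for the learned estimate Q(s,a); defined elsewhere. */
extern float q_value(const float *state, int action);

/* Epsilon-greedy: random action with probability epsilon, otherwise
   the action with the highest estimated value. */
int select_action(const float *state)
{
    int a, best = 0;

    if ((rand() % 100) < EPSILON_PERCENT)
        return rand() % NUM_ACTIONS;            /* explore */

    for (a = 1; a < NUM_ACTIONS; a++)           /* exploit */
        if (q_value(state, a) > q_value(state, best))
            best = a;
    return best;
}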

The simplest implementation of Sarsa is the tabular form, in which the Q values are stored in a table with a cell for each state-action pair. This does not scale well to problems with a large number of states (for example, in the line-following task described in Section 3.2 the table would require 256 × 256 × 3 = 196,608 cells), and so the Q values are often represented instead using some form of function approximation. For this work linear approximators were used to represent the values of state-action pairs. This function-fitting system would be too simple for many tasks, but from analysis of the two sample tasks used in this study it could be seen that a more complex approximation function was not necessary. Linear methods have previously been successfully applied within a reinforcement learning context by several authors, including [11] and [12].
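
To make the combination of Sarsa and linear approximation concrete, the following sketch shows Q(s,a) computed as a weighted sum of the state features (plus a bias), together with the corresponding gradient-style weight update derived from the Sarsa error term. The three-input state matches the walking task of Section 3.1; all names, and the choice of a separate weight vector per action, are illustrative rather than taken from the actual source code.

#define NUM_INPUTS  3     /* state features, as in the walking task */
#define NUM_ACTIONS 3
#define ALPHA 0.1f        /* learning rate, as used in Section 4 */
#define GAMMA 0.9f        /* discounting factor, as used in Section 4 */

/* One linear approximator (weights plus bias) per action. */
static float weights[NUM_ACTIONS][NUM_INPUTS + 1];

/* Q(s,a) as a linear function of the state features. */
float q_value(const float *state, int action)
{
    int i;
    float sum = weights[action][NUM_INPUTS];        /* bias weight */
    for (i = 0; i < NUM_INPUTS; i++)
        sum += weights[action][i] * state[i];
    return sum;
}

/* Sarsa update: move Q(s,a) towards r + gamma * Q(s',a'). */
void sarsa_update(const float *s, int a, float r, const float *s2, int a2)
{
    int i;
    float delta = r + GAMMA * q_value(s2, a2) - q_value(s, a);
    for (i = 0; i < NUM_INPUTS; i++)
        weights[a][i] += ALPHA * delta * s[i];      /* gradient is the input value */
    weights[a][NUM_INPUTS] += ALPHA * delta;        /* bias input of 1 */
}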

For the tasks used in these experiments it was found that it was necessary to insert a delay loop in the implementation of the Sarsa algorithm, as otherwise the amount of movement executed by the robot between action selections was so small that the vast majority of actions did not result in any changes in either the input state or the reinforcement signal. The need for this delay indicates that the processing power of the CPU would be sufficient to scale up to more sophisticated forms of function approximation (such as small neural networks) should this be required for more complex learning tasks.
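
Putting the pieces together, a single training episode then resembles the loop sketched below. The msleep() call is assumed to be the millisecond delay routine available under LegOS; read_state(), apply_action(), reward() and episode_over() are hypothetical helpers, the three-element state arrays match the walking task, and the 300 ms delay is purely illustrative.

#define STEP_DELAY_MS 300    /* illustrative delay between action selections */

/* Hypothetical helpers for sensing, actuation, reward and episode control. */
extern void  read_state(float *state);
extern void  apply_action(int action);
extern float reward(void);
extern int   episode_over(void);
extern int   select_action(const float *state);
extern void  sarsa_update(const float *s, int a, float r,
                          const float *s2, int a2);
extern void  msleep(unsigned int ms);   /* assumed LegOS delay routine */

void run_training_episode(void)
{
    float s[3], s2[3];
    int a, a2, i;

    read_state(s);
    a = select_action(s);
    while (!episode_over()) {
        apply_action(a);
        msleep(STEP_DELAY_MS);          /* let the action take visible effect */
        read_state(s2);
        a2 = select_action(s2);
        sarsa_update(s, a, reward(), s2, a2);
        for (i = 0; i < 3; i++)         /* s = s'; a = a' */
            s[i] = s2[i];
        a = a2;
    }
}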

3.1. THE WALKING TASK

The first task used in this experiment involved the robot learning to perform a walking action. The robot structure used for this task had four legs, but with both legs on the same side of the body being driven by a single shared motor, as shown in Figure 1. Notice the contact sensor positioned directly above the motor (this is duplicated on the far side of the robot), and also the downward-facing light sensor at the front of the robot.

Figure 1: The robot configuration used in the walking task experiments.


In order for such a robot to walk (or more precisely, shuffle) it is necessary for the two pairs of legs to move out of phase with each other. If both motors are run simultaneously the effect is that the legs remain fixed, whilst the body rotates around the motors. A walking action can be achieved by a heuristic algorithm which introduces a pause in the activation of each motor, timed to create an asymmetry in the movement of the legs. In this algorithm, there is no input from the external environment. Instead the pausing of the motors is determined by sensing the positioning of the robot's legs via the contact sensors, which are pressed when each leg is raised to its highest position relative to its motor. Varying the length of the pause gives rise to gaits of differing appearance and effectiveness, as investigated by [13].

This task was chosen to investigate whether a reinforcement learning system could learn to emulate or improve on any of the walking actions produced by the heuristic walking algorithm. The inputs to the TD(0) algorithm used were similar to those used by the heuristic algorithm. For each motor a count was maintained of the number of time-steps for which that motor had been active since its corresponding contact sensor had last been activated. The inputs to the linear approximators consisted of this value for each motor, as well as the difference between the values for the two motors. All inputs were scaled to the range -0.5 to 0.5. Essentially this provides the robot with a simple model of the relative positioning of its legs.
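
For illustration, the state vector just described might be assembled as in the sketch below; the counter variables and the MAX_COUNT scaling constant are assumptions introduced for the example, not values taken from the implementation.

#define MAX_COUNT 40   /* assumed maximum value of the activity counters */

/* Build the three-element state for the walking task: the number of
   time-steps each motor has been active since its contact sensor last
   fired, plus the difference between the two, scaled to -0.5 .. 0.5. */
void build_walking_state(int left_count, int right_count, float *state)
{
    state[0] = (float)left_count  / MAX_COUNT - 0.5f;
    state[1] = (float)right_count / MAX_COUNT - 0.5f;
    state[2] = (float)(left_count - right_count) / (2 * MAX_COUNT);
}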

There were three possible actions to select from: both motors on full, left motor on full and right motor off, and left motor off and right motor on full. Therefore the system contained a linear approximator for each of these actions, with weights initially set to small random values.

The desired behaviour for the robot was to walk forward as quickly as possible. This was measured by placing the robot on a sheet of paper printed with a smooth greyscale gradient from black to white. After each time-step the brightness of the surface below the robot was sensed via a single light sensor, and this was compared to the corresponding value for the previous time-step to determine whether the robot had moved forward. Movements in a forward direction were rewarded with positive reinforcement, whilst backwards moves were penalised with negative reinforcement, and failure to move received zero reinforcement.
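
A sketch of this reward scheme is given below. Here read_light() is a hypothetical wrapper around the downward-facing light sensor, the reward magnitudes are illustrative, and the assumption that brighter readings correspond to forward movement simply reflects the robot facing the white end of the gradient.

extern int read_light(void);   /* hypothetical wrapper for the light sensor */

static int prev_light;         /* reading from the previous time-step */

/* Reward for the walking task: positive for forward movement (a brighter
   reading than last step), negative for backward movement, zero otherwise. */
float walking_reward(void)
{
    int light = read_light();
    float r = 0.0f;

    if (light > prev_light)
        r = 1.0f;              /* moved forward along the gradient */
    else if (light < prev_light)
        r = -1.0f;             /* moved backward */

    prev_light = light;
    return r;
}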

3.2. THE LINE-FOLLOWING TASK

The second task required a wheeled robot (based on the Roverbot design in the Lego Constructopedia, with the addition of two side-by-side light sensors mounted at the front of the chassis, as shown in Figure 2) to follow a black line marked on a white surface. The line was sufficiently wide for both sensors to be positioned over it at the same time if the robot was oriented directly along the line, and it formed a track consisting of a circuit with a mixture of 10 left and right hand bends of varying sharpness.

Figure 2: The robot configuration used in the line-following experiments.

A heuristic algorithm was constructed and tested to ensure that the task of following the line was possible. This algorithm used the ratio between the values of the left and right light sensors to select between three pre-defined behaviours (move forward with both motors on equally, turn left by setting the left motor to run backwards slowly whilst the right motor ran forwards at a higher speed, and turn right). A robot following this algorithm completed a lap of the circuit after the execution of around 2300 actions. Note that a delay loop was built into this algorithm to ensure that it was selecting actions at the same rate as the reinforcement learning system, so as to allow for valid comparisons to be made between the performance of the two control systems.
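
A sketch of this heuristic controller is shown below; the sensor wrappers, the assumption that brighter readings mean a sensor has drifted off the black line, and the ratio thresholds are all illustrative rather than the values actually used.

enum action { FORWARD, TURN_LEFT, TURN_RIGHT };

extern int read_left(void);    /* hypothetical wrappers for the two light sensors */
extern int read_right(void);

/* Heuristic line following: pick one of three fixed behaviours from the
   ratio of the two light sensor readings. If the left sensor is much
   brighter it has drifted off the line, so the line lies to the right. */
enum action heuristic_line_action(void)
{
    float ratio = (float)read_left() / (float)read_right();

    if (ratio > 1.2f)
        return TURN_RIGHT;
    if (ratio < 0.8f)
        return TURN_LEFT;
    return FORWARD;
}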

The inputs to the linear approximators used in the reinforcement learning system for this task were the values of the light sensors (scaled to a range of -0.5 to 0.5 by calibration relative to values obtained from a light and dark surface at the start of the training process). These approximators were used to select between the same three actions (forward, turn left, turn right) used by the heuristic line-following algorithm.
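
The calibration step might be implemented along the following lines (a sketch, in which dark_value and light_value are raw readings recorded over the line and over the white surface before training begins; the names and the clamping behaviour are assumptions):

/* Scale a raw light reading into the range -0.5 to 0.5 using calibration
   readings taken from a dark and a light surface at the start of training. */
float scale_light(int raw, int dark_value, int light_value)
{
    float x = (float)(raw - dark_value) / (float)(light_value - dark_value);

    if (x < 0.0f) x = 0.0f;    /* clamp readings outside the calibrated range */
    if (x > 1.0f) x = 1.0f;
    return x - 0.5f;           /* 0 -> -0.5 (dark), 1 -> 0.5 (light) */
}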

There was no direct way to measure the progress of the robot around the circuit, and so no reinforcement function which directly corresponded to the desired behaviour could be constructed. Instead two alternative reinforcement signals were used which attempted to approximately correspond to the desired behaviour. The details of these, and their effect, are discussed in the results section.

4. EXPERIMENTAL METHOD

Several trials were run for each of the learning tasks, using different initial random weights. Each trial consisted of many episodes. At the beginning of each episode the robot was placed in a starting position within the learning environment, and a button on the RCX block (the main body of the robot) was pressed to indicate whether this was a training or testing episode.


During training episodes, 95% of the actions were selected based on the expected values indicated by the linear approximators, whilst the remaining 5% of the actions were selected randomly, and the weights of the linear approximators were updated using the Sarsa algorithm, with a learning rate of 0.1 and a discounting factor of 0.9. During test episodes no weight updates were performed, and all actions were selected greedily according to the current policy. Episodes were ended manually via an RCX button whenever the robot left the surface prepared for the task, or when the task was successfully completed.

There was a large amount of manual involvement required to start and stop each training and testing episode, and in resetting the robot to its starting position prior to commencing each episode. Therefore it was not practical to run a large number of trials and analyse the results statistically. Given that the intention of this paper is to examine the suitability of the Mindstorms robot as a teaching platform rather than to compare alternative learning algorithms, this was not seen as a major issue. Hence the results presented will be primarily qualitative rather than quantitative in nature.

5. RESULTS AND DISCUSSION

5.1. THE WALKING TASK

The performance of the reinforcement learning algorithm on the walking task was consistently poor. The learning system always converged to a solution where both motors were constantly turned on. This can give rise to a moderately effective gait, but only when the legs are not initially symmetrically positioned. The robot was always started from a symmetrical configuration as this was its natural resting position when the motors were inactive. During training the occasional randomly selected action was sufficient to eventually break the symmetry of the legs and so the robot would make some limited progress along the measuring gradient. However during testing all actions were selected greedily, and the learning system did not prove capable of learning to produce this symmetry-breaking behaviour of its own accord, and therefore no forward progress was achieved during test episodes.

During these experiments several factors were observed which contributed to this failure. Firstly, the values of the light sensor were noted to be extremely noisy. Variations of up to 10% of the overall range of the light sensor were observed whilst the sensor was in a static position over a surface of constant intensity. This was further exacerbated by variations in the height or orientation of the light sensor relative to the surface as the robot moved. As a result the reinforcement signal received by the learning system was extremely noisy.

A second problem which was observed was slippage in the positioning of the robot. Whenever one of the motors was deactivated, the weight of the robot would cause the gears attached to that motor to turn back to a position where the robot body was as low as possible relative to the motor. This slippage could be slowed by applying the motor's brake, but this merely reduced the speed of the movement rather than eliminating it completely, and had the side-effect of rapidly depleting the batteries. The result of this unintended motion of the robot was a growing disparity between the robot's model of the position of its legs and their actual position over the course of each training episode. This was a major problem as the model of the robot's position was the only input provided to the learning system.

5.2. THE LINE-FOLLOWING TASK

After some experimentation with the reinforcement signal, the reinforcement learning system was much more successful on the line-following task than on the walking task.

In the initial trials, the reinforcement signal used rewarded the robot with positive reinforcement for any action which led to the robot remaining on the track (measured by applying a threshold to the summed value of the light sensors). As the actions available did not provide an option for staying still, this was expected to lead to the robot moving forward along the path and eventually traversing the circuit. However the learning algorithm discovered that alternating turning left and right allowed the robot to reverse slowly in a straight line, and hence maximal reinforcement could be achieved by travelling along a straight section of line at the beginning of the track, and then reversing back along that same section of track.

To overcome this, the reinforcement function was modified so as to only reward the robot on time-steps on which it moved forward whilst remaining on the track. This required the reinforcement function to take into account which action had been performed (as there was no other means of determining whether the robot had moved forward). Therefore the reinforcement function was no longer being determined strictly on the basis of the environment, which violates some of the principles of reinforcement learning. Unfortunately this could not be avoided given the limited sensory capabilities of the robot. (In principle the track itself could have been marked out with a greyscale gradient, but given the problems in sensing such a gradient which were observed during the walking task, this was not expected to be successful.)
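
For concreteness, the two reinforcement signals can be sketched as follows; the threshold value, the reward magnitudes, and the decision to treat only the forward action as forward motion are assumptions made for the example, since the paper does not give these details.

enum action { FORWARD, TURN_LEFT, TURN_RIGHT };

#define TRACK_THRESHOLD -0.4f   /* illustrative: both calibrated sensors dark */

/* On the track if the summed (calibrated) sensor values fall below a threshold. */
int on_track(float left, float right)
{
    return (left + right) < TRACK_THRESHOLD;
}

/* Initial reward: any action that keeps the robot over the line is rewarded.
   This is the signal that permitted the reverse-along-the-line behaviour. */
float line_reward_initial(float left, float right)
{
    return on_track(left, right) ? 1.0f : 0.0f;
}

/* Revised reward: reward only steps that remain on the line and were taken
   with an action assumed here to move the robot forward. */
float line_reward_revised(float left, float right, enum action a)
{
    return (on_track(left, right) && a == FORWARD) ? 1.0f : 0.0f;
}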

Using this revised reinforcement function, the learning system proved capable of learning to execute the task correctly in most trials. After around 20 training episodes (which took about 20 minutes to carry out), the robot successfully navigated an entire lap of the circuit in both training and testing modes. The number of actions required for this lap in testing mode was on the order of 1700, which was approximately 25% faster than the speed achieved by the heuristic algorithm. Table 1 shows an example of the improvement in the robot's performance over a series of training episodes. Note that to reduce the overall amount of time involved in training the robot, testing was only carried out when there had been signs of improved performance during the previous training episode.


Episode    Training              Testing
number     Time steps   Bends    Time steps   Bends
0          71           0
1          95           0        86           0
2          89           0
3          99           0
4          297          1        175          0
5          297          1
6          330          1
7          254          1
8          514          1
9          474          1
10         307          1
11         349          1
12         2075         Lap      263          1.5
13         452          1
14         341          1
15         280          1
16         395          1
17         2084         Lap      1710         Lap

Table 1: Sample results from training on the line-following problem. Results are reported in terms of the number of decision-making cycles for which the robot remained on the track, and the number of bends of the track successfully negotiated during this time.

Even with the revised reinforcement function occasional unsuccessful trials were still noted, in which the robot learnt an interesting form of aberrant behaviour. Early in training the robot is likely to lose the path, and then makes essentially random moves until it either rediscovers the path or leaves the training area (at which point the episode is ended). In some cases the robot made sufficient consecutive turns in the same direction to regain the path but facing in the opposite direction. If these episodes were allowed to continue, the behaviour which resulted was that the robot would follow the path until a bend was encountered, at which point it would turn 180° and head back in the opposite direction. This pattern of behaviour could be repeated each time a bend was reached whilst still receiving a high level of reinforcement.

6. CONCLUSION

The failure of the learning system on the walking task can be attributed to two factors: the noise in the light sensor, which resulted in a very noisy reinforcement signal, and the robot's inability to accurately sense its own configuration. For this task there was no input from the external environment, and so all state information was derived from the robot's model of its own current position. The combination of the inability to directly monitor the leg position, and the large amount of slippage which occurred when a motor was set to inactive, meant that the state information provided to the learning algorithm was extremely inaccurate. One possible solution would be to make use of rotation sensors (not supplied in the standard Mindstorms kit) to provide direct monitoring of the motor positions.

In contrast the input information for the line-following task was derived entirely from the external environment, with no need to monitor the robot's internal state. The noisiness of the light sensors also posed fewer difficulties in this task, as the information being derived from those sensors was less sensitive to small fluctuations over time. The primary issue affecting success on this task was the nature of the reinforcement signal. It was observed that using a reinforcement signal which does not directly correspond to the desired behaviour can lead to the robot learning aberrant behaviours which receive high reinforcement but which are not desirable. Far from being a problem, this occurrence is actually beneficial to the teaching of reinforcement learning, as it emphasises the key role played by the reinforcement function in this style of learning. Observations of similar unexpected behaviour have been made by other authors in reinforcement-style learning tasks; for example [14] reported this in the context of evolutionary computation.

Of more concern from an educational perspective, the Mindstorms robot's limited capacity for sensing its environment made construction of a suitable reinforcement function for the line-following task difficult. If the robot could measure greyscale values more reliably, it may have proved possible to mark the path with a greyscale gradient so that direction and distance of movement along the path could be sensed. This would have enabled the construction of a reinforcement signal derived entirely from the environment, with no need to explicitly consider the action being performed by the robot, which would have been more in line with the reinforcement learning paradigm.

The learning algorithm used in these experiments was a relatively simple form of reinforcement learning, using the single-step form of the temporal difference algorithm (TD(0)) in conjunction with a simplistic function approximator (the linear approximator). However the processing power of the RCX was more than sufficient to cope with the demands of this learning algorithm, and the software used could easily be extended to implement more complex forms of reinforcement learning such as TD(λ) or the use of small neural networks as approximators.

This study has shown that the combination of the Mindstorms robot and the LegOS programming environment is sufficient for student experimentation with reinforcement learning, as long as the limitations of the robots are taken into account in the choice of the task to be learnt. The key characteristics of the task are that the robot's behaviour should be determined by the state of the environment rather than by the robot's own configuration, and that it should be possible to define an appropriate reinforcement signal which corresponds as closely as possible to the desired behaviour of the robot whilst still being capable of being derived by the robot on the basis of its limited sensors.

7. ACKNOWLEDGMENTS

I wish to acknowledge the assistance of the Centre for Computational Intelligence at De Montfort University, Leicester, UK for the use of their facilities in conducting this research.

REFERENCES

[1] Sklar, E., Eguchi, A. & Johnson, J. (2002). RoboCupJunior: Learning with educational robotics. In Proceedings of RoboCup-2002: Robot Soccer World Cup VI.

[2] Kumar, D. & Meeden, L. (1998). A Robot Laboratory for Teaching Artificial Intelligence. In Joyce, D. (Ed.), Proceedings of the Twenty-ninth Technical Symposium on Computer Science Education (SIGCSE-98), ACM Press.

[3] Sklar, E. & Parsons, S. (2002). RoboCupJunior: A vehicle for enhancing technical literacy. In Proceedings of the AAAI-02 Mobile Robot Workshop.

[4] Mahadevan, S. & Connell, J. (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, Vol. 55, 311-365.

[5] Asada, M., Noda, S., Tawaratsumida, S. & Hosoda, A. (1996). Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning. Machine Learning, Vol. 23, 279-303.

[6] Baum, D. (n.d.). Retrieved February 10, 2003, from http://www.baumfamily.org/nqc/

[7] Noga, M. (n.d.). Retrieved February 8, 2003, from http://www.noga.de/legOS/

[8] Gaskett, C., Wettergreen, D. & Zelinsky, A. (1999). Q-Learning in Continuous State and Action Spaces. In Proceedings of the 12th Australian Joint Conference on Artificial Intelligence, Springer-Verlag.

[9] Rummery, G. A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems (Technical Report CUED/F-INFENG/TR 166). Cambridge University Engineering Department.

[10] Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, Vol. 3, 9-44.

[11] Sutton, R.S. (1996). Generalisation in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D.S., Mozer, M.C. & Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference (1038-1044). Cambridge, MA: The MIT Press.

[12] Stone, P., Sutton, R.S. & Singh, S. (2001). Reinforcement learning for 3 vs. 2 Keepaway. In Stone, P., Balch, T. & Kraetzschmar, G. (Eds.), RoboCup-2000: Robot Soccer World Cup IV. Berlin: Springer-Verlag.

[13] Graves, A.R. & Bierton, R.A. (2000). Investigation of Inter-Step Pause Values for a Simple Legged Robot. Unpublished internal report, Centre for Computational Intelligence, De Montfort University.

[14] Sims, K. (1994). Evolving Virtual Creatures. Computer Graphics (SIGGRAPH '94 Proceedings), 15-22.