
Learning Continuous State/Action Models for Humanoid Robots

Astrid Jackson and Gita Sukthankar
Department of Computer Science, University of Central Florida, Orlando, FL, U.S.A.

{ajackson, gitars}@eecs.ucf.edu

Abstract

Reinforcement learning (RL) is a popular choice for solving robotic control problems. However, applying RL techniques to controlling humanoid robots with high degrees of freedom remains problematic due to the difficulty of acquiring sufficient training data. The problem is compounded by the fact that most real-world problems involve continuous states and actions. In order for RL to be scalable to these situations it is crucial that the algorithm be sample efficient. Model-based methods tend to be more data efficient than model-free approaches and have the added advantage that a single model can generalize to multiple control problems. This paper proposes a model approximation algorithm for continuous states and actions that integrates case-based reasoning (CBR) and Hidden Markov Models (HMM) to generalize from a small set of state instances. The paper demonstrates that the performance of the learned model is close to that of the system dynamics it approximates, where performance is measured in terms of sampling error.

1 Introduction

In recent years, reinforcement learning (RL) has emerged as a popular choice for solving complex sequential decision-making problems in stochastic environments. Value learning approaches attempt to compute the optimal value function, which represents the cumulative reward achievable from a given state when following the optimal policy (Sutton and Barto 1998). In many real-world robotics domains, acquiring the experiences necessary for learning can be costly and time intensive. Sample complexity, defined as the number of actions it takes to learn an optimal policy, is a major concern. Kuvayev and Sutton (1996) showed significant performance improvements in cases where a model was learned online and used to supplement experiences gathered during the experiment. Once the robot has an accurate model of the system dynamics, value iteration can be performed on the learned model without requiring further training examples. Hester and Stone (2012) also demonstrated how the model can be used to plan multi-step exploration trajectories in which the agent is guided to explore high-uncertainty regions of the model.


Figure 1: Approximation of the transition probability density function for the joint motion of a humanoid robot. (a) The model is learned from the trajectory of the robot's left wrist. (b) The continuous state space is represented by the x-y-z position of the robot's end effectors; the continuous action space is composed of the forces on these effectors in each direction.

Many RL techniques also implicitly model the system as a Markov Decision Process (MDP). A standard MDP assumes that the set of states S and actions A are finite and known; in motion planning problems, the one-step transition probabilities P(s_i ∈ S | s_{i−1} ∈ S, a_{i−1} ∈ A) are known, whereas in learning problems the transition probabilities are unknown. Many practical problems, however, involve continuous-valued states and actions. In this scenario it is impossible to enumerate all states and actions, and equally difficult to identify the full set of transition probabilities. Discretization has been a viable option for dealing with this issue (Hester and Stone 2009; Diuk, Li, and Leffler 2009). Nonetheless, fine discretization leads to an intractably large state space which cannot be accurately learned without a large number of training examples, whereas coarse discretization runs the risk of losing information.

In this paper we develop a model learner suitable for continuous problems. It integrates case-based reasoning (CBR) and Hidden Markov Models (HMM) to approximate the successor state distribution from a finite set of experienced instances. It thus accounts for the infinitely many successor states ensuing from actions drawn from an infinite set. We apply the method to a humanoid robot, the Aldebaran Nao, to approximate the transition probability density function for a sequence of joint motions (Figure 1). The results show that the model learner is capable of closely emulating the environment it represents.

In the next section we provide formal definitions relevant to the paper. In Section 3 we give a detailed description of our Continuous Action and State Model Learner (CASML) algorithm. CASML is an extension of CASSL (Continuous Action and State Space Learner) (Molineaux, Aha, and Moore 2008) designed for learning the system dynamics of a humanoid robot. CASML distinguishes itself from CASSL in the way actions are selected. While CASSL selects actions based on quadratic regression methods, CASML attempts to estimate the state transition model utilizing a Hidden Markov Model (HMM). Section 4 describes our experiments, and in Section 5 we analyze the performance of our model approximation algorithm. Section 6 summarizes how our method relates to other work in the literature. We conclude in Section 7 with a summary of our findings and future work.

2 Background

2.1 Markov Decision Process

A Markov Decision Process (MDP) is a mathematical framework modeling decision making in stochastic environments (Kolobov 2012). Typically it is represented as a tuple (S, A, T, γ, D, R) with the following components (a minimal code sketch of this tuple is given after the list):

• S is a finite set of states.

• A is a finite set of actions.

• T = {P_sa} is the set of transition probabilities, where P_sa is the transition distribution, i.e. the distribution over successor states that results from taking action a ∈ A in state s ∈ S.

• γ ∈ (0, 1] is the discount factor which models the importance of future rewards.

• D : S → [0, 1] is an initial-state distribution, from which the initial state s_0 is drawn.

• R : S × A → R is a reward function specifying the immediate numeric reward value for executing a ∈ A in s ∈ S.
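The sketch below shows one way such an MDP specification might be held in code; the class and field names are illustrative and are not part of CASML.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int    # placeholder state type for a finite MDP
Action = int   # placeholder action type


@dataclass
class MDP:
    """Container for a finite MDP (S, A, T, gamma, D, R)."""
    states: List[State]                                           # S
    actions: List[Action]                                         # A
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # T = {P_sa}
    gamma: float                                                  # discount factor in (0, 1]
    initial_dist: Dict[State, float]                              # D : S -> [0, 1]
    reward: Callable[[State, Action], float]                      # R : S x A -> R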

2.2 Case-Based Reasoning

Case-based reasoning (CBR) is a problem-solving technique that leverages solutions from previously experienced problems (cases). By remembering previously solved cases, solutions can be suggested for novel but similar problems (Aamodt and Plaza 1994). A CBR cycle consists of four stages:

• Retrieve the most similar case(s), i.e. the cases most relevant to the current problem.

• Reuse the case(s) to adapt the solutions to the current problem.

Figure 2: Process of extracting the successor states by utilizing the transition case base. (a) Retrieve similar states (s_1, . . . , s_n) via kNN. (b) Determine the actions previously performed in these states. (c) Reuse by identifying similar actions (a_1, . . . , a_k) using cosine similarity. (d) Select the resulting states s_{i+1} as possible successor states.

• Revise the proposed solution if necessary (this step is similar to supervised learning in which the solution is examined by an oracle).

• Retain the new solution as part of a new case.

A distinctive aspect of CBR systems is that no attempt is made to generalize the cases they learn; generalization is performed only during the reuse stage.
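The cycle can be read as the interface of a case-based reasoner. The skeleton below is a hypothetical illustration of these four stages, not the CASML implementation; the class and method bodies are ours.

class CaseBase:
    """Minimal skeleton of the CBR cycle: retrieve, reuse, revise, retain."""

    def __init__(self):
        self.cases = []          # stored (problem, solution) pairs

    def retrieve(self, problem, k=5):
        """Return the k cases most similar to the current problem."""
        return sorted(self.cases, key=lambda c: self.distance(c[0], problem))[:k]

    def reuse(self, similar_cases, problem):
        """Adapt the retrieved solutions to the current problem (generalization happens here)."""
        return [solution for _, solution in similar_cases]

    def revise(self, solution, feedback):
        """Optionally correct the proposed solution using external feedback (an oracle)."""
        return solution

    def retain(self, problem, solution):
        """Store the new case verbatim; no generalization is applied at this stage."""
        self.cases.append((problem, solution))

    @staticmethod
    def distance(a, b):
        # Euclidean distance, assuming problems are numeric feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5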

3 Method

This section describes the Continuous Action and State Model Learner (CASML)¹, an instance-based model approximation algorithm for continuous domains. CASML integrates a case-based reasoner and a Gaussian HMM and returns the successor state probability distribution. At each time step, the model learner receives as input an experience of the form ⟨s_{i−1}, a_{i−1}, s_i⟩, where s_{i−1} ∈ S is the prior state, a_{i−1} ∈ A is the action that was taken in state s_{i−1}, and s_i ∈ S is the successor state. Both the states in S and the actions in A are real-valued feature vectors.

Much like in Molineaux, Aha, and Moore (2008), the case base models the effects of applying actions by maintaining the experienced transitions in the case base C_T : S × A × ΔS. To reason about possible successor states, C_T retains observations of the form c = ⟨s, a, Δs⟩, where Δs represents the change from the prior state to the current state.

¹Code available at https://github.com/evenmarbles/mlpy
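As an illustration (the record and field names are ours, not taken from the mlpy code base), a transition case can be stored as a simple record of the state, the action, and the observed state change:

from typing import NamedTuple
import numpy as np

class TransitionCase(NamedTuple):
    """One experienced transition c = <s, a, delta_s>."""
    s: np.ndarray        # prior state s_{i-1}
    a: np.ndarray        # action a_{i-1} taken in s_{i-1}
    delta_s: np.ndarray  # observed change s_i - s_{i-1}

# Example: a 3-D end-effector position with a small displacement
case = TransitionCase(s=np.array([0.10, 0.05, 0.20]),
                      a=np.array([0.5, 0.0, -0.3]),
                      delta_s=np.array([0.002, 0.000, -0.001]))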

The case base supports a case-based reasoning cycle as described in (Aamodt and Plaza 1994) consisting of retrieval, reuse, and retention. Since the transition model is fit directly to the experiences acquired from the environment, C_T is never revised.

In addition to the case base, a Gaussian HMM is trained with the state s_i of each received experience. While the case base allows for the identification of successor states, the HMM models the transition probability distribution for state sequences of the form ⟨s_{i−1}, s_i⟩.

At the beginning of each trial, the transition case base is initialized to the empty set. The HMM's initial state distribution and the state transition matrix are determined using the posterior of a Gaussian Mixture Model (GMM) fit to a subset of observations, and the emission probabilities are set to the mean and covariance of the GMM.
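One plausible reading of this initialization, sketched with sklearn's GaussianMixture and hmmlearn's GaussianHMM, is given below; the exact derivation of the start and transition probabilities from the GMM posterior is our assumption rather than a detail taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture
from hmmlearn import hmm

def init_hmm_from_gmm(states, n_components=5, seed=0):
    """Initialize a Gaussian HMM from a GMM fit to state observations.

    `states` is an (n_samples, n_features) array of observed states s_i.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                          random_state=seed).fit(states)
    resp = gmm.predict_proba(states)          # posterior over mixture components

    # Initial state distribution: posterior of the first observation (assumption).
    startprob = resp[0] / resp[0].sum()

    # Transition matrix: expected component-to-component counts from
    # consecutive posteriors, normalized row-wise (assumption).
    trans = resp[:-1].T @ resp[1:] + 1e-6
    transmat = trans / trans.sum(axis=1, keepdims=True)

    model = hmm.GaussianHMM(n_components=n_components, covariance_type='full',
                            init_params='')   # keep the manually set parameters
    model.startprob_ = startprob
    model.transmat_ = transmat
    model.means_ = gmm.means_                 # emission means from the GMM
    model.covars_ = gmm.covariances_          # emission covariances from the GMM
    return model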

New cases are retrieved, retained, and reused according to the CBR policies governing each method. A new case c^(i) = ⟨s_{i−1}, a_{i−1}, Δs⟩, where Δs = s_i − s_{i−1}, is retained only if it is not correctly predicted by C_T, which is the case if one of the following conditions holds:

1. The distance between c^(i) and its nearest neighbor in C_T is above the threshold:

   d(c^(i), 1NN(C_T, c^(i))) > τ.

2. The distance between the actual and the estimated transition is greater than the permitted error:

   d(c^(i)_Δs, C_T(s_{i−1}, a_{i−1})) > σ.

The similarity measure for case retention in C_T is the Euclidean distance between the state features of the cases.
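A hedged sketch of this retention test follows; the way the case base's estimated transition is aggregated from nearby cases is our assumption, and TransitionCase is the record sketched earlier.

import numpy as np

def should_retain(case_base, new_case, tau=0.005, sigma=0.001):
    """Retain a new case <s, a, delta_s> only if the case base fails to predict it."""
    if not case_base:
        return True

    # Condition 1: the new case is far from its nearest neighbor (state features).
    nn_dist = min(np.linalg.norm(c.s - new_case.s) for c in case_base)
    if nn_dist > tau:
        return True

    # Condition 2: the estimated transition for (s, a) is too inaccurate.
    # Here the estimate is the mean delta of the nearest cases; the actual
    # aggregation used by CASML may differ.
    neighbors = sorted(case_base, key=lambda c: np.linalg.norm(c.s - new_case.s))[:5]
    est_delta = np.mean([c.delta_s for c in neighbors], axis=0)
    if np.linalg.norm(new_case.delta_s - est_delta) > sigma:
        return True

    return False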

Algorithm 1 Approximation of the successor state probability distribution for continuous states and actions

1: C_T : transition case base ⟨S × A × ΔS⟩
2: H : HMM trained with state features s
3: T : transition probabilities, T : S × A × S → [0, 1]
4: ————————————————
5: procedure P(s_i, a_i)
6:     C_S ← retrieve(C_T, s_i)
7:     C_A ← reuse(C_S, a_i)
8:     T(s_i, a_i) ← ∅
9:     ∀c ∈ C_A : T(s_i, a_i) ← T(s_i, a_i) ∪ ⟨c_s, predict(H, ⟨s_i, s_i + c_Δs⟩)⟩
10:    return normalize(T(s_i, a_i))
11: end procedure
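Read as Python, and assuming the TransitionCase records and the hmmlearn model sketched earlier, the procedure might look as follows; the helper structure and the use of hmmlearn's score for the forward pass are our assumptions, while the authors' own implementation is available in their mlpy package.

import numpy as np

def successor_distribution(case_base, hmm_model, s_i, a_i,
                           radius=0.001, rho=0.97):
    """Approximate P(s_{i+1} | s_i, a_i) from the transition case base and the HMM."""
    # Retrieve: cases whose state lies within `radius` of s_i (Euclidean distance).
    C_S = [c for c in case_base if np.linalg.norm(c.s - s_i) <= radius]

    # Reuse: keep cases whose action is cosine-similar to the queried action.
    def cos_sim(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    C_A = [c for c in C_S if cos_sim(c.a, a_i) >= rho]

    # Predict: score each one-step sequence <s_i, s_i + delta_s> with the
    # forward algorithm (hmmlearn's `score` returns the sequence log-likelihood).
    successors, log_liks = [], []
    for c in C_A:
        s_next = s_i + c.delta_s
        successors.append(s_next)
        log_liks.append(hmm_model.score(np.vstack([s_i, s_next])))

    if not successors:
        return [], np.array([])

    # Normalize over the exponentiated log-likelihoods (a numerically stable softmax).
    log_liks = np.array(log_liks)
    probs = np.exp(log_liks - log_liks.max())
    probs /= probs.sum()
    return successors, probs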

Given a set of k episodes, the initial state distribution is defined as D = {s_0^(0), s_0^(1), . . . , s_0^(k)}. Therefore, the initial state is added to D at the beginning of each episode, such that D = D ∪ {s_0^(i)}. Initial states are drawn uniformly from D.

To infer the successor state probability distribution for an action a_i taken in state s_i, CASML consults both the transition case base and the HMM. The process is detailed in Algorithm 1.

Figure 3: Estimation of the state transition probabilities for a given state and a given action. The forward algorithm of the HMM is used to calculate the probability of each one-step sequence ⟨s_i, s_{i+1}⟩ that was identified utilizing the case base (Figure 2(d)).

First, the potential successor states must be identified. This is accomplished by performing case retrieval on the case base, which collects the states similar to s_i into the set C_S (see Figure 2(a)). Depending on the domain, the retrieval method is either a k-nearest neighbor or a radius-based neighbor algorithm using the Euclidean distance as the metric. The retrieved cases are then narrowed down to the set C_A by the reuse stage (see Figures 2(b) and 2(c)): C_A consists of all those cases c_k = ⟨s_k, a_k, Δs_k⟩ ∈ C_S whose action a_k is cosine similar to a_i, which is the case if d_cos(a_k, a_i) ≥ ρ. At this point all successor states s_{i+1} can be predicted by the vector addition s_i + Δs_k (see Figure 2(d)).

The next step is to ascertain the probabilities of transitioning into each of these successor states (Figure 3). The log likelihood of the evidence, i.e. the one-step sequence ⟨s_i, s_{i+1}⟩, for each of the potential state transitions is obtained by running the forward algorithm of the HMM. The successor state probability distribution is then obtained by normalizing over the exponential of the log likelihoods of all observation sequences.
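Tying the sketches above together, a hypothetical usage could look like this, where `experiences`, `s_query`, and `a_query` are assumed inputs (a list of ⟨s_{i−1}, a_{i−1}, s_i⟩ tuples logged from the simulator and the queried state and action), not symbols from the paper.

import numpy as np

# Build the case base and the HMM from logged experiences, then query the model.
case_base = []
for s_prev, a_prev, s_next in experiences:
    c = TransitionCase(np.asarray(s_prev), np.asarray(a_prev),
                       np.asarray(s_next) - np.asarray(s_prev))
    if should_retain(case_base, c):
        case_base.append(c)

hmm_model = init_hmm_from_gmm(np.array([s for s, _, _ in experiences]))
successors, probs = successor_distribution(case_base, hmm_model,
                                            np.asarray(s_query), np.asarray(a_query))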

4 Experiments

Our goal was for a humanoid robot, the Aldebaran Nao, to learn the system dynamics involved in performing a joint motion. For our experiment we first captured a motion using an Xbox controller to manipulate the robot's end effectors. The x-, y-, and z-positions of the end effectors serve as data points for the joint trajectory, and the continuous actions are the forces applied to the end effectors in each of the coordinate directions. The force is determined by the positional states of the controller's left and right thumb sticks, which produce a continuous range in the interval [−1, +1] and generate a displacement that does not exceed 4 millimeters. We record the actions applied at each time step as the policy that will be sampled from the model. All experiments were performed in the Webots simulator from Cyberbotics²; however, they are easily adaptable to work on the physical robot. Henceforth, the trajectories acquired from the simulator will be referred to as the sampled trajectories and the trajectories sampled from the model will be referred to as approximated trajectories.

²http://www.cyberbotics.com


Figure 4: Performance of the model as quantified by the approximation error. (a) The accuracy of the model identified by various error metrics. (b) Comparison of the estimated trajectory and the actual trajectory obtained from the simulator. The error measurements and the estimated trajectory are averaged over 50 instances, each estimated from a model instance that was trained with a set of 50 trajectories sampled in the simulator by following the action sequence.

For our first set of experiments we generated one action sequence for the movement of the robot's left arm, which resulted in a 3-dimensional state feature vector for the joint position of the wrist. The model was trained with 50 trajectories sampled by following the previously acquired policy in the simulator. Subsequently, a trajectory was approximated using the trained model. Figure 4 illustrates the performance of the algorithm in terms of a set of error measurements. Given the n-length sampled trajectory Tr_s = {s_{s,0}, s_{s,1}, . . . , s_{s,n}}, the n-length approximated trajectory Tr_a = {s_{a,0}, s_{a,1}, . . . , s_{a,n}}, and the effect of each action (the true action), stated as Δs_{s,t} = s_{s,t} − s_{s,t−1} and Δs_{a,t} = s_{a,t} − s_{a,t−1} for the sampled and the approximated trajectories respectively, the error metrics are defined as follows:

• The action metric identifies the divergence of the true action from the intended action. Specifically, we denote the action metric by

  | (1/(n−1)) Σ_{t=0}^{n} ||a_t − Δs_{s,t}||_2 | − | (1/(n−1)) Σ_{t=0}^{n} ||a_t − Δs_{a,t}||_2 |

• The minmax metric measures whether an approximated state is within the minimum and maximum bounds of the sampled states. Assume a set of m sampled trajectories {s_0^(i), s_1^(i), . . . , s_n^(i)}, i = 1, . . . , m, and let err_min = ||min(s_t^([1:m])) − s_{a,t}||_2 and err_max = ||max(s_t^([1:m])) − s_{a,t}||_2 be the distances of s_{a,t} from the minimum and the maximum bounds respectively. Then

  minmax = 0 if min(s_t^([1:m])) ≤ s_{a,t} ≤ max(s_t^([1:m])), and minmax = min(err_min, err_max) otherwise.

Figure 5: Correlation between the accuracy of the model, the number of cases retained in the case base, and the number of approximation failures, where accuracy is traded for efficiency. (a) Accuracy of the approximated trajectory and (b) correlation between the approximation errors, the number of approximation failures, and the number of cases, plotted as a function of the radius employed by the neighborhood algorithm.

• The delta metric measures the difference between the approximated and the sampled true action. It is given by

  (1/(n−1)) Σ_{t=0}^{n} ||Δs_{s,t} − Δs_{a,t}||_2

• The position metric analyzes the similarity of the resulting states and is given by

  (1/(n−1)) Σ_{t=0}^{n} ||s_{s,t} − s_{a,t}||_2
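For reference, the action, delta, and position metrics can be computed directly from one sampled and one approximated trajectory; the sketch below assumes numpy arrays and omits the minmax metric, which additionally needs the full set of sampled trajectories.

import numpy as np

def trajectory_errors(traj_s, traj_a, actions):
    """Compute the action, delta, and position metrics for one trajectory pair.

    traj_s, traj_a : arrays of shape (n+1, d) -- sampled and approximated states.
    actions        : array of shape (n, d)    -- the intended actions a_t.
    """
    n = traj_s.shape[0] - 1
    d_s = np.diff(traj_s, axis=0)        # true effects Delta s_{s,t}
    d_a = np.diff(traj_a, axis=0)        # approximated effects Delta s_{a,t}

    # Difference of the two averaged deviations (each sum is nonnegative).
    action_err = (np.sum(np.linalg.norm(actions - d_s, axis=1))
                  - np.sum(np.linalg.norm(actions - d_a, axis=1))) / (n - 1)
    delta_err = np.sum(np.linalg.norm(d_s - d_a, axis=1)) / (n - 1)
    position_err = np.sum(np.linalg.norm(traj_s - traj_a, axis=1)) / (n - 1)
    return action_err, delta_err, position_err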

We set τ = 0.005, σ = 0.001, and ρ = 0.97. Furthermore, we set the retrieval method to the radius-based algorithm with the radius set to 1 mm.

5 Results

The results in Figure 4(a) show that the model captures the amount of deviation from the intended action quite accurately (see the action metric). The fact that the minmax error is relatively small shows that the one-step sequences manage to stay in close proximity to the experienced bounds. However, by requiring the joint displacement to be within the bounds of the sampled states, the minmax metric makes a stronger claim than the action metric, which only concedes that the motion deviated from the intended action. An even stronger claim is made by the delta metric, which analyzes the accuracy of the motion. As expected, the delta error is comparatively larger, since in a stochastic environment it is less likely to achieve the exact same effect multiple times. Though not shown in Figure 4(a), the position error is even more restrictive, as it requires the location of the joints to be in the exact same place and accumulates rapidly after the first deviation of motion. Nevertheless, as can be seen in Figure 4(b), the approximated trajectory is almost identical to the sampled trajectory.

The case retrieval method used for this experiment was the radius-based neighbor algorithm, a variant of k-nearest neighbor that finds the cases within a given radius of the query case. By allowing the model to only select cases within the current state's vicinity, joint limits can be modeled. The drawback is that, in the case of uncertainty in the model, it is less likely to find any solution. To gain a better understanding, we compared the effect of varying the training radius on the model's performance. Otherwise the setup was similar to that of the previous experiment. Figure 5(a) shows that as the size of the radius increases, the accuracy of the model drops. This outcome is intuitive considering that as the radius increases, the solution includes more cases that diverge from the set of states representing this step in the trajectory. At the same time the number of cases retained by the case base decreases. Therefore, the found solutions are less likely to be exact matches to that state. Interestingly, even though the delta measure is more restrictive than the minmax metric, it does not grow as rapidly. This can be explained by the reuse method of the case base, which requires cases to have similar actions to be considered part of the solution. This further suggests that the model correctly predicts the effects of the actions and thus generalizes effectively. As is evident from Figure 5, the accuracy of the model is correlated with the number of approximation failures, defined as incomplete trajectories. The effect is that more attempts have to be made to estimate the trajectory from the model with smaller radii. This makes sense since deviations from the sampled trajectories are more likely to result in a scenario where no cases with similar states and actions can be found. In this experiment the problem is compounded since no attempt at exploration is made. There is an interesting spike in the number of failures when the radius is set to 10 mm (see Figure 5(b)). An explanation for this phenomenon is that for small radii, there is a lesser likelihood of large deviations; therefore, the solution mostly consists of the set of cases that were sampled at this step in the trajectory. On the other hand, with larger radii the solution may include more cases, which allows for generalization from other states. For medium radii, there is a greater potential for larger deviations when the radius does not encompass sufficient cases for good generalization.

Figure 6: Comparison of the trajectory extracted from controlling the left arm in the simulator and the trajectory approximated by the model after it was trained with ten distinctive trajectories. The deviations of the wrist joint position in the x-, y-, and z-directions are scaled to cm. The approximated trajectory shown is an average over 20 instances.

A third experiment illustrates the performance of the algorithm on estimating a trajectory that the model had not previously encountered during learning. For this purpose the model was trained on a set of ten distinctive trajectories which were derived from following unique policies. The policies were not required to be of the same length; however, each trajectory started from the same initial pose. The model was then used to estimate the trajectory of a previously unseen policy. We continued to utilize the radius-based neighbor algorithm as the similarity measure and set the radius to 15 mm. We also set ρ = 0.95 to allow for more deviation from the action. This was necessary, as the model needed to generalize more on unseen states than was required in the previous experiments. There were also far fewer instances to generalize from, since the model was only trained on 10 trajectories. Furthermore, only 30% of the experienced instances were retained by the model's case base. A comparison of the approximated and the sampled trajectory is depicted in Figure 6, which shows that the model is capable of capturing the general path of the trajectory. Figure 7 quantitatively supports this claim. Even the position metric, which is the most restrictive, shows that the location of the robot's wrist deviates on average by only 3.45 mm from the expected location.

Figure 7: The accuracy of the model after ten training samples. The results are averages over 20 approximated trajectories. The minmax error was calculated by sampling 20 trajectories in the simulator for validation.


6 Related Work

Prior work on model learning has utilized a variety of representations. Hester and Stone (2009) use decision trees to learn states and model their relative transitions. They demonstrate that decision trees provide good generalization to unseen states. Diuk, Li, and Leffler (2009) learn the structure of a Dynamic Bayesian Network (DBN) and its conditional probabilities. The possible combinations of the input features are enumerated as elements, and the relevance of the elements is measured in order to make a prediction. Both of these model learning algorithms operate only in discretized state and action spaces and do not easily scale to problems with inherently large or continuous domains.

Most current algorithms for reinforcement learning problems in continuous domains rely on model-free techniques. This is largely due to the fact that even if a model is known, a usable policy is not easily extracted for all possible states (Van Hasselt 2012). However, research into model-based techniques has become more prevalent as the need to solve complex real-world problems has risen. Ormoneit and Sen (2002) describe a method for learning an instance-based model using a kernel function; all experienced transitions are saved to predict the next state based on an average of nearby transitions, weighted using the kernel function. Deisenroth and Rasmussen (2011) present an algorithm called Probabilistic Inference for Learning Control (PILCO), which uses Gaussian Process (GP) regression to learn a model of the domain and to generalize to unseen states. Their method alternately takes batches of actions in the world and then re-computes its model and policy. Jong and Stone (2007) take an instance-based approach to solving continuous-state reinforcement learning problems. By utilizing the inductive bias that actions have similar effects in similar states, their algorithm derives the probability distribution through Gaussian weighting of similar states. Our approach differs from theirs by deriving the successor states while taking into account the actions performed in similar states; the probability distribution is then determined by a Hidden Markov Model.

7 Conclusion and Future Work

In this paper, we presented CASML, a model learner for continuous states and actions that generalizes from a small set of training instances. The model learner integrates a case-based reasoner and a Hidden Markov Model to derive successor state probability distributions for unseen states and actions. Our experiments demonstrate that the learned model effectively expresses the system dynamics of a humanoid robot. All the results indicate that the model learned with CASML was able to approximate the desired trajectory quite accurately.

In future work we are planning to add a policy for revising cases in the model's case base to account for the innate bias towards the initially seen instances. We believe this will lead to an improvement in the case base's performance, since fewer cases will be required to achieve good generalization. In addition, we plan to evaluate the performance of the model learner in combination with a value iteration planner for generating optimal policies in continuous state and action spaces.

References

Aamodt, A., and Plaza, E. 1994. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7(1):39–59.

Deisenroth, M., and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning (ICML-11), 465–472.

Diuk, C.; Li, L.; and Leffler, B. R. 2009. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 249–256. ACM.

Hester, T., and Stone, P. 2009. Generalized model learning for reinforcement learning in factored domains. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems - Volume 2, 717–724. International Foundation for Autonomous Agents and Multiagent Systems.

Hester, T., and Stone, P. 2012. Learning and using models. In Reinforcement Learning. Springer. 111–141.

Jong, N. K., and Stone, P. 2007. Model-based exploration in continuous state spaces. In Abstraction, Reformulation, and Approximation. Springer. 258–272.

Kolobov, A. 2012. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1):1–210.

Kuvayev, L., and Sutton, R. S. 1996. Model-based reinforcement learning with an approximate, learned model. In Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems.

Molineaux, M.; Aha, D. W.; and Moore, P. 2008. Learning continuous action models in a real-time strategy environment. In FLAIRS Conference, volume 8, 257–262.

Ormoneit, D., and Sen, Ś. 2002. Kernel-based reinforcement learning. Machine Learning 49(2-3):161–178.

Sutton, R. S., and Barto, A. G. 1998. Introduction to Reinforcement Learning. MIT Press.

Van Hasselt, H. 2012. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning. Springer. 207–251.
