
MSc Artificial Intelligence
Intelligent Systems Track

Decision-Making under Uncertainty for Social Robots

by

Richard Pronk

6382541

Master's thesis
42 EC

Master's programme Artificial Intelligence
University of Amsterdam

A thesis submitted in conformity with the requirements for the degree of MSc. in Artificial Intelligence

Supervisors:

João Messias and Maarten van Someren
Informatics Institute, Faculty of Science,

Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam

Defended on August 30, 2016


Abstract

In this thesis, we explore the problem of learning social behaviour for a semi-autonomous telepresence robot. We extract social cost functions from instantaneous and delayed feedback obtained from real-world experiments, so as to establish a performance metric for the behaviour of our autonomous agent in a social setting. Based on the identified social cost functions, we extract additional insight regarding the normative behaviour of the agent as a response to perceived social cues, and discuss the relevant state features to be used for autonomous control in the task of social positioning during conversation. We also propose a discretized proxemics-based state space representation for this task given the feedback collected during our experiments, and perform a statistical analysis on the relevance of the state features in a group conversation setting.

In the second part of this work, we compare and discuss different approaches to the problem of obtaining a control policy for our task of positioning the robot during social conversation. This is a challenging task due to the limited information available to the agent during execution. Specifically, we study the performance as well as the ease of implementation of Partially Observable Markov Decision Processes (POMDPs) against a Long Short-Term Memory (LSTM) recurrent neural network. We show by validating both methods on real-world experiments that the LSTM outperforms the POMDP in our task, and offer explanations for that result which can help guide the deployment of these frameworks for more general problems of decision-making under uncertainty in the field of Robotics.


Contents

1 Introduction
  1.1 The TERESA project
  1.2 Robot hardware and software
  1.3 Notation conventions

2 Social cost functions
  2.1 Background
  2.2 Defining the state space
  2.3 Initial experiment
    2.3.1 Experimental set-up
    2.3.2 Results
  2.4 Group conversations
    2.4.1 Experiment overview
    2.4.2 Results

3 Decision making frameworks
  3.1 Background
    3.1.1 Markov Decision Processes
    3.1.2 Recurrent Neural Networks
  3.2 Defining the state and action space
  3.3 Semi-asynchronous execution
  3.4 Initial experiment (data-gathering)
  3.5 Implementation of frameworks
    3.5.1 Partially Observable Markov Decision Process (POMDP)
    3.5.2 Long Short Term Memory (LSTM)
  3.6 Comparing POMDP and LSTM

4 Conclusion

5 Future Work

A Deploying the TERESA code
  A.1 Simulator

B Value tables


1 Introduction

Robots are becoming more and more popular in real-world settings [27] and have shown the potential to assist in many real-world tasks, e.g., helping the elderly [5]. In these kinds of tasks, robots act in open human-populated environments, which has proven difficult due to the highly unpredictable structure of such environments [28]. Next to the issue of acting in a noisy environment, behaving in a socially intelligent manner has also been shown to be an important factor for the acceptance of an autonomous agent in a human-populated environment [31, 40].

This thesis revolves around the TERESA¹ project, a project aiming to develop a telepresence system meant to assist the elderly in maintaining their social activities despite their reduced mobility. To achieve this, a telepresence robot is used which is equipped with semi-autonomous behaviours (i.e. behaviours for which the autonomous agent takes over some of the low-level tasks) to assist the elderly in controlling the robot. It has been shown that elderly users have difficulties with controlling the robot while simultaneously conversing [11], but semi-autonomy could relieve them of the cognitive effort of controlling the robot.

The autonomous behaviour of the TERESA robot must be socially intelligent, meaning that the robot should behave in a socially acceptable manner. Social acceptability, however, is a subjective measurement of performance and is therefore hard to define in a way that can be used as a basis for the autonomous decision-making of a robot. Real-world experiments are therefore needed in order to obtain quantitative measures of socially acceptable behaviour for our robot.

Next to the issue of defining a performance measure for the robot, the robot itself also has incomplete information regarding its situation. This means that the robot does not know the true state of its environment, and it is therefore possible for the robot to receive false or incomplete information. The algorithms created for the autonomous agent should therefore be able to handle uncertainty due to the imperfect sensors of the robot.

This thesis is divided into two main parts: in the first part we explore the definition of social correctness for our robot, from which we will extract social cost functions. In the second part, we discuss the development of a decision-theoretic controller for the behaviour of the robot during conversation and compare it to a state-of-the-art recurrent neural network on the same problem.

In this section, we first give a general introduction regarding the TERESA project in Section 1.1, where we explain the motivation and contributions of this thesis to the TERESA project. Next, we give an overview of the robot's hardware and software in Section 1.2, and finally provide an overview of the notation used regarding the robot in Section 1.3.

¹TERESA: TElepresence REinforcement-learning Social Agent. Project reference: FP7-ICT-2013-10. Web: http://www.teresaproject.eu

1.1 The TERESA project

The TERESA project aims to develop a socially intelligent semi-autonomous telepresence system that allows a remote controller to be more physically present than with standard teleconferencing, while minimizing the amount of input required from the controller.

The idea is that the TERESA robot will take over low-level tasks such as navigation and positioning, so that the controller only has to indicate who he/she wants to communicate with. Socially acceptable behaviour for the robot is, however, difficult to define manually and should therefore be learned from real-world experiments.

The TERESA project contains many challenges, such as controlling the motion of the robot in a socially intelligent manner, as well as perceptual problems such as the robust detection of the persons around the robot, the detection of social cues such as hearing problems, and extracting feedback from facial expressions.

This thesis focuses mainly on the social conversation aspect of the project, where we explore the social correctness of the positioning and behaviours of the robot during conversation by performing real-world experiments.

This problem could in principle be solved with methods from the fields of Robotics or Control Theory together with a hard-coded or expert-defined measure of social correctness. The TERESA project, however, aims to use reinforcement learning to overcome these challenges, as these methods are especially suited to learning policies for decision-making under uncertainty that are difficult to prescribe manually.

The main contributions of this thesis to the TERESA project are:

- Social cost functions, which are required to define social correctness during conversation.

- A comparison between two different types of low-level controllers, to assist future development of such a controller for the TERESA project.


(a) Front view of the robot (b) Side view of the robot

Figure 1: Front and side view of our robot

1.2 Robot hardware and software

The robot that we are using is built upon the commercially available Giraff platform² with an additional computer mounted on the side of the robot. In our experiments, the computer present within the Giraff robot is only used for the video conferencing software; this computer does not control the robot, although it is capable of doing so.

The Giraff robot is by default equipped with a fish-eye camera and an on-board microphone, which are again only used by the video conferencing software and are not used by our algorithms directly. Next to the camera and microphone of the Giraff robot, the following additional sensors are connected to the TERESA robot:

• 2 Hokuyo UST 10X laser rangefinders

• Asus Xtion PRO LIVE (RGBD sensor)

• VoiceTracker II array microphone

• XSense MTi-30 AHRS gyroscope

The two additional laser range-finders provide a 360-degree field of view around the robot, whereas the additional RGBD sensor only faces forward (see Figure 1). The additional computer mounted on the side of the robot runs Ubuntu 14.04 LTS (Trusty) and uses the Robot Operating System (ROS) [25], version Indigo, as the framework for our software.

²http://www.giraff.org/

In our experiments, third-party detection software from the SPENCER project³ [19, 20] is used for the detection of the interaction targets around the robot.

³SPENCER: Social situation-aware perception and action for cognitive robots. Project reference: FP7 ICT 2011-9. Web: http://www.spencer.eu/

The self-localization of the robot within the room is done using Hector SLAM [18], which also simultaneously maps the room. This mapping and self-localization is mainly used for the post-annotation of the experiments and is not used by the algorithms discussed throughout this thesis.

1.3 Notation conventions

In this section, we provide some of the notation used throughout this thesis related to the positioning and orientation of the robot and the interaction target. Throughout this thesis, the robot is denoted as $R$ and the interaction target is denoted as $T$. Furthermore, we denote positions as $p \in \mathbb{R}^2 := [x\ y]^T$ and postures as $q \in \mathbb{R}^2 \times [-\pi, \pi] := [p^T\ \theta]^T$. The position of the interaction target, for example, can therefore be represented as $p_T$. This representation, however, does not yet state in what frame of reference the position of the interaction target is given.

There are two frames of reference used to represent the positions of objects of interest in our control tasks: the world frame $w$ and the robot frame $r$. The world frame is fixed to a static point within the operational environment of the robot (e.g. an arbitrarily chosen point within a room).

The robot frame, on the other hand, is fixed to the robot itself (with the x-axis facing forward), where the position of the robot is represented within the world frame. The position of the interaction target $T$ can therefore be represented in both the world frame $w$ and the robot frame $r$ (i.e. $p^w_T$ for the world frame and $p^r_T$ for the robot frame).

In Figure 2, an overview of the world frame is given. The posture of the robot and the position of the interaction target relative to the world frame $w$ and to the robot frame $r$ are defined as follows:

$$q^w_R = \begin{bmatrix} x^w_R \\ y^w_R \\ \theta^w_R \end{bmatrix}, \qquad p^r_T = \begin{bmatrix} x^r_T \\ y^r_T \end{bmatrix}, \qquad p^w_T = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & x^w_R \\ \sin(\theta) & \cos(\theta) & y^w_R \end{bmatrix} \begin{bmatrix} x^r_T \\ y^r_T \\ 1 \end{bmatrix}$$

Figure 2: Overview of the world frame (top-down view)

Given this representation, the distance between the interaction target $T$ and the robot $R$ can be calculated as follows:

$$d(T, R) = \sqrt{(x^w_T - x^w_R)^2 + (y^w_T - y^w_R)^2} = \sqrt{(x^r_T)^2 + (y^r_T)^2}. \qquad (1)$$

In this thesis, the orientation of the interaction target is not used. Rather, we are interested in the bearing of the robot with respect to the position of the interaction target, $\varphi$, which can be calculated as follows:

$$\varphi = \arctan\!\left(y^w_T - y^w_R,\; x^w_T - x^w_R\right) - \theta^w_R = \arctan\!\left(y^r_T,\; x^r_T\right). \qquad (2)$$

As can be seen in Equation 2, the bearing with respect to the interaction target in the robot frame can be calculated by a straightforward arctangent based on the position of the interaction target. In the world frame, however, the position $[x^w_R,\ y^w_R]$ and orientation $\theta^w_R$ of the robot also need to be taken into account.
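To make the notation concrete, the following Python sketch computes the distance of Equation 1 and the bearing of Equation 2 from world-frame quantities. The function and variable names are illustrative and not part of the TERESA software.

```python
import math

def distance_and_bearing(robot_pose, target_pos):
    """Distance (Eq. 1) and bearing (Eq. 2) of the target w.r.t. the robot.

    robot_pose: (x, y, theta) of the robot in the world frame.
    target_pos: (x, y) of the interaction target in the world frame.
    """
    x_r, y_r, theta_r = robot_pose
    x_t, y_t = target_pos

    # Euclidean distance between robot and target (Eq. 1).
    d = math.hypot(x_t - x_r, y_t - y_r)

    # Bearing of the target relative to the robot's heading (Eq. 2),
    # wrapped to [-pi, pi] for convenience.
    phi = math.atan2(y_t - y_r, x_t - x_r) - theta_r
    phi = math.atan2(math.sin(phi), math.cos(phi))
    return d, phi

# Example: target 1 m ahead and 1 m to the left of a robot at the origin.
print(distance_and_bearing((0.0, 0.0, 0.0), (1.0, 1.0)))  # (~1.41, ~0.79 rad)
```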

2 Social cost functions

One of the objectives of the TERESA project is to find the optimal position of the robot with respect to the interaction target (i.e. the distance and bearing) during conversation. In order for an agent to be able to behave in an optimal manner, however, it requires a performance measurement that describes the relative utility of the different situations that the robot may encounter. For example, a performance measurement might describe that reaching the end of a maze is a desirable state to be in, whereas being in any other state of the maze is undesired. Knowing this, the agent can optimize its behaviour by taking the actions that lead to the end of the maze the


quickest, while at the same time minimizing the number of visited states that are undesired. Note that a performance measurement describing the undesirability of states essentially describes the cost of being in a certain state. These performance measurements are therefore also called cost functions, as they provide a mapping from states to costs (see Section 2.1).

In our task, the agent acts in a human-populated environment where it must behave in a socially acceptable manner. This means that the performance measurement for our agent should include a measurement of social acceptability. Cost functions that include this measurement of social acceptability to derive the cost of a state are also called social cost functions.

Although various studies have already been performed on finding the optimal position of a robot during human-robot interaction [29, 33], we cannot assume that the same methods can be applied to our telepresence robot. With a telepresence robot it is, for example, important that the interaction target as well as the pilot are able to clearly see and hear each other during the teleconference. This requirement might not be present for different types of robots, leading to completely different social cost functions.

In this section, we provide a method for extracting these social cost functions from real-world experiments in order to obtain quantitative measures of socially acceptable behaviour for our telepresence robot. First, we extract the social cost functions given instantaneous negative feedback in a single interaction target setting in Section 2.3. Next, in Section 2.4, we extract social cost functions for the positioning of the robot within a group conversation setting (i.e. situations with more than one interaction target). Here, we extract the social cost functions from delayed feedback provided by the interaction target at the end of each episode, consisting of an evaluation of the overall quality of the behaviours of the robot.

2.1 Background

A cost function provides a mapping from a state $s \in S$ to a cost, $C : S \to D$, where the cost domain $D$ can be either discrete (e.g. $D = \{\text{low}, \text{normal}, \text{high}\}$) or continuous (e.g. $D = \mathbb{R}$). In our particular setting, we have two features that influence the cost of each state: a continuous cost for the bearing as well as a continuous cost for the distance with respect to the interaction target. This means that we have a cost function mapping from the bearing state $s_b \in S_b$ to a cost, $C_b : S_b \to \mathbb{R}$, and a mapping from the distance state $s_d \in S_d$ to a cost, $C_d : S_d \to \mathbb{R}$, with $S = S_d \times S_b$. The cost for state $s \in S$ is then given by some combination of the costs for $s_b \in S_b$ and $s_d \in S_d$ (e.g. the average of the two costs). See Section 2.2 for a detailed description of the state space $S$ in our particular task.
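As a minimal illustration of such a combination, the sketch below averages hypothetical per-feature costs over a small discrete state space. The state names and cost values are made up; Section 2.2 defines the actual state space used in this thesis.

```python
# Hypothetical discrete state spaces; the actual ones are defined in Section 2.2.
S_D = ["d_personal", "d_social", "d_public"]     # distance states
S_B = ["b_small", "b_medium", "b_large"]         # bearing states (illustrative only)

# Hypothetical per-feature costs C_d : S_d -> R and C_b : S_b -> R.
C_D = {"d_personal": 0.3, "d_social": 0.1, "d_public": 0.6}
C_B = {"b_small": 0.0, "b_medium": 0.2, "b_large": 0.5}

def state_cost(s_d: str, s_b: str) -> float:
    """Combine the two feature costs; here simply their average."""
    return 0.5 * (C_D[s_d] + C_B[s_b])

# Cost table over the combined state space S = S_d x S_b.
C = {(s_d, s_b): state_cost(s_d, s_b) for s_d in S_D for s_b in S_B}
print(C[("d_social", "b_small")])   # 0.05
```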


(a) The TERESA robot in a simulation environment (b) Obstacle avoidance cost function. Higher costs are shown in red, and lower costs in blue.

Figure 3: Obstacle avoidance cost function used for navigation within the TERESA project.

These cost functions can be used by decision-theoretic solvers to derive an optimal policy, where the cost functions provide the cost of being in a certain state. The task of navigating a robot through a room, for example, uses a cost function for obstacle avoidance, with a high cost around objects and a low cost in areas without any obstacles (see Figure 3). In this figure, the colours surrounding the detected objects (i.e. the walls and the poles of the table) indicate the cost of the position of the robot. Given this cost function, a straightforward path planning algorithm can then be applied to calculate the route with the lowest cost.
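As an illustration of this last step, the sketch below runs Dijkstra's algorithm on a toy cost grid. This is not the planner used in the TERESA project, only a minimal example of planning over a cost map.

```python
import heapq

def lowest_cost_path(cost, start, goal):
    """Dijkstra on a 2D cost grid: the path cost is the sum of visited cell costs."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: cost[start[0]][start[1]]}
    prev = {}
    queue = [(dist[start], start)]
    while queue:
        d, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale queue entry
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(queue, (nd, (nr, nc)))
    # Reconstruct the route by walking back from the goal.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy cost map: 9 marks a high-cost buffer around an obstacle, 1 is free space.
grid = [[1, 1, 1, 1],
        [1, 9, 9, 1],
        [1, 1, 9, 1],
        [1, 1, 1, 1]]
print(lowest_cost_path(grid, (0, 0), (3, 3)))
```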

When reasoning within a human-populated environment, however, things get significantly more complicated, as we cannot fully prescribe the rules according to which the agent should behave around humans. For obstacle avoidance, an obstacle buffer cost can be defined so that the autonomous agent can safely navigate around obstacles without hitting anything. When working in a human-populated environment, however, the social acceptability of the behaviour to the surrounding humans should also be taken into consideration.

Social navigation is a very active research area that also incorporates social cost functions to allow for safe and socially intelligent navigation in a human-populated environment.

Social cost functions could, for example, be used to incorporate human interactions into the planning phase, where methods learning from expert examples, such as inverse reinforcement learning, could be used [23]. During inverse reinforcement learning, demonstrations by an expert are used to learn the robot's behaviour when faced with different state features. Here, we try to


extract a policy that mimics the observed behaviour (i.e. the demonstrations, which are considered to be the optimal behaviour) and represent it as a cost function that describes a measurement of social correctness (i.e. what behaviours should be performed when acting in a human-populated environment and facing the different state features).

Another method is to add a personal space cost function to assist the planner in avoiding the personal space of a person, where, for example, Gaussian functions centered on the person can be used to define the social cost [17].

In our problem, however, we are not dealing with a navigation task but rather with the positioning of the robot during conversation. Here, not only the distance $d(T, R)$ is of importance but also the bearing $\varphi$ of the robot $R$ with respect to the interaction target $T$.

The social cost functions will in our particular case be learned from instantaneous feedback as well as delayed feedback given over an episode. In the instantaneous feedback case, we have key presses indicating negative feedback for a specific state. This negative feedback over states is then used to provide a mapping from states to costs. In the delayed feedback case, on the other hand, a grade based on the social correctness of the agent is given over an episode containing multiple states. In that case, we assign the grade that was given in that episode to all of the visited states, in order to provide a mapping from states to costs.

2.2 Defining the state space

Before we can extract the social cost functions, we first need to define the state space for our task. Previous research by our partners at the University of Twente [34] has already shown that the relevant features for the positioning of the TERESA robot during conversation are given by the bearing and distance relative to the interaction target. Using these features, a valid state representation of the robot in the conversation setting could in principle be defined as the real-valued distance and bearing towards the interaction target. It has, however, been shown that due to the so-called "curse of dimensionality", working with a continuous state space in real-world robotic applications can lead to intractable problems for decision-theoretic solvers [10].

A solution to this problem is to discretize the state space. An example of a popular discretized environment is the grid world, where the state space consists of the cells within a grid (see Figure 4). In this world, the task of the robot R might be to approach the goal G, where the agent can only be located in one of the cells.

Between humans there are already certain socially defined distance ranges.


Figure 4: Example of a discretized environment (grid world)

Take for example the phrase "You are invading my personal space", which indicates that there is a boundary to what you consider to be your personal space. The distance boundary of what we consider to be comfortable depends on the nature of the relation between the participants in a conversation. Two very good friends, for example, might want to conduct a conversation while standing very close together. When the participants do not know each other, on the other hand, this same distance might be uncomfortable. The study of proxemics [12] has identified a discrete set of distance intervals for different types of interaction between humans. These distances are defined as the personal, social, and public proxemic zones.

This lends support to the hypothesis that a similar discrete representation of the distance space would be a natural basis for the domain of this variable within our social conversation task. These identified proxemic zones, however, rely on the interaction taking place directly between people rather than through a robot. To overcome this issue, we have attempted to learn a discretized proxemics-based representation of the distance space for a human interacting with a telepresence robot.

During our experimental runs (see Section 2.3), we collected the distances between the robot and the interaction target, as measured by the robot's laser range-finders. For the purpose of state space identification, the relevant portion of this dataset concerns the segments of the experimental runs where we asked the subjects to approach the robot to what they considered to be the normal, minimum and maximum interaction distances. Note that these interaction distances will be used as a reference for the autonomous control of the robot. In order to identify the interaction ranges, we extracted the set of final distances of the subjects after their approaches to the robot as $P = \{p_1, \ldots, p_N\}$ with $p_i \in \mathbb{R}$. Here, the final distance is defined as the distance between the interaction target and the robot after the participant finished their approach towards the robot. Based on the final distances of these approaches $P$, we ran $K$-means clustering [13] with $K = 3$ on dataset $P$ to identify the centroids of these interaction distances. The centroids of the resulting clusters, based on 31 approaches,


are as follows:

$$\rho_{\text{close}} = 0.71\,\mathrm{m}, \qquad \rho_{\text{normal}} = 1.16\,\mathrm{m}, \qquad \rho_{\text{far}} = 1.68\,\mathrm{m}.$$

We then define the discrete distance state of the robot during social conversation as $S_d = \{d_{\text{personal}}, d_{\text{social}}, d_{\text{public}}\}$. Let $F_\rho : \{d_{\text{personal}}, d_{\text{social}}, d_{\text{public}}\} \to \{\rho_{\text{close}}, \rho_{\text{normal}}, \rho_{\text{far}}\}$ be a bijection with $F_\rho(d_{\text{personal}}) = \rho_{\text{close}}$, $F_\rho(d_{\text{social}}) = \rho_{\text{normal}}$ and $F_\rho(d_{\text{public}}) = \rho_{\text{far}}$. Then, during execution, the distance state of the robot can be determined as:

$$\operatorname*{arg\,min}_{s_d \in S_d} \; \left|\rho_t - F_\rho(s_d)\right| \qquad (3)$$

where $\rho_t$ is the real-valued distance at time $t$.

Next to the discretized distance state space based on the identified interaction ranges, we also discretize the bearing relative to the interaction target, in order to extract a discrete social cost function for the bearing that can be used by decision-theoretic solvers during planning. A straightforward solution would be to divide the full angular range into equally-sized intervals. This would, however, lead (depending on the size of these intervals) to either many states with high precision or a small number of states with low precision. One could argue, however, that because we are working with a telepresence robot, the front of the robot should have high precision because of the viewability of the screen. The social cost of the interaction target standing beside or behind the robot, on the other hand, should not differ much, as in those cases the screen cannot be viewed, which makes teleconferencing impossible.

Based on this idea, we have divided the angular range in front of the robot into steps of 10 degrees in width, up to 50 degrees. That is, let

$$\Phi := \{-50, -40, -30, -20, -10, 0, 10, 20, 30, 40, 50\}$$

and $F_\phi : \{s_{b_i}\}_{i=0,\ldots,10} \to \Phi$, where $\{s_{b_i}\}_{i=0,\ldots,10} =: S_b$ is the discrete bearing state space. Then the bearing state at time $t$ is defined as:

$$\operatorname*{arg\,min}_{s_b \in S_b} \; \left|F_\phi(s_b) - \deg(\varphi_t)\right| \qquad (4)$$

where $\varphi_t$ is the bearing at time $t$, and $\deg(\varphi_t) = \varphi_t \times \frac{180}{\pi}$.

Another important fact regarding the bearing is that the robot is symmetric, and the conversation task, a priori, has no reason to favour one angular direction over the other. We therefore assume that the social cost should also be symmetric with respect to the bearing (i.e. the cost at the left of the robot should equal the cost at the right of the robot). This requires less experimental data and removes possible biases during the experiments. During execution, however, the actions of the robot naturally should depend on the actual direction of the interaction target. In Figure 5, the different bearing states are represented in grayscale, and this symmetry can also be seen.

Figure 5: Angular state representation in different shades of gray

In order to get the full state space of the robot, the state spaces based on distance $S_d$ and bearing $S_b$ are combined to form the state space $S$. This combination of the possible distance states in $S_d$ and bearing states in $S_b$ provides a total of 18 states in our state space $S = S_d \times S_b$.
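Putting Equations 3 and 4 together, the sketch below clusters final approach distances with K-means and assigns the discrete distance and bearing states. The centroid values are the ones reported in Section 2.3.2; the use of scikit-learn, the function names and the example inputs are illustrative, and the bearing states are folded onto one side using the symmetry assumption (which matches the 18-state count above).

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_distance_centroids(final_distances):
    """Identify rho_close, rho_normal and rho_far with K-means (K = 3)."""
    km = KMeans(n_clusters=3, n_init=10).fit(np.asarray(final_distances).reshape(-1, 1))
    return np.sort(km.cluster_centers_.ravel())  # ascending: close, normal, far

# Centroids reported in Section 2.3.2; Eq. (3) assigns the distance state.
RHO = {"d_personal": 0.71, "d_social": 1.16, "d_public": 1.68}
# Eq. (4): bearing states at 10-degree intervals up to 50 degrees (symmetric).
PHI = {f"b_{a}": a for a in range(0, 60, 10)}

def distance_state(rho_t: float) -> str:
    return min(RHO, key=lambda s: abs(rho_t - RHO[s]))

def bearing_state(phi_t: float) -> str:
    deg = abs(np.degrees(phi_t))             # symmetry: left and right are identical
    return min(PHI, key=lambda s: abs(PHI[s] - deg))

# Example: a target 1.3 m away at a bearing of -0.4 rad (~ -23 degrees).
print(distance_state(1.3), bearing_state(-0.4))   # ('d_social', 'b_20')
```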

2.3 Initial experiment

First, an initial experiment was conducted to learn social cost functions for the TERESA robot based on instantaneous negative feedback. These social cost functions are used to provide a performance measurement for the robot in a social setting. For instance, we want to learn at what distance the robot should approach the interaction target in different situations, e.g. whether or not hearing problems are detected. The detection of these kinds of situations is part of the research of our partners at the University of Twente. We, however, want to learn what the optimal distance with respect to the interaction target is when such a situation occurs. Another important feature for social positioning, as indicated by our partners [34], is the bearing of the robot with respect to the interaction target during a conversation.

Given these two features (i.e. distance and bearing), we want to learn for each state $s \in S$ how socially acceptable it is to be in that state. This measurement of correctness will be extracted from negative feedback provided by the pilot, and will be turned into cost functions so that it can be used by decision-theoretic controllers.

Figure 6: Layout of the initial experiment

2.3.1 Experimental set-up

In the initial experiment, a "Wizard of Oz" setup is used. In this type of setup, the robot is controlled by an expert located in an isolated area. In our experiment, the expert acts with different degrees of quality in order to trigger a response of the pilot to the behaviour of the robot. The layout of the experiment (two rooms with an additional isolated area for the expert) can be seen in Figure 6. In this figure and throughout the rest of the thesis, the following definitions are used:

- pilot: the person sitting behind the computer and communicating through the robot,

- interaction target: the person physically present with the robot and communicating with the pilot,

- experimenter: the person setting up the robot and instructing the participants,

- expert: the person controlling the robot.

As can be seen in the experiment layout, an additional camera is installed in the interaction room; this camera is used to provide the expert with


a better overview of the room and is additionally used for later validation of the recorded experimental data.

The pilot is asked to give feedback when the robot is doing something which is socially unacceptable for him/her. This feedback is given by pressing a button on the keyboard located in front of the pilot.

During the experiment, we alternate between two different situations. In the first of these, the interaction target is asked to approach the robot as described in Section 2.2. Alternatively, the expert controlling the robot approached the interaction target with different degrees of quality. After the approaches, the visitor and interaction target continued the conversation, during which the expert executed various socially acceptable and non-acceptable behaviours. Acceptable behaviours include maintaining the correct body pose and correctly changing the distance based on social cues (e.g. hearing problems), whereas the non-acceptable behaviours include suddenly turning away and randomly changing the distance to the interaction target.

In order to explore how the robot should react when a social cue is triggered (e.g. hearing problems), artificial background noise (i.e. the sounds of a cafeteria) was introduced during the experiment to simulate such a situation. At the end of the experiment, an interview was also conducted to obtain some additional insight into the behaviour of the robot.

2.3.2 Results

The first result from our initial experiment is the set of identified centroids $\rho$ of the distance ranges for our distance states $s_d \in S_d$, as discussed in Section 2.2. Here, the final distances of the approaches by the interaction targets $P$ were used to define our state space based on the Euclidean distance relative to the interaction target, $S_d$. This process was required because we did not know at what distance the robot should approach the interaction target for the different interaction ranges (i.e. close, normal and far).

Given these centroids $\rho$, the distance states $s_d \in S_d$ can be defined and visualized as shown in Figure 7a. In this figure, the interaction target is indicated by the purple human-like shape, where the arrow indicates its orientation. The different distance states $s_d \in S_d$, with $S_d = \{d_{\text{personal}}, d_{\text{social}}, d_{\text{public}}\}$, are indicated by the circles surrounding the interaction target in orange, yellow and red respectively. Note that the red area extends throughout the entire operational environment of the robot, but that the plot has been cut off to a square of $\rho_{\text{far}}$ by $\rho_{\text{far}}$ for visualization purposes.

(a) Normal situation (b) Hearing problems detected

Figure 7: State representation for the distance

In the presence of background noise, the interaction targets approached the robot significantly closer than their normal approach distance. Here, the participants did not receive any explicit instructions but rather acted on the presence of the artificially introduced background noise (i.e. the sounds of a cafeteria). All of the interaction targets (re-)approached the robot at a distance between their normal approach distance and the distance they defined as close but still socially acceptable. Also, because we did not instruct the participants to approach the robot at a certain distance range, we know that the distance at which they approached the robot (while the background noise was being played) should be their normal approach distance.

Given the final distances of these approaches in the presence of background noise, $P_{\text{noise}} = \{p_1, \ldots, p_N\}$ with $p_i \in \mathbb{R}$, we can update the centroid $\rho_{\text{normal}}$ to identify the proxemics-based distance ranges when hearing problems are detected. The centroid $\rho'_{\text{normal}}$ is calculated as the average distance of the approaches in $P_{\text{noise}}$ (i.e. $\rho'_{\text{normal}} = \operatorname{avg}(P_{\text{noise}}) = 0.75\,\mathrm{m}$), whereas the centroids $\rho_{\text{close}}$ and $\rho_{\text{far}}$ are not altered (i.e. $\rho'_{\text{close}} = \rho_{\text{close}}$ and $\rho'_{\text{far}} = \rho_{\text{far}}$).

$$\rho'_{\text{close}} = 0.71\,\mathrm{m}, \qquad \rho'_{\text{normal}} = 0.75\,\mathrm{m}, \qquad \rho'_{\text{far}} = 1.68\,\mathrm{m}.$$

When visualizing the distance ranges (see Figure 7b), one can see that the ranges are located significantly closer to the interaction target in the presence of background noise. This agrees with our intuition that the robot should approach the interaction target more closely when the interaction target has difficulties hearing the pilot. These distance ranges can again be used by the robot to know at which distance it should approach the interaction target when hearing problems are detected.

For the second result from our initial experiment, we focus on extracting social cost functions from the negative feedback provided by the pilots. In our experiments, the test participants pressed the negative feedback button a total of 65 times. As not every test subject pressed the feedback button an equal number of times, normalization is applied to


distribute the weights of the key presses more fairly. This means that the 2 key presses of one participant are assumed to be as informative as the 8 key presses of another participant.
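A minimal sketch of this normalization, with hypothetical state labels: each pilot's presses are scaled so that every pilot contributes the same total weight, regardless of how often they pressed.

```python
from collections import Counter, defaultdict

def normalised_feedback(key_presses):
    """key_presses: list of (participant_id, state) pairs, one per button press.

    Each participant's presses are weighted so that every participant
    contributes a total weight of 1, regardless of how often they pressed.
    """
    per_participant = defaultdict(Counter)
    for pilot, state in key_presses:
        per_participant[pilot][state] += 1

    weights = Counter()
    for pilot, counts in per_participant.items():
        total = sum(counts.values())
        for state, n in counts.items():
            weights[state] += n / total
    return weights

# Toy example: pilot A presses twice, pilot B presses eight times.
presses = [("A", ("d_public", "b_0"))] * 2 + [("B", ("d_public", "b_20"))] * 8
print(normalised_feedback(presses))  # each pilot contributes weight 1 in total
```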

(a) Normalised number of key presses (b) Cost function based on distance

Figure 8: Negative feedback from the pilots

In Figure 8a, the normalised negative feedback (button presses) of the pilots is plotted against the states of the robot during those key presses. As can be seen from this figure, there is a very sparse distribution of key presses over the states. The absence of key presses would indicate that the robot is performing in a socially appropriate way. However, our intuition tells us that having a larger angular offset towards the interaction target should correlate with negative feedback. This intuition is also backed up by the findings of our partners [34] and will also be shown in Section 2.4.2.

However, when we look purely at the distribution of explicit feedback over the different distances (the key presses for the different angular offsets summed together), a clear distribution emerges that our intuition also supports. In Figure 8b, the distance of the robot relative to the interaction target is plotted against the normalised explicit feedback. Here we can see that, based on the explicit feedback, being in the interaction range identified as far yields a relatively high probability of being in a socially unacceptable situation. Being in the interaction range identified as the normal conversation distance, on the other hand, yields the smallest probability of being in a socially unacceptable situation.

(a) Pre-approach (b) Approach finished

Figure 9: Social cost based on distance from the interaction target

This distribution can be used as a cost function for the Euclidean distance of the interaction target relative to the robot, and is visualized in Figure 9. In Figure 9a, we show an approach to the interaction target that was performed while the robot was being controlled by the expert. The distance at which the expert finished the approach can be seen in Figure 9b, and lies within the normal distance state, where our defined cost function also gives the lowest cost.

To conclude, we obtained the following two results from our initial experiment: 1) we identified the centroids of the distance ranges for our distance state space $S_d$, and 2) we extracted a social cost function based on the Euclidean distance relative to the interaction target. With these results, we obtained a performance measurement with respect to the distance relative to the interaction target, where we can see that we obtain the highest reward (i.e. lowest cost) when the distance lies within the normal interaction range.

2.4 Group conversations

Our partners at the University of Twente have already conducted an experiment that focused on the social positioning of the TERESA robot within a group conversation setting [34]. During this experiment, our partners focused on studying the proxemics of the social conversation task through the theory of F-formations [9] (i.e. different spatial arrangements of persons during group conversations) for situations with multiple interaction targets. Although these experiments did not include instantaneous feedback from the participants (as we had in our experiments described in Section 2.3), they did include an a posteriori evaluation of the quality of the interaction behaviours at the end of each episode.

One of their findings is that, in a single interaction target setting, a larger bearing towards the interaction target correlates negatively with the social correctness of the robot. This would mean that for a single interaction target, maintaining as small a bearing as possible would always be the


socially optimal behaviour with respect to the bearing. The bearing alone, however, does not describe the full cost function for the social correctness of the positioning of the robot (i.e. it also includes the cost for the distance). Here, the relative utility of the different bearing states will influence the approach policy of the robot. For example, with a high cost on the bearing and a low cost on the distance, we expect that the robot will first turn towards the interaction target (i.e. to minimize its bearing cost) and afterwards drive to the optimal distance from the interaction target. With a high cost on the distance and a low cost on the bearing, on the other hand, the robot might first drive towards the interaction target and only afterwards minimize its bearing, so as to obtain a faster trajectory at the expense of being turned away from the interaction target for a short period of time.

In this section, we focus on extracting social cost functions for the bearing within a group conversation setting, given the dataset provided by our partners. Additionally, we perform a statistical analysis on the relevance of the state features.

A detailed description of the experimental setup and the results that were extracted from the experiment by our partners can be found in Deliverable 3.1 [34] of the TERESA project.

2.4.1 Experiment overview

The experiment performed by our partners revolved around the social positioning of the TERESA robot in a group conversation setting. The group conversation consisted of 3 interaction targets and the robot itself. The interaction targets were located at pre-indicated positions, whereas the robot was allowed to move around freely. In this experiment, the participants received the task of solving a murder case, where the controller of the robot could retrieve additional clues by driving to indicated areas in the room. An overview of the interaction area, including the different areas, is given in Figure 10. During the experiment, 9 approach/converse/retreat cycles per group were performed, with a total of 14 groups. At the end of each cycle, the users scored the quality of the interaction on a 1-7 numerical scale, where 7 represents an interaction that is completely socially acceptable and 1 a socially unacceptable interaction. This means that we do not have explicit feedback on the social correctness of a certain state and have to generalize the grade over the states the robot visited during that interaction phase.

The tracking of the interaction targets and the robot was done by an OptiTrack motion capture system⁴ using 8 infrared cameras, giving very high-accuracy localization. The available data includes the positions as well as the orientations of the interaction targets and the robot at a high frequency during the whole experiment. This data is also annotated, indicating the time steps of the different cycles and the grades belonging to those cycles.

⁴www.naturalpoint.com/optitrack/

Figure 10: Overview of the interaction area. The positions of the Interaction Targets (IT1, IT2, IT3) are indicated on the circle in the middle; these were projected using a projector mounted to the ceiling, but only in between the approach/converse/retreat cycles. The rectangles near the border of the interaction area indicate the positions of the markers A-H. C1 and C2 indicate the positions of the cameras. Note: figure and description are reprinted from Deliverable 3.1 [34] of the TERESA project.

2.4.2 Results

In our research we are interested in the social correctness during the conversation phase; the other two interaction phases of the cycle (i.e. the approach and retreat phases) are therefore not processed in this section. Also, because the grades are given over an entire interaction phase, we have to generalize each grade over the states visited during that interaction phase. This generalization is done by creating a dictionary over the states, where each time a state was visited in an interaction phase, the grade for that interaction phase (which was given after the cycle had ended) is added to that state's entry. The social correctness values of different states are compared based on this dictionary.
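A minimal sketch of this generalization step follows; the dictionary layout, function names and the derived cost are illustrative, and the actual grades come from the annotated dataset.

```python
from collections import defaultdict

def grades_per_state(episodes):
    """episodes: list of (visited_states, grade) pairs, one per converse phase.

    Every visit of a state in an episode adds that episode's grade (1-7)
    to the state's entry, as described above.
    """
    table = defaultdict(list)
    for visited_states, grade in episodes:
        for state in visited_states:
            table[state].append(grade)
    return table

def bearing_cost(table, state, max_grade=7):
    """Cost of a state: highest possible grade minus its expected grade."""
    grades = table[state]
    return max_grade - sum(grades) / len(grades)

# Toy example with two converse phases.
eps = [((("d_social", "b_0"), ("d_social", "b_20")), 6),
       ((("d_social", "b_20"), ("d_public", "b_50")), 3)]
table = grades_per_state(eps)
print(bearing_cost(table, ("d_social", "b_20")))   # 7 - 4.5 = 2.5
```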

This experiment was performed on group conversations, whereas our state representation $S_b$ expects a single interaction target. Therefore, in order to extract a social cost function for the positioning of the robot with respect to a single interaction target within a group conversation, the group conversation


setting should be converted to our state space $S_b$. In this section, we discuss and compare two possible methods for this conversion:

1. Process the position and grade for each person individually

2. Use the position of the center of the group and combine their grades

(a) Individual bearing state (b) Group bearing state

Figure 11: Cumulative probabilities over grades as a function of the bearing state (a) of each individual participant and (b) with respect to the geometric center of the group. The probabilities are represented in grayscale, where a value of 1 corresponds to white (the cumulative probability of score 7 is always 1) and a value of 0 corresponds to black. The expected value of the grade for each class is denoted by a blue diamond.

In Figure 11, the cumulative probabilities of the grades are plotted against the angular states for both methods, where the expected value of the grade for each bearing state is indicated by the blue diamond shapes.

The first method would assume that the participants grade the conversation based purely on self-interest. However, because we are dealing with a group conversation, it is possible that the participants graded the social correctness of the robot with respect to the well-being of the group. To test this, we set the following null hypothesis:

H0 := The bearing with respect to the position of an individual participant makes no difference in the grade regarding the social correctness of the positioning of the robot in a group conversation setting.


More concretely, H0 states that the grades given for each bearing state in $S_b$ have the same mean when considering each participant individually.

To test this null hypothesis, we ran the Kruskal-Wallis test, which was chosen because of the different sample sizes of grades given per state. This test checks whether samples originate from the same distribution and provides a p-value which indicates the significance level of the result. A null hypothesis is rejected when the p-value is below 0.05, indicating that the samples came from different distributions and that the difference is not due to random sampling. A p-value of 0.05 or higher, on the other hand, indicates that there is no reason to reject the idea that the samples originate from the same distribution. Note that when the null hypothesis is not rejected, this does not imply that the samples must therefore originate from the same distribution.

The result of the Kruskal-Wallis test for hypothesis H0 is p = 0.14, which means that hypothesis H0 cannot be rejected. We can therefore say that the participants in this experiment might have graded the social correctness of the robot not purely based on self-interest.
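For reference, such a test can be run with SciPy on per-state grade samples; the numbers below are made up and only illustrate the procedure.

```python
from scipy import stats

# Hypothetical grade samples (1-7) grouped per individual bearing state;
# the real samples come from the annotated group-conversation dataset.
grades_by_state = {
    "b_0":  [6, 7, 5, 6, 6],
    "b_10": [5, 6, 6, 7],
    "b_30": [6, 5, 7, 6, 5, 6],
    "b_50": [5, 6, 4, 6],
}

# Kruskal-Wallis H-test: do the per-state grade samples share a distribution?
h_stat, p_value = stats.kruskal(*grades_by_state.values())
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: the bearing state makes a difference in the grades.")
else:
    print("Cannot reject H0.")
```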

The second method already takes into consideration that the participants are not grading solely based on self-interest. However, one could argue that perhaps the bearing of the robot relative to the center of the group has no influence on the grade at all. To test this, we set the following hypothesis:

H1 := The bearing of the robot relative to the position of the center of the group makes no difference in the grades given by the participants.

The Kruskal-Wallis test based on this hypothesis gives p = 1.36e-4, which means that this hypothesis is rejected and that the bearing relative to the center of the group does make a difference in the grades given by the participants.

In the case of method 2, the center of the group can be used because the theory of F-formations suggests that, in group conversations involving three or more persons, the participants tend to position themselves facing the geometric center of the convex hull of their formation [16]. In Figure 12, the positions of the participants and the robot are plotted with an overlay of the three social spaces of a circular F-formation: the o-space, p-space and r-space. Note that the social spaces are not calculated but rather illustrate the fact that the participants were positioned in a circular F-formation during this experiment.

(a) Raw positioning data (b) Social spaces of an F-formation overlaid on the positions of the participants and the robot

Figure 12: F-formation during the experiment (robot in red, participants in blue, green and yellow, and the center of the group denoted by a cross)

(a) Angular cost for method 1 (b) Angular cost for method 2

Figure 13: Social cost based on the bearing of the robot

In Figure 13, the social costs for both methods based on the bearing are plotted. The cost is calculated as the expected value of the grade subtracted from the highest possible grade for that bearing state. Figure 13a displays the cost based on the grades for the bearing states of each participant individually. Here, the cost seems to go down as the bearing towards the interaction target becomes larger; this is very counter-intuitive and supports hypothesis H0, under which we argued that the participants might not have graded the performance of the robot purely based on self-interest. In Figure 13b, on the other hand, we can see that the cost goes up as the bearing towards the center of the group becomes larger. This agrees with our own intuition and shows that the bearing relative to the center of the group is a good state feature in the group conversation setting.


3 Decision making frameworks

In previous experiments of the TERESA project [35], we observed that the low-level controller in charge of maintaining the correct bearing towards the interaction target showed unexpected behaviour due to noise in the detection and identification of the interaction target. The initial controller assumed the interaction target to be the person closest to the robot. This assumption, however, led to the issue of the robot turning away whenever it received a false positive detection closer than its previous detection. This issue became especially apparent when someone walked closely behind the robot and the robot started turning away from an optimal bearing relative to the interaction target to an optimal bearing towards the person behind the robot. In order to overcome this issue, we explore and compare two different frameworks which allow for more robust decision making while maintaining the assumption that the interaction target is the person closest to the robot. Note that this issue could in principle be resolved by installing an external high-accuracy tracking system (e.g. the OptiTrack motion capture system) with which the interaction target is always correctly detected and identified. In this project, however, we want to find a solution that uses the sensors of the robot to detect the interaction target. These sensors are simply not perfect, leading to uncertainty during decision making, and this requires us to use more robust decision-making frameworks to solve this task.

The first framework that we will discuss is the Markov Decision Process (MDP) [2], a mathematically well-defined model for sequential decision making under uncertainty. The second framework is the recurrent neural network, which is trained on expert examples. The idea is that these frameworks are capable of filtering out the previously mentioned false positives and allow for a more robust low-level controller for the task of maintaining the correct bearing towards the interaction target.

The main difference between these two frameworks is that the first framework first learns a model of the environment (for a partially observable setting usually represented as a Partially Observable Markov Decision Process) and then solves this model to find the optimal policy. The recurrent neural network, on the other hand, does not first learn a model of the environment but rather learns a model end-to-end by minimizing an objective function (i.e. the difference between its own prediction and the expert example). Our initial assumption is that the recurrent neural network will perform better, as it does not require the difficult task of building a model of the environment and can therefore learn a model that is more optimised for the task at hand. Note that although we suspect the RNN to outperform the POMDP, learning a


model of the environment has the benefit that it can be used to solve multiple tasks by adjusting its reward function.

3.1 Background

3.1.1 Markov Decision Processes

When we make choices, we often need to make a trade-off between immediate rewards and possible future gain. These kinds of trade-offs are present in our everyday lives, like deciding to eat at your local snack bar (high immediate reward) or to eat a healthy salad instead (better in the long run). These decisions, however, also bring uncertainty with them, which makes reasoning about which decision to take significantly more complex, e.g. you might get an unexpected allergic reaction from one of the ingredients of the salad.

Markov Decision Processes (MDPs) are designed to model these kinds of problems in uncertain environments and to find an optimal policy which balances immediate and future rewards given state transition probabilities. The policy (i.e. the solution of an MDP) that is extracted from the MDP provides a mapping from states to actions. An optimal policy provides the best action to take for each state.

The MDP model contains a discrete set of states $S$, a discrete set of actions $A$, a reward function $R$, a transition function $T$ and a discount factor $\gamma$. The model in its original form assumes full knowledge of the state of the environment; however, extensions to the model have been made which also allow for partial observability, as will be discussed later on. The reward function $R : S \times A \to \mathbb{R}$ specifies the immediate reward that the agent receives for performing action $a$ in state $s$. The transition function $T : S \times A \times S \to [0, 1]$ specifies the probability distribution over the next states, given the current state and the action that is performed. Finally, there is the discount factor $\gamma \in [0, 1)$, which represents the difference in importance between future rewards and present rewards.

The idea is to find a policy π for the decision maker that specifies amapping from the state space S to action space A (i.e. π(s) = a for a ∈ A,s ∈ S). Due to the uncertainty over future states, this policy typicallymaximizes the expected discounted reward (see Equation 5).

∞∑t=1

γtR(st, π(st))

(5)

where t indicates the time step and t = 1 refers to the current time step.

Note that in our task we are working with an infinite horizon as we do notknow at what time step our interaction with the participant is finished. The

23

Page 27: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

horizon h is here defined as the total number of decisions that the agent isallowed to take, in a finite horizon we know the fixed number of decisionsleft until termination (i.e. h − t with t ∈ [1, h]). In our task, we consideran infinite horizon, since the task does not have a fixed time limit and therobot should continue to take actions for as long as the interaction is takingplace. For simplicity and without loss of generality, we provide all equationsassuming an infinite horizon.

In order to measure the reward accumulated by policy π, a value functionV is defined to associate the expected objective value for each state S whilefollowing policy π (i.e. Vπ : S → R):

Vπ(s) = R(s, π(s)) + γ∑s′

T (s, π(s), s′)Vπ(s′) (6)

The Bellman optimality equation [1] states that the optimal value of a stateis given by the expected return of the best action in that state. Using thisexpression, the optimal value function V ∗(s) can be calculated as shown inequation 7.

V ∗(s) = maxa∈A

R(s, a) + γ

∑s′

T (s, a, s′)V ∗(s′)

(7)

This optimal value function V ∗(s) can then be used to extract the optimalpolicy π∗ by for each state taking the action with the highest expected reward.This provides us with a policy π∗ that maps each state s ∈ S to an actiona ∈ A that leads to the highest expected reward (see Equation 8).

π∗(s) = argmaxa∈A

R(s, a) + γ

∑s′∈S

T (s, a, s′)V ∗(s′)

(8)

We can now find the optimal policy π∗ of a problem that is defined asan MDP in a fully observable domain using dynamic programming. In ourtask however, we are working with a robot in a real-world environment, anddue to imperfect and noisy sensors of the robot, we are unable to observethe true state of the environment. Luckily, there is an extension to thepreviously defined MDP framework that allows for decision making withoutaccess to the true state, and is called a partially observable Markov decisionprocess (POMDP) [15]. To do this, the POMDP maintains a belief over thepossible states based on the observations that it received. This belief stateis a probability distribution over which state the robot could currently be in,and is updated after receiving a new observation.

24

Page 28: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

More formally, the POMDP contains the same elements as an MDP (i.e.a set of states S, a discrete set of actions A, a reward function R, a transitionfunction T and a discount factor γ), but in addition also contains a discreteset of observations O and an observation function O : O × S × A → [0, 1]that gives the probability of an observation o ∈ O given the state s′ ∈ S andaction a ∈ A.

As the true state is unknown, we maintain a belief state b which for eachstate s, gives the probability that state s is the true state (i.e. b(s) ∈ [0, 1] fors ∈ S with

∑s∈S b(s) = 1). Another difference with respect to the original

MDP is that the policy π is now a mapping from beliefs to actions ratherthan a mapping from states to actions (i.e. π(b) = a for a ∈ A).

The calculation for the expected discounted rewarded for an MDP de-pends on the fact that we know the true state. In the case of an POMDPhowever, as the true state is not observable, the probability for each states ∈ S is multiplied with its expected reward to provide an expected dis-counted reward over the belief state b (see Equation 9).

∞∑t=1

γt∑s∈S

bt(s)R(st, π(st))

(9)

When defining a problem as a POMDP, we start with some a priori beliefb0 in what state the agent will start (i.e. in our task this is a uniform dis-tribution over all the possible states). Upon receiving an new observation,this belief distribution is then updated using the state transition and obser-vation probabilities given the action a ∈ A that we were executing and theobservation o ∈ O that we received (see Equation 10).

bao(s′) =

Pr(s′|b, a, o)Pr(b|a, o)

=

O(a, s′, o)∑s∈S

b(s)T (s, a, s′)∑u′∈S

O(a, u′, o)∑u∈S

b(u)T (u, a, u′)for ∀s′ ∈ S (10)

Also the value function V for an POMDP is given over the belief state bas we do not have access to the true state. The value for a belief state iscalculated as the sum of the values for each state s ∈ S multiplied withtheir probabilities b(s) (see Equation 11). Note that in Equation 11 alsothe observation function O is add to calculate the expected reward for thepossible future states s′ ∈ S as the future state now depends on the received

25

Page 29: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

observation.

Vπ(b) =∑s∈S

b(s)

(R(s, π(s)) + γ

∑s′∈S

∑o∈O

T (s, π(s), s′)O(π(s), s′, o)Vπ(bπ(s)o )

)(11)

The optimal value of a belief state is then given by the value obtained bytaking the action that maximizes the expected return for that belief state(see Equation 12).

V ∗(b) = maxa∈A

∑s∈S

b(s)

(R(s, a) + γ

∑s′∈S

∑o∈O

T (s, a, s′)O(a, s′, o)V ∗(bao)

)(12)

In the MDP case, we generated our policy π∗ by applying value itera-tion for each state in our discrete state space S, providing us the optimalaction a ∈ A for each state s ∈ S. The POMDP on the hand, acts onbelief states where we want to find the best action a given a belief state b.However, because the belief space is continuous, it would be computationallyintractable to find an exact solution using value iteration in a infinite horizonsetting [21] (i.e. an infinite horizon POMDP has an infinite amount of pos-sible belief points making it impossible to calculate the value for all of thesepoints). Note that in the finite horizon case, this problem is not present aswe could plan starting at our initial belief b0 and enumerate over the possiblefuture belief points until we reached the plan horizon, and therefore do notrequire the entire belief-simplex to define the optimal value function.

It has been shown that in a finite horizon setting, V ∗ is piecewise-linearconvex (PWLC) function [7], and can be well approximated by a PWLCfunction for an infinite-horizon setting [6]. This value function can thereforebe represented by a set of α-vectors V that for each vector α ∈ V gives avalue for each state s ∈ S and is associated to an action a (see Equation 13).

V ∗(b) = maxα∈V

∑s∈S

b(s) · α(s)

(13)

Here, a single α-vector encodes the value of a state s ∈ S for a particularaction a ∈ A and observation o ∈ O over the belief space. Note that thevalue for a single belief point on the α-vectors can be calculated with thesame equation as used in Equation 11.

Given the value function represented by α-vectors, the policy based onthe belief state can be found by taking the α-vector that maximizes thevalue function V ∗(b) and mapping it to its associated action Γ : V → A (see

26

Page 30: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

Equation 14).

π(b) = Γ

(argmaxα∈V

b · α)

(14)

An example of the α-vectors for a two state POMDP is given in Figure14. In this example, we can see that when we are certain that we are in states0 (i.e. b = [1, 0]) we should take the action associated with the α-vectorα1 as this vector maximises the value function V for that particular beliefpoint. Note that the same principle also works with more states but that theα-vectors become hyperplanes over the belief space.

V (b)

α1 α2α3

α1 α2 α3[1, 0] [0, 1]

[b(s0), b(s1)]

Figure 14: Linear support of α-vectors for a two state POMDP

To find the policy, we could in principle generate all possible vectorsthat we could construct for every possible action a ∈ A and observationo ∈ O. Because for our policy we only need the α-vectors that maximize thevalue function, a α-vector that is completely dominated5 by other α-vectorscan be pruned away as it does not influence our resulting policy. This wouldprovide us with the solution for the α-vectors to represent our policy, and thismethod is also called the Monahan’s enumeration algorithm [22]. Althoughthis sounds good in theory, this method would not work for our task as weare working in an infinite horizon setting and not all reachable belief pointscan be calculated in a finite amount of time.

In an infinite horizon setting, an ε-approximate value function (i.e. V ∗(b) :V ∗(b) − V (b) ≤ ε for ∀b) could in principle be calculated by exact methodssuch as value iteration, but have however shown to scale poorly with respectto the number of states and observations [24]. Point based methods [30] onthe other hand, perform much better than optimal methods given these scal-ability issues, as they approximate the optimal value function by computingit only for a finite and representative set of belief points. These methodsfocus on the fact that given an initial belief, there is only a finite set ofreachable belief states, making calculating the optimal value function for the

5An α-vector is completely dominated when for each point in the belief space there isan other α-vector that leads to a higher value.

27

Page 31: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

entire belief space often unnecessary. Point based methods select a finitesubset of beliefs B, and iteratively update the value function V over B. Thevarious point based methods differ in the process of selection the belief setB and how the value function V is updated.

In this thesis, we will use the Perseus point based algorithm [32] whichselects B by random exploration in the belief space. This exploration is doneby sampling actions and observations leading to new beliefs for our beliefsset B. Based on this belief set B, we iteratively update value function Vwhere with each iteration we get closer to V ∗.

As will be discussed in more detail in Section 3.3, we have decided towork with an asynchronous execution rate for this task to relieve ourselvesfrom the task of manually assigning the execution rate. The main reasonfor this is that especially in robotics applications, a synchronous executionrate generally requires a large amount of fine tuning to get right. Practically,this means that we do not perform an action at a certain time interval butrather act in response to events triggers (i.e. observation changes). Thepreviously defined POMDP should therefore be extended to allow for theseevent changes to be modelled into the POMDP framework.

The event-driven POMDP is structurally almost identical the originalPOMDP but with the main difference in its observation function O. Thisfunction, rather than giving the probability of an observation o ∈ O given thestate s and action a, now gives the probability of an observation change giventhe corresponding state change and action. It could however be possible thatwe received an observation change, but that there was no corresponding statechange (i.e. observation changes caused due to the imperfect sensors of therobot). These are the false positives that we want to ignore, and is possibledue to fact that these events will yield a low probability of actually occurring(i.e. the interaction target can not suddenly appear on the opposite side ofthe robot). There is however another possibility, namely that the interactiontarget actually changed its position but the sensors did not detect it. Here,we would like to model the possible sequences of intermediate states thatmight have be visited unknowingly due to the false positive detection.

To model this, we add an additional symbol for false negative event de-tections of to our observation set O = O ∪ of . Also, note that the observa-tions o ∈ O in the event driven POMDP refer to observation changes ratherthan direct observations as was the case with the synchronous POMDP.The observation function for the event driven POMDP therefore becomes:O : O × S × S × A → [0, 1], where this function gives the probabilities ofperceiving observation o ∈ O given states s ∈ S, s′ ∈ S and action a ∈ Asuch that O(a, s, s′, o) = Pr(o|s, s′, a). Note that the states s ∈ S and s′ ∈ Sindicate the state transition events from state s to state s′.

28

Page 32: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

Although by definition the agent will never receive a false negative ob-servation, value iteration does take this additional symbol into considerationduring planning. Effectively allowing the agent to reason over the interme-diate states that might have be visited unknowingly due to the false positivedetections. During planning, a new action is selected at every event trigger.However, because the agent can not receive false negative observations, theagent is also incapable of changing its action during those event triggers.

This leads to an additional constrained to the system, namely that whenwe receive of we cannot change our action. By adding a constraint gener-ation function C(a, o) that for a given action a and observation o returnsa constraint action set to our planner (see Equation Equation 15), we canplan over observation change events containing the false negative observationevent of .

C(a, o) =

a, if o = of

A if o ∈ O\of(15)

To solve the event driven POMDP, we use the Constraint-Compliant Perseus(CC-Perseus) algorithm [10] which is an adaptation to the previously definedPerseus algorithm to event driven POMDPs.

3.1.2 Recurrent Neural Networks

First we provide a short introduction on artificial neural networks (ANNs),in this introduction we will describe the architecture of neural networks andshow how these networks can be trained. Once we have a clear understandingof neural networks, we will describe how this network can be extended intorecurrent neural networks (RNNs) and describe our motivation to work witha recurrent network rather than a standard neural network. Finally, we willdescribe Long Short Term Memory (LSTM) which is again an extension torecurrent neural networks and is the network architecture used throughoutthis thesis.

Neural networks are typically organized in layers, where each layer con-tains nodes that are interconnected with the nodes of their adjacent lay-ers. The connections between these nodes are all weighted, where the initialweights are initialized arbitrarily and later updated using backpropagation(i.e. performing gradient descent based on backward propagation of the er-ror). In such types of networks, information is always fed forward and neverbackwards (i.e. we do not allow loops), and is therefore also called a feedfor-ward neural network (see Figure 15) [4].

In order to get a better understanding of how the different parts of aneural network interact, we give a step-by-step overview of the different parts

29

Page 33: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

x1

x2

x3

z1

z2

z3

z4

h1

h2

h3

h4

y1

y2

z′1

z′2

Hiddenlayer

Inputlayer

Outputlayer

wij w′ij

x h yw w′

Figure 15: An example of a feedforward neural network

for calculating the output y (i.e. applying forward propagation) and thenapply backpropagation to update the weights (w and w′) of the networkbased on the error of the previously computed output.

The architecture of the neural network shown in Figure 15 will be used asan example for our calculations. The input layer of this network (indicatedin green) consists of 3 nodes where each node corresponds to an elementof the input vector x (i.e. x = [x1, x2, x3]). The nodes in the hidden andoutput layer also contain an additional activation function that introducesnon-linearity into the network and also helps to stabilize the network6 Thefirst step to calculate the output of the network is to calculate the pre-activityof the hidden layer by multiplying the input x by the weights w (see equation16).

[x1 x2 x3

w11 w12 w13 w14

w21 w22 w23 w24

w31 w32 w33 w34

=[z1 z2 z3 z4

](16)

Note that we here go through the network for a single data point only, andthat you in principle could also change the input x to a matrix of n × 3where n is the number of data points to process multiple data points atonce. Given the pre-activity of the hidden layer z, an activation function isindividually applied on zi ∈ z to obtain the activation of the hidden layer

6The nodes in the hidden and output layer can also contain a bias term b to shiftthe activation function to either the left or right. This term is arbitrarily initialised andtrained in the same manner as the weights of the connections between nodes.

30

Page 34: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

(see equation 17).

f(z) =[f(z1) f(z2) f(z3) f(z4)

]=[h1 h2 h3 h4

](17)

Common choices for activation functions are hyperbolic tangent (tanh) func-tions, sigmoid functions, and rectified linear units (ReLUs). In this example,we will use a sigmoid activation function as shown in Equation 18. The sig-moid function is bounded (i.e it limits the output to a value between 0 and1), easily differentiable (see Equation 19) and is a monotonically increasingfunction, making them easy to handle and the activation function of choicefor this example.

f(x) = σ(x) =1

1 + e−x(18)

d

dxf(x) =

d

dxσ(x) = σ(x)(1− σ(x)) (19)

Now that we know the activity for the hidden layer h, we can calculatethe pre-activity for the output layer z′ by multiplying the activity for thehidden layer h with the weights w′ (see Equation 20).

[h1 h2 h3 h4

w′11 w′12

w′21 w′22

w′31 w′32

w′41 w′42

=[z′1 z′2

](20)

Finally, the output y can be calculated by applying the activation functionon the pre-activity of the output layer z′:

f(z) =[f(z′1) f(z′2)

]=[y1 y2

]. (21)

We have now completed a forward pass through the network, and obtainedthe output of the network y given arbitrarily initialized weights w and w′.The output y would however at this point not make much sense as the weightsare not yet trained. In order to get a sensible output from the network, weneed to update the weights based on the error of the forward pass. This errorJ is given as the sum of the difference in prediction y given the true label y:

J =1

2

|y|∑j=1

(yj − yj)2. (22)

31

Page 35: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

Here, a quadratic term is add to make sure the error function is convex andthat gradient descent can leads us to the global minimal error. Additionally, a12

is add to simplify the equation later on and does not influence the learningprocess because we are interested in minimizing the error function ratherthan the real valued error.

In order to know how much each individual weight contributed to thefinal error, we need to calculate the partial derivatives for the weights w andw′ given the error J . We start with the partial derivatives for the weightsclosest to the output (i.e. w′), and can be calculated as shown in Equation23. Note that here refers to the Hadamard (entry-wise) product.

∂J

∂w′=∂ 1

2

∑|j|j=1(yj − yj)2

∂w′= hᵀ

((y − y)

d

dz′f(z′)

)(23)

Next, we can calculate the gradients for the weights between the input layerand the hidden layer, and can be calculated as shown in Equation 24.

∂J

∂w=∂ 1

2

∑|j|j=1(yj − yj)2

∂w= xᵀ

((y − y)

d

dz′f(z′)w′

d

dzf(z)

)(24)

Now that we know the partial derivative of the weights over the cost J , wecan update the weights as shown in Equations 25 and 26. Here, η ∈ [0, 1] in-dicates the learning rate of the network and influences the speed and qualityof learning.

wi,j = wi,j − η∂J

∂w(25) w′i,j = w′i,j − η

∂J

∂w′(26)

We have now successfully applied a single step of the iterative processof gradient descent to minimize the objective function J and to learn theoptimal weights for this network. With this, we can now alternately applyforward propagation to calculate the output and backward propagation toupdate the weights of our neural network.

In our task of maintain the correct bearing towards the interaction targetin a noisy environment however, reasoning given only the input (i.e. obser-vation) at the current time step xt would not be enough to solve this task.Namely, we would not be able to disambiguate between noise and true obser-vations. The distinction between noise and true observations can be foundby looking at previously obtained observations as will be discussed in moredetail in Section 3.3.

Recurrent Neural networks allow for loops within the network architec-ture, essentially allowing for information to persist throughout the time steps,

32

Page 36: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

making them suitable for sequential decision making problems. In Figure 16,you can see a recurrent neural network containing a single self containinghidden layer. In such an RNN, the hidden layer at time step t is calculated

ot

h

xt

o0

h0

x0

o1

h1

x1

. . .

. . .

ot

ht

xt

Figure 16: An unfolded recurrent neural network

as follows:ht = f(x · w + ht−1 · U). (27)

In this equation you can see that another weight matrix U is add to thecalculation to indicate how the previous hidden layer ht−1 should contributeto the output of the current time step.

For updating the weights, an extention to backpropagation algorithmcalled backpropagation through time (BPTT) [38] can be used. BPTT un-rolls the RNN as shown in Figure 16, the equations for the partial derivativeslink one time step to the next, essentially generating a chain of equations thatcan be solved using backpropagation. It has however been shown that longterm dependencies are very hard to learn for RNNs [3]. The problem lies inthe fact that an important event that happened multiple time steps ago hasa very low influence on the current time step. This problem is for this reasonalso called the vanishing gradient problem.

An extension to the RNN called the Long short term memory (LSTM) [14]does not have this problem as it allows memory to preserved between timesteps by maintaining a so called cell state. In each time step we can access,add and remove activity from this cell state where important events couldbe stored into the cell state and maintain a large influence on the outputof the network multiple time steps later (i.e. solving the vanishing gradientproblem). An LSTM has the same chain like structure as an RNN, but ratherthan containing a single neural network layer, it contains four neural networklayers to control the status of the cell state and the output of the network.

In order to control the status of the cell state C, the LSTM uses neuralnetwork layers using sigmoid activation functions to decide what informationis allowed to be add, removed or outputted from the cell state. The outputof these sigmoid layers provide for each element in the cell state, a valuebetween 0 and 1 that indicates the quantity of information that is allowed topass trough. These sigmoid layers are for this reason also called gates, and

33

Page 37: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

in an LSTM there are 3 gates in total; a forget gate, an input gate and anoutput gate. In Figure 17, the structure of an LSTM can be found wherethese different gates can also be observed.

×

σ σ tanh σ

+

× tanh

×

ht−1

ct−1

ftCt

ctit

ot

ht

ht

xt

xt

connectionbraching pointmultiplicationsumconcatenationneural networklayerhyperbolic tan-gentlogistic sigmoid

Legend

×

+

tanh

σ

Figure 17: Long short term memory

We now provide an overview of the calculations for the different parts ofthe LSTM and show how these different parts interact with each other. Notethat the input of an LSTM at time time step t is given by the actual input xt,the previous hidden layer ht−1, and additionally also the cell state from theprevious time step Ct−1. Also note that each of the neural networks withinthe LSTM take as input the concatenation of xt with ht−1 which is indicatedas (ht−1‖xt).

The first step is to decide what information we want forget from our cellstate C (i.e. the forget gate). Note that although with an LSTM we wantto learn long term dependencies, there are also situations where we wantto be able to forget something from our cell state. We might for exampleget as input that we are working with a new interaction target and thatsome interaction target depended information is not applicable any more forthe current and next time steps and this information should therefore beforgotten. The output of the forget gate ft given the input xt and precioushidden layer ht−1 is calculated as follows:

ft = σ(wf · (ht−1‖xt) + bf ). (28)

This output ft is then applied to the cell state with a point-wise multiplica-tion, where a value of 0 will complete remove the activity and a value of 1will leave the activity as is.

34

Page 38: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

Next, we want to know what information should be add to the cell state.This addition consists of two parts; what information would we like to add(i.e. the new candidates) and what information do we allow to be add tothe cell state (i.e. the input gate). The new candidates for the cell state arecalculated by a neural network layer containing an tanh function given theinput xt and previous hidden layer ht:

Ct = σ(wC · (ht−1‖xt) + bC). (29)

Simultaneously, we look at what information we allow to be add to the cellstate:

it = σ(wi · (ht−1‖xt) + bi). (30)

This output it gives again a value between 0 and 1, where this time we donot add any activity to the cell state if the value is 0 and add all the activityto the cell state when the value is 1. Given what activity we should forgetft, and what we should add it with the corresponding new candidates Ct, wecan calculate the new cell state for time step t:

Ct = ft ∗ Ct−1 + it ∗ Ct. (31)

This is an point-wise multiplication of what we want to forget with an point-wise addition of what we want to add to the cell state.

To get the output of the network, we need to know what we want tooutput from our cell state C (i.e. the forget gate):

ot = σ(wo · (ht−1‖xt) + bo). (32)

Given the updated cell state Ct, we can calculate the output and hiddenlayer of the LSTM layer as follows:

ht = ot ∗ tanh(Ct). (33)

Note that a tanh is add to limit the cell state values between -1 and 1. Also,because the output of the LSTM is identical to its hidden layer ht, if we wanta specific output we need to add an additional neural network layer mappingfrom ht to the desired output.

3.2 Defining the state and action space

In Section 2.2, a state representation was already defined for the robot toextract social cost functions based on the Euclidean distance and bearingtowards the interaction target. In this state representation, the symmetry

35

Page 39: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

and view-ability of the screen of the robot was exploited to minimize thenumber of states in the state space. In our task of maintaining the correctbearing towards the interaction target however, this state representation cannot be used for the low level controllers as we need to able to distinguishbetween the orientation of our observations and actions (i.e. turn to the leftor right). This clashes with our previous exploitation of the symmetry of therobot, and requires us to define a new state representation for the task ofmaintaining the correct bearing towards the interaction target.

We know that regardless of the distance between the robot and the in-teraction target, we aim to maintain an as small as possible bearing towardsthe interaction target. For this task, we therefore only consider the bearingtowards the interaction target ϕ ∈ [−π, π] as a feature for the state represen-tation. You could argue here that because we already know that we want tominimize |ϕ|, a straightforward control policy keeping |ϕ| close to 0 would beenough to solve this task. Although this would be true if the agent was ableto perfectly detect and identify the interaction target, the difficulty of thistask comes from the fact that there is noise present in the observations of therobot. Moreover, it has be shown in previous experiments that not takingthis noise into consideration during decision making leads to a suboptimalcontrol policy [35].

As in Section 2.2, we discretize the state space as decision theoretic con-trollers have shown to have difficulties when dealing with continuous staterepresentations. Because we can not exploit the physical properties of therobot (i.e. symmetry and screen view ability) for this task, we divide thebearing ϕ into equally sized partitions to discretize the state space. For thisdiscretization, we need to choose the number of states which is a trade-offbetween:

Fewer states More states- Lower computational complexity - Higher computational complexity- Requires less experimental data vs - Requires more experimental data

to learn the optimal policy to learn the optimal policy- Lower precision - Higher precision

Given these trade-off’s, and the fact that we are not working towards a finalproduct but rather aim to compare two different frameworks against eachother, we have decided to divide the bearing ϕ into ten equally sized statesas shown in Figure 18. By using these ten states, the system should still havea high enough precision to properly maintain the correct bearing towardsthe interaction target while minimizing the computational complexity of the

36

Page 40: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

Figure 18: The equally sized partitioned state space based on the bearingtowards the interaction target with |S| = 10.

problem and the required experimental data needed to train the differentframeworks.

By using this discretization, we obtain the state space S with si ∈0, 1, ..., 9. The state is given by a mapping K : ϕ → S, where this map-ping K is given by applying K-means clustering on the sine and cosine ofthe bearing towards the interaction target (i.e. KmeansΦ([sin(ϕ), cos(ϕ)])).The centroids Φ for this clustering are given as shown in equation 34.

Φ = [sin(−π +2xiπ

10), cos(−π +

2xiπ

10)] (34)

where xi ∈ 0, 1, ..., 9.Next to the state space, also the action space A is shared between the

two frameworks to allow for a fair comparison between the two frameworks.This action space is also discretized, as actions in continuous space have alsoshown to be difficult to handle for decision theoretic frameworks, althoughthere exist methods to solve POMDPs with continuous action spaces forsmall-scale robot tasks [36].

A straightforward representation of the action space for a low level con-troller could in principle be defined as the minimalistic set of turning left,idle and turning right. This representation might however be a bit too min-imalistic as it does not take the speed of the action into equation. There-fore, in order to not explode the size the action space but still be able todistinguish between different levels of velocity, we have decided to dividethe different velocities in two categories; slow (cmd vel = 0.3) and fast(cmd vel = 3.0). We distinguish between these two different levels of ve-locity to allow for a more natural behaviour of the robot (e.g. a slow turningaction when the interaction target is detected in an adjacent state and a fast

37

Page 41: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

turning action when the interaction target detected behind the robot). Byallowing these two levels of velocities, we obtained our action space A withAi ∈ left fast, left slow, idle, right slow, right fast.

3.3 Semi-asynchronous execution

When working with robots, we need to consider the design choice of when therobot should take an action. A straightforward design would be to let the au-tonomous agent take an action at a pre-defined time interval (i.e synchronousexecution). This method however, has the downside that an execution inter-val needs to be chosen and fine-tuned by the designer of the robot. A longinterval could for example lead to the robot not being able to react to certainevents, whereas a short interval could lead to many idle actions and unnec-essarily complicate the decision making process. Another possible executiondesign is to only let the agent take an action when it observed a change inits environment (i.e. asynchronous execution).

In order to compare the two designs, consider a very simplistic robotthat can only move forward and turn to the right. Moreover, this robot canonly detect objects that are in very close proximity in front of him. In thesynchronous execution setting, the robot executes actions at a predefined in-terval where the actions are based on latest detection that the robot received(i.e. a simple policy could be: “nothing detected” → move forward, “objectdetected”→ turn to the right). In this design, it is however possible that theagent detects an object right after an action was executed, and that if theexecution interval if not short enough the robot is unable to react in time tothe object in front of him.

In the asynchronous execution setting on the other hand, the robot onlyperforms an action when its observation changes. This means that whenthe detection changes from “nothing detected” to “object detected”, therobot can immediately react and start turning to the right. Once there isnothing detected in front of the robot (i.e. the detection changes from “objectdetected” to “nothing detected”), the robot can start driving forward again.

As you can see from this example, the asynchronous execution designprovides a more precise solution to this problem as the agent only executes anew action when it obtains new information regarding its environment. Thereis however also downside to this design, namely that it does not provide anyinformation regarding the time it took between the events. You can arguehere that we could add a time variable to our events to provide this additionalinformation, this would however add a continuous space feature to our statespace and would make solving the problem significantly more complicated.

Another possible solution is to combine the two execution designs (i.e.

38

Page 42: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

semi-asynchronous execution) by acting on events and additionally raising atimer event in case no events were perceived within a pre-defined time pe-riod. By using this semi-asynchronous execution design, the agent is ableto immediately react on observations changes and can additionally use thetimer events to get a lower bound on the execution rate. To get a better un-derstanding of how the semi-asynchronous execution influences the decisionmaking, we will compare the asynchronous and semi-asynchronous executiondesigns with respect to our task.

In the asynchronous execution setting, the events e that both frameworkscan receive consist of the states in which the interaction target was detectedwith an additional “no detection” predicate nd (i.e. e ∈ S∪nd). In Figure19, we show the events that might be triggered during the experiment whileusing an asynchronous execution design.

time

5

e1

0

e2

5

e3

0

e4

Figure 19: Example of events triggered during asynchronous execution

In this example, we received a total of 4 events ε during our asynchronousexecution:

ε = 5, 0, 5, 0.Even to an expert, it would impossible to identify the true state given only ε(i.e. there is an 50-50 percent change that we are in either one of the statesbased on the number of event triggers that we received for each observation).

By making the execution design semi-asynchronous however, additionaltimer events are provided that can indicate the duration we were presentin a certain state. This addition is especially important when you considerthat we are working with a telepresence robot and that the majority of thetime, the interaction target will not move during conversion. In Figure 20,we show the events that might be triggered during the same experiment asin Figure 19, but this time using an semi-asynchronous execution design thatadditionally includes an timer event T (i.e. ε ∈ S ∪ nd, T).

In this example, we received a total of 8 events ε′ where we received 5timer events after observing the interaction target in state 5 and did notreceive any timer events after observing the interaction target in state 0.

ε′ = 5, T, T, T, 0, 5, T, T, 0

These timer events can be used to disambiguate the state of of the robot.Namely, we could argue here that because we received more timer events

39

Page 43: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

time

5

e1

T

e2

T

e3

T

e4

0

e5

5

e6

T

e6

T

e7

0

e8

Figure 20: Example of events triggered during semi-asynchronous execution

after observing the interaction target in state 5, there is a higher probabilityof the interaction target being in state 5 rather than state 0. This addi-tional information can also be used by the decision making frameworks todisambiguate between false positives and true detections. Additionally, thesetimer events can be used to change action after a certain period of time (i.e.the timer events provide us with additional decision steps). For example, wemight receive an observation that we at first presume to be a false positive,but after receiving multiple timer events we can change our belief that itmight have been a correct detection and update our actions accordingly.

3.4 Initial experiment (data-gathering)

In order to obtain the experimental data required to train the different frame-works, we first conducted an initial experiment where robot acted on a ran-dom policy. This experiment took place in the same interaction room asdescribed in Section 2.3.1, but however with a slight change in the experi-mental set-up. The expert for example was excluded from the experiment asthe robot acted on its own policy rather than being controlled in an “Wizardof Oz” manner. Additionally, we have relieved ourselves from the task ofsetting up a visitor for this experiment as 1) we do not require feedback fromthe visitor to solve this task and 2) the robot acting on a random policywould make a real conversation between the visitor and interaction targetclose to impossible. For this reason, the interaction target was instructed toact as though he/she was having a real telepresence conversation and wouldadditionally reposition when the robot:

1. executed an idle action and did not show any signs turning away

2. was unable to correctly detect the interaction target

The robot followed the following random policy; at every event trigger ε(i.e. observation change) randomly sample an action a ∈ A and execute thataction until a new event trigger has been received. This experiment ran for atotal of 15 minutes (3× 5 minutes), and afterwards the recorded experimentaldata was manually annotated to provide the ground truth position of the

40

Page 44: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

(a) Top view camera (b) Robot Kinect view (c) Rviz

Figure 21: Experiment overview

interaction target during the entire length of the experiment. In Figure 21,an overview of the experiment is given where you can see the interactiontarget interacting with the robot.

3.5 Implementation of frameworks

Now that we have obtained an annotated data-set containing the detectionsand ground truth position of the interaction target while the robot executeda random policy, we can start implementing our frameworks. In this sec-tion, we describe the implementation of an event driven POMDP in Section3.5.1 and the implementation of the LSTM in Section 3.5.2. In section 3.6,both frameworks will be compared to each other where for the sake of faircomparison, both frameworks are trained on the same previously mentionedannotated data-set.

3.5.1 Partially Observable Markov Decision Process (POMDP)

As explained in Section 3.1.1, a POMDP consists on a state space S, adiscrete set of actions A, a reward function R, a transition function T , anobservation O and a discount factor γ. The state space S and action space Aof our POMDP are already defined in Section 3.2. In this section however, wediscuss our method of extracting the transition function T and observationfunction O from the data obtained from the initial experiment discussed inSection 3.4 and discuss our design choices with respect to the reward functionR and the discount factor γ. Finally, we show how we solved this generatedPOMDP model and how this model can be applied on the robot.

In order to build the transition function T and observation function O, weuse the annotated data from the initial experiment. This data includes thegold standard position of the interaction target during the entire experiment,where this gold standard position can be used to extract the state s ∈ S for

41

Page 45: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

each decision step. In our implementation we use a straightforward frequen-tist approach to calculate the probabilities for the transition and observationfunction, although more complex approaches can also be taken to learn theseprobabilities as will be further discussed in Section 3.6.

First we extract the transition function, which specifies the probabilityover the next states, given the current state and the action that was per-formed and led to the next state (i.e. Pr(st+1|st, at)). To do this, we maintaina dictionary υ that for each state action pair contains the list of states thatwere visited at the succeeding decision step after performing action a ∈ Ain state s ∈ S. At every new decision step (i.e. event trigger), we take theprevious state st−1, the current state st and the action at−1 and add it to thedictionary υ (i.e. υsa = υsa + st+1). The probability for the next state, giventhe current action and current state can be calculate as follows7:

Pr(st+1|st, at) =

∑∀x∈υsa [x = st+1]

|υsa|. (35)

Although this would provide us with a completely valid transition func-tion, you could argue here that because we are working with a straightforwardlow level control problem, that we could generalize over the state action pairsto obtain a higher accuracy. For example, rather than saying that turningleft at state 5 gives a high probability of ending up in state 4, we can ap-proach this from a more relative point of view (e.g. turning left at state stgives a high probability of ending up in the state left of st).

To achieve this generalization, we keep a dictionary ξ that for each actiona ∈ A maintains a list of state changes (i.e. ξa = [st+1 − st, . . . ]). At everynew decision step, we take the previous state st−1, the current state st, andthe action a that led to the current state and add it to the dictionary ξ(i.e.ξa = ξa + (st+1 − st)). The probability of state st+1 given the action atand state st is given as:

Pr(st+1|st, at) =

∑∀x∈ξa [x = st+1 − st]

|ξa|(36)

By doing this for every point in the data-set, we extracted our transitionfunction from the data. The probabilities obtained for this transition functioncan be found in Table 1 which is located in the Appendix. In Figure 22, thetransitions probabilities are plotted for all actions a ∈ A where (for the sakeof plotting) we assume st = 5 (i.e the interaction target is located in front ofthe robot).

7Note that in Equation 35 and 36, [. . . ] are the Iverson brackets. [P ] is defined to be1 if P is true, and 0 if it is false.

42

Page 46: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

(a) Left fast (b) Left slow (c) Idle

(d) Right slow (e) Right fast

Figure 22: Transition probabilities for the actions: a) Left fast, b) Left slow,c) Idle, d) Right slow and e) Right fast, where a value of 1 corresponds towhite and a value of 0 corresponds to black.

Now that we have extracted the transition function, we can continue withthe observation function. As we are working with an event-driven POMDP,the observation function provides a probability of the observation event o ∈ Ogiven the state change s ∈ S, s′ ∈ S and action a ∈ A where the observationevents are functions of state changes. In order to calculate the probabilitiesfor these observation events, we need to match the state changes in our anno-tated data-set with their corresponding observation event. To do this, we gothrough all the state changes in the data-set and search for each state change;a corresponding observation event that triggered at the same time. Note thatbecause a state change and an observation event will almost never trigger atexactly the same time, we allow for a ±0.5 seconds in difference by addinga buffer f = 0.5 before setting the state change as a False Negative (FN)observation event oFN (i.e. the interaction target changed state but we didnot observe it). This buffer f allows for small imperfections in the detectionsof the interaction target, and additionally allows for small imperfections inthe manually annotated ground truth.

Once all the state changes in the data-set have been processed, we focus onthe remaining observation events that did not get matched. As there are nomatching state changes for these observation events, these observation events

43

Page 47: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

must be False Positives (FP) (i.e. we observed a change in our observationsbut the interaction target did not change its state). In Figure 23, we providean example of how these observation events and state changes are matchedtogether.

time

SO

f

O

FP

S

FN

OS

Figure 23: Matching state changes with observation changes where S indi-cates state changes and O indicates observation events

As was the case for the transition function, we maintain a dictionary tocalculate the probabilities for the observation events. More concretely, wemaintain a dictionary Ω that for each state change s ∈ S,s′ ∈ S and actiona ∈ A, maintains a list of observation events (i.e. Ωss′a = [o, . . . ]). Here, thepossible observation events o ∈ O are given as follows:

s, s′ : os,s′ Detection at state s ∈ S changed to detection at state s′ ∈ Ss, s′ : oT Timer events, nd : oLD Lost detectionnd, s′ : ond,s Regained detection at state s′ ∈ Ss, s′ : oFN False Negative observation event for the state change s,s′

Note that we use oLD rather than o(s,nd) for losing the detection of the in-teraction target as the previous time step already provided us with in whatstate the previous detection was located. Using this symbol saves us |S| − 1possible observation events for our observation function. The same idea cannot be used for regaining a detection as this event provides us with new in-formation of where the interaction target is located. Also, note that FalsePositives observation events do not have their own symbol as their change inobservation is add for the state change s ∈ S = s′ ∈ S.

Given these observation events and our dictionary Ω, we can calculate theprobability of the observation event o ∈ O given the state change s ∈ S,s′ ∈ Sand action a ∈ A:

Pr(ot+1|st, s′t+1, at) =

∑∀x∈Ωss′a

[x = ot+1]

|Ωss′a|. (37)

Now that we obtained the observation function, we can continue with thelast two elements of the environment model, namely the reward function and

44

Page 48: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

the learning rate. For defining the reward function, we look at what taskthe agent needs to perform. In this case, we want the agent to maintain thecorrect bearing towards the interaction target (i.e. keeping the interactiontarget directly in front of the robot). There are two straightforward methodsof how we could represent the rewards function for this task, 1) as a linearfunction gradually proving a higher rewards as the interaction target movescloser to the front of the robot (see Figure 24a) and 2) only rewarding theagent when the interaction target is in front of the robot (see Figure 24b).By applying the POMDP policy on the robot, we observed that the Linear

(a) Reward as linear function (b) Reward only best best position

Figure 24: Reward functions, where a value of 1 corresponds to white and avalue of 0 corresponds to black.

reward function was exploited by the POMDP. This exploitation was due tothe agent preferring more state changes and so always choose a fast turningaction to visit as many states as possible around the state in front of therobot. We therefore switched to only positive reward when interaction targetis in front of robot as our reward function. Here, changing to a suboptimalstate does not yield any reward, making this reward function less vulnerableto exploitations.

The discount factor provides a trade-off between future rewards andpresent rewards. For our problem, we set the discount factor γ ∈ [0, 1] to beγ = 0.95 as this was the largest possible value before our problem becameunsolvable. Generally, we try to pick an as high as possible discount factorto be able to plan far into the future. Note that a too large discount fac-tor would make the problem unsolvable whereas a too small discount factorwould not take the future rewards enough into consideration.

Finally, to obtain our policy from the generated environment model, wemodelled the POMDP problem in openmarkov8 and solved this model using

8http://www.openmarkov.org/

45

Page 49: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

the CC-Perseus planning algorithm. From this we obtained our policy thatis essentially a collection of α-vectors. To apply this policy on the robot, weused the ROS MDP package 9.

3.5.2 Long Short Term Memory (LSTM)

Implementations of the LSTM framework are available in various softwarepackages such as Keras [8]. In order to train the LSTM however, we needexpert examples that for each triggered observation event provide the correctaction that should have been performed during that event trigger. In thissection, we will discuss how we generated these expert examples and addi-tionally discuss some of the design choices made while building the LSTMmodel.

To obtain the expert examples, we need to know the true state of theagent during the observation events. The annotated data from the initialexperiment can be used to extract the true state of the robot, and we asexperts could in principle manually annotate the correct actions for eachevent trigger. Although this would in principle provide us with the datarequired to train the LSTM, it would be extremely time-consuming work tomanually annotate the correct actions for all the event triggers.

Because in the annotated data set we already have access to the truestate, we could use this information to automatically generate the expertaction examples. The challenge of our task is that we do not know thetrue state and that we are therefore acting in a partially observable domain,knowing the true state on the other hand makes the problem fully observableand relatively straightforward to solve. Namely, given the annotated data,we can describe our problem as a fully observable MDP. By doing so, we canobtain expert examples generated by the MDP for every decision step in theinitial experiment.

First we discuss how the environment model for this MDP is built up.The environment model for a MDP consists of a transition function and areward function. As we are using this MDP to relieve ourself from manuallyannotating the event triggers, we can use our expert knowledge to build thetransition and reward function for this problem (i.e. we know how the robotshould behave and can therefore model this behaviour in the environmentmodel). For the transition function, we can for example say that when therobot is idling, the state does not change. Additionally, we can say that aslow turning action to the left/right will always lead to the adjacent state onthe left/right, and that a fast turning action has a 25% change of overshooting

9http://wiki.ros.org/markov_decision_making

46

Page 50: Decision-Making under Uncertainty for Social Robots · Social acceptability however, is a subjective measurement of performance ... Next, we give an overview of the robots hardware

to the state after that. The values for this generated transition function canbe found in Table 2 located in the appendix.

Also, as we do not have any noise in our transition function, the rewardfunction shown in Figure 24a and the reward function shown in Figure 24blead to the same optimal policy for this MDP (i.e. the reward functions cannot be exploited in either of the two representations). In our implementation,we used the reward function as shown in Figure 24a but we could in principlealso have used the same reward function as we used for the POMDP whichwould have provided an identical policy. This MDP is then solved by valueiteration to obtain our policy π∗ that for each state s ∈ S returns the optimalaction a ∈ A.

Now that we can generate the expert examples from the annotated data,we can focus on how to use these examples as data points to train the LSTM.Note that during training, the input of the LSTM are the observation eventsand the output of the network are the optimal actions given the underlyingtrue states. After learning the LSTM model, only the observation is pro-vided where we hope to obtain the correct action even though some of theobservations might be false positive detections.

As we are working with a discrete set of observations, we can use an one-hot encoding to describe the observations for the input of the network. Asan exception, we set the values for the observation having no detection allto zero. The one-hot encoding representation for our observations thereforelook like the following:

o0 = 1, 0, 0, 0, 0, 0, 0, 0, 0, 0o1 = 0, 1, 0, 0, 0, 0, 0, 0, 0, 0

...

o9 = 0, 0, 0, 0, 0, 0, 0, 0, 0, 1ond = 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Note that because a LSTM is a recurrent network and that we need toprovide the last n observations to calculate the output. In our implemen-tation we have chosen to use the last 5 observations n = 5 as input for theLSTM. We choice this by applying different parameters for n and validatingthe performance of the LSTM in the simulation environment. The input forthe LSTM with n = 5 therefore becomes:

input = ot−4, ot−3, ot−2, ot−1, ot

where t indicates the time step. When we receive a timer event oT, the input is given as a copy of the last observation event, such that ot = ot−1.
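
As an illustration, the sliding-window input described above can be assembled as follows; the helper names and the choice to left-pad short histories with "no detection" are assumptions made for this sketch.

import numpy as np

N_OBS, WINDOW = 10, 5

def one_hot(obs):
    """obs in 0..9, or None for 'no detection' (encoded as all zeros)."""
    vec = np.zeros(N_OBS)
    if obs is not None:
        vec[obs] = 1.0
    return vec

def lstm_input(history):
    """history: observation indices (None = no detection), most recent last.
    Returns a (WINDOW, N_OBS) array holding o_{t-4}, ..., o_t."""
    window = [None] * max(0, WINDOW - len(history)) + history[-WINDOW:]
    return np.stack([one_hot(o) for o in window])

# A timer event o_T simply repeats the previous observation event:
# history.append(history[-1])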


As shown in Section 3.1.2, the LSTM outputs its hidden layer h. In order to obtain a distribution over the actions as output, however, we need an additional neural network layer that maps from |h| elements to |A| elements with a softmax activation function. To train the network, a one-hot encoding of the actions a ∈ A is again used:

left fast = 1, 0, 0, 0, 0
left slow = 0, 1, 0, 0, 0
idle = 0, 0, 1, 0, 0
right slow = 0, 0, 0, 1, 0
right fast = 0, 0, 0, 0, 1

We have now obtained all the elements needed to train the LSTM: the output of the network is computed by feeding the observations into an LSTM layer and then mapping the output of that layer to a distribution over the possible actions with an additional neural network layer. After training, we can input the robot's last n observations and obtain an action by taking the highest-valued action from the output distribution.

When building an LSTM layer in Keras, additional regularization parameters can be set that can change the behaviour of the robot quite significantly. For example, the "W regularizer" applies an L1 or L2 regularization to the input weight matrices. This can be used to control how much the present observation contributes to the output; practically, adjusting this parameter makes the system more or less sensitive to the current observation. Additionally, there is a "U regularizer", which applies an L1 or L2 regularization to the recurrent weight matrices. By adjusting this parameter, we can make the network more or less sensitive to the previous observations that we received.

In our implementation, we set the "W regularizer" to apply an L2 regularization of 0.2, making the agent less sensitive to the current observation and overall more robust to false positive detections. This value was again chosen and validated by trying different settings in the simulation environment.
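
A minimal sketch of this network in Keras is given below, using the Keras 1.x parameter names referred to in the text (in later Keras versions "W_regularizer" and "U_regularizer" are called "kernel_regularizer" and "recurrent_regularizer"). The hidden-layer size and the optimizer are illustrative choices, not values taken from the thesis.

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

N_OBS, N_ACTIONS, WINDOW, HIDDEN = 10, 5, 5, 32   # HIDDEN is an assumption

model = Sequential()
# LSTM over the last 5 one-hot observations; L2 of 0.2 on the input weights
# makes the output less sensitive to the current (possibly false positive)
# observation.
model.add(LSTM(HIDDEN, input_shape=(WINDOW, N_OBS), W_regularizer=l2(0.2)))
# Additional layer mapping the |h| hidden units to a distribution over the
# 5 actions.
model.add(Dense(N_ACTIONS, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Training uses X of shape (samples, WINDOW, N_OBS) and one-hot expert
# actions y; at execution time the greedy action is
# model.predict(x)[0].argmax().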

3.6 Comparing POMDP and LSTM

In order to compare the two frameworks, we ran an additional experiment with the same set-up as our initial experiment in Section 3.4, but this time, rather than using a random policy, we alternated between the POMDP policy and the LSTM policy. By doing so, we obtained 6 × 5 minutes of experimental data running the POMDP policy and 6 × 5 minutes of experimental data running the LSTM policy.

Before comparing the two frameworks on their performance, we first look at the aspects in which they differ and how these could influence their performance. The main difference is that the POMDP approach first learns a model of the environment and then derives a policy from that model, whereas the LSTM is learned end-to-end. The advantage of building a model of the environment is that it can be reused to solve different problems by changing the reward function. For example, if we first wanted a social robot that always tries to face the interaction target, but now want the robot to turn to a goal bearing while avoiding the interaction target, a change in the reward function would be enough to solve this new task. This flexibility is not present in the LSTM; there, we would have to provide new expert examples for that particular task.

There is, however, also a disadvantage to building a model of the environment: constructing such a model is fairly difficult due to its many moving parts (i.e. the transition function, observation function and reward function). The LSTM, on the other hand, is optimized for our specific task and derives its policy from expert examples. We therefore expect the LSTM to perform better in our task of maintaining the correct bearing towards the interaction target, whereas the POMDP implementation allows for generalization over multiple tasks.

In order to provide a performance measurement for the policies of the two frameworks, we score the behaviour of the robot by the percentage of time that the interaction target was in the ideal position (i.e. in front of the robot). If the robot behaves optimally, the interaction target will always be in front of the robot and we obtain a score of 100%. Any suboptimal behaviour (i.e. turning towards false positives or not following the interaction target) is penalized by the fact that the interaction target will not be in front of the robot. This score (i.e. the percentage of time in the correct position) is calculated for all 12 (6 × 2) experimental runs and provides a performance measurement for the comparison between the frameworks.
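
As a sketch, the score for a single run can be computed from the annotated sequence of true bearing states as follows (assuming, as before, that state index 5 corresponds to the target being directly in front of the robot):

import numpy as np

def run_score(true_states, front_state=5):
    """Percentage of time steps in which the target was directly in front."""
    true_states = np.asarray(true_states)
    return 100.0 * np.mean(true_states == front_state)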

It is possible that there is no significant performance difference between the two frameworks; in order to test this, we state the following hypothesis:

H2 := The POMDP framework and the LSTM framework perform equally well on the task of maintaining the correct bearing towards the interaction target.

To test hypothesis H2, we applied a t-test on the sets of scores for both frameworks. This test resulted in a p-value of 0.04, which rejects the hypothesis and shows that there is a statistically significant performance difference between the two frameworks. Figure 25 shows a box plot of the scores of both low-level controllers; it can clearly be seen that the LSTM performs better than the POMDP in our task of maintaining the correct bearing towards the interaction target.
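
The test itself is a standard two-sample t-test over the six per-run scores of each framework; a sketch using SciPy is shown below. The score arrays stand for the measured percentages, which are not reproduced here.

from scipy import stats

def compare_frameworks(pomdp_scores, lstm_scores, alpha=0.05):
    """Two-sample t-test on the per-run scores; H2 is rejected if p < alpha."""
    t_stat, p_value = stats.ttest_ind(pomdp_scores, lstm_scores)
    return p_value, p_value < alpha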

Figure 25: Boxplot of scores for the POMDP framework vs LSTM framework

An important difference that we observed between the POMDP and the LSTM is that the model built for the POMDP also takes the behaviour of the interaction target into account. For example, we observed that the POMDP occasionally chose the action idle over a turning action when the interaction target was in a state adjacent to the optimal state (i.e. in either state 4 or 6). We assume this is because the interaction target tends to move to the optimal state to optimize their experience when having a telepresence conversation through the robot, which leads to higher transition probabilities for the idle action to end up in the correct bearing state.

Having compared the two frameworks with respect to performance, we now discuss their ease of implementation. Both frameworks require the ground-truth position of the interaction target. For the POMDP, we need to extract a transition and an observation function from the experimental data, and additionally define a reward function and set the discount factor. For the LSTM, on the other hand, we need to provide expert examples of the actions the robot should have taken, which must either be annotated manually or, as in our case, annotated automatically using an MDP given the ground-truth position of the interaction target.


4 Conclusion

In the first part of this thesis, we extracted social cost functions to establish a performance metric for the behaviour of our autonomous agent in a social setting. These social cost functions covered the two features that our partners identified as relevant for solving the task of positioning the TERESA robot in a socially acceptable manner during conversation, namely the distance and the bearing relative to the interaction target.

For the distance, we identified proxemics-based interaction ranges from the feedback collected during the experiments, so as to learn and distinguish the distances at which the agent should approach the interaction target for the personal, social, and public proxemic zones. During our experiments, we observed (by adding background noise to simulate hearing difficulties) that the distribution of these interaction ranges changes in the presence of hearing problems. We therefore extracted additional proxemics-based interaction ranges that describe at what distance the agent should approach the interaction target when hearing problems are detected. The extracted social cost function with respect to distance describes the social cost for each of the interaction ranges; the highest reward (i.e. lowest cost) is obtained when the agent finishes its approach within the social proxemic zone. Especially in a multi-objective setting (discussed in more detail in Section 5), the trade-off between the social costs of the different proxemic ranges and the cost of another objective can lead to a challenging decision-making problem. It might, for example, be better to finish the approach in the personal proxemic zone rather than the social proxemic zone in order to optimize a second cost function such as the detectability of the interaction target.

As for the bearing, we used the data set provided by our partners at the University of Twente to extract a social cost function with respect to the bearing, and additionally performed a statistical analysis on the state features for a group conversation setting. This test indicated that the bearing of the robot relative to the center of the group has a direct influence on the grade of social correctness given by a participant in the group. The bearing with respect to the participant, on the other hand, did not influence the grade of social correctness for that participant. With respect to the state features, one could therefore argue that when we want to define a performance metric for the behaviour of our autonomous agent in a group conversation setting, the bearing with respect to the center of the group should be considered.

In the second part of this thesis, we performed a comparison between two different types of low-level controllers to assist future development of such a controller for the TERESA project. In order to solve our task of maintaining the correct bearing towards the interaction target while acting in a noisy environment, we implemented and compared two different frameworks: recurrent neural networks and Partially Observable Markov Decision Processes. Through real-world experiments, we have shown with statistical significance that the LSTM outperforms the POMDP framework for our task. This indicates that the LSTM framework shows good potential for solving the low-level control tasks of the TERESA robot.

Another contribution of this thesis to the TERESA project is the data sets generated during the experiments discussed in Section 2.3 and Section 3.4. These can, for example, be used to learn new environment models for the TERESA project to solve different tasks (i.e. tasks with different state and observation functions). Also, cost functions can be extracted based on the detectability of the interaction target, given the sensor data recorded by the TERESA robot during the conducted experiments.

We thus made three major contributions to the TERESA project. First, we provided a method for extracting social cost functions to establish a performance metric for the behaviour of our autonomous agent in a social setting, and additionally performed a statistical analysis on the state features for a group conversation setting. Next, we compared low-level controllers to assist further development of such a controller for the TERESA project and showed that an LSTM has high potential for solving the task of positioning in a socially acceptable manner. Finally, the data sets generated during the experiments of this thesis can be used for future research within the TERESA project. These contributions bring us closer to the ultimate aim of a socially well-behaved telepresence robot.

5 Future Work

In the first part of this thesis we focused on the social positioning of the robot, for which social cost functions were extracted to describe a measure of social correctness for the agent. By converting these social cost functions into a reward function for the decision-making framework, the agent is able to optimize its actions so as to behave in a socially optimal way. This may, however, not be the only objective function that we want to optimize. We could, for example, also have an additional cost function for the detectability of the interaction target, which can be used to maximize the probability of correctly detecting the interaction target. Multi-objective POMDPs [26] allow multiple objectives to be added to the decision-making process, essentially allowing it to make trade-offs between them. In the previously mentioned example, we might want to be closer to the interaction target to obtain a better detection probability, but on the other hand not get too close, as that might become socially unacceptable.

In our implementation of the POMDP, we used a straightforward frequentist approach to learn the transition and observation functions. This, however, does not include any smoothing and can lead to a very crude estimate of the environment model. By, for example, training a neural network on the data obtained from the experiment, we could estimate the transition and observation functions by querying this trained network. This would lead to much smoother functions for the environment model and in turn to better performance.
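
For reference, the frequentist estimate amounts to normalized transition counts over the recorded (state, action, next state) triples. The sketch below also shows additive (Laplace) smoothing as one simple remedy for the crude estimates mentioned above; the smoothing is an assumption of this sketch, not something used in the thesis.

import numpy as np

def estimate_transitions(episodes, n_states, n_actions, alpha=0.0):
    """episodes: iterable of (states, actions) sequences from the experiments.
    Returns T[a, s, s'] as normalized counts; alpha > 0 applies additive
    smoothing."""
    counts = np.full((n_actions, n_states, n_states), float(alpha))
    for states, actions in episodes:
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            counts[a, s, s_next] += 1.0
    totals = counts.sum(axis=2, keepdims=True)
    totals[totals == 0] = 1.0        # avoid division by zero for unseen pairs
    return counts / totals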

In more recent work, the LSTM and POMDP frameworks have also been combined, with LSTMs being used to solve POMDP problems [39]. As neural networks are general function approximators, they can be used to learn POMDP value functions from data. This would allow us to avoid POMDP planners, which are known to scale poorly with the problem dimensions. Additionally, it allows the policy to be learned directly as a mapping from histories of observations to actions, which avoids modelling the latent state space explicitly. In a POMDP, however, the observations are not Markovian (i.e. the output of the network does not depend only on the present observation), so the network must be able to keep memory between time steps in order to solve the POMDP problem. As shown in Section 3.1.2, LSTMs are able to keep memory between time steps, making them well-suited to learning POMDP value functions.

During the experiment described in Section 2.3.1, an additional camera was also placed facing the pilot. The processing of facial expressions, however, is the research of our partners at Imperial College London [37], and this video data was not used in this thesis other than to extend our data set for future use. These facial expressions could, for example, be used to derive instantaneous feedback from the pilot, which could be used to update the actions of the agent accordingly. Additionally, this feedback could be used to obtain data for an additional objective in a multi-objective setting.


A Deploying the TERESA code

In this section, we explain how to deploy the TERESA code on your system and how to resolve some common errors you may encounter while doing so. For this installation guide, we assume a clean installation of Ubuntu 14.04 (Trusty Tahr) with ROS Indigo installed as described on the ROS page10 (following the introductory tutorial is also highly recommended). The code for the TERESA project is located in a Bitbucket repository11 as well as on a server at the University of Amsterdam (UvA)12. To deploy this repository to a new catkin workspace in your home folder, you can run the following command:

$ ( cd $HOME && mkdir teresa_catkin_pkgs && cd teresa_catkin_pkgs && mkdir src && cd src && git clone https://bitbucket.org/jmessias/teresa_catkin_pkgs.git . && sh ws_setup.sh )

In case you get the error "catkin_make: command not found", make sure that you have sourced ROS correctly in your bashrc. At this stage, errors and warnings may appear because standard libraries are missing that are not included in the source code of the TERESA project. For example, if you encounter the error "package 'orocos-bfl' not found", you can install the package ros-indigo-navigation to solve this issue. Packages can be searched for and installed with the following commands:

$ sudo apt-cache search ...
$ sudo apt-get install ...

If the package cannot easily be found, or if you are not sure which package to install, remember that Google is your friend. To re-compile the code, you can run the following:

$ cd $HOME/teresa_catkin_pkgs
$ catkin_make

If there are still errors in some packages, you can pass the argument "--keep-going" to catkin_make, which skips the packages that do not compile. This is especially useful if you get errors in packages that you will not be using and still want to compile the rest.

10 http://wiki.ros.org/indigo/Installation/Ubuntu
11 https://bitbucket.org/jmessias/teresa_catkin_pkgs.git
12 [email protected]:teresa_catkin_pkgs


A.1 Simulator

The simulator used with ROS is the Gazebo simulator13. The Gazebo simulator with the TERESA environment can be started with the following command:

$ roslaunch teresa_gazebo teresa_gazebo.launch

Again, this may yield some errors due to missing packages that can be easilyresolved. First make sure that the following packages are installed:

- ros-indigo-controller-manager
- ros-indigo-joint-state-controller
- ros-indigo-position-controllers

Another error that you may encounter is "Error [Param.cc:181] Unable to set value ...", which can be fixed by setting LC_NUMERIC=C in your bashrc. The next possible issue is missing model files, which are located in the source code of the TERESA project but are not yet linked properly. The path of the models (teresa_catkin_pkgs/src/teresa_gazebo/models) should be added to the path of the Gazebo models (GAZEBO_MODEL_PATH). If everything is configured correctly, the simulator should look something like Figure 26.

Figure 26: Gazebo simulator with the TERESA environment

To get a visualization of what the sensors of the robot are perceiving, a tool called Rviz can be used. Rviz is installed with ROS by default; however, to run it with the correct parameters for the TERESA project (including the robot model), it is easiest to launch it as follows:

13http://gazebosim.org/


$ roslaunch teresa_gazebo teresa_rviz.launch

An error that may occur while launching this file is "ResourceNotFound: rrbot_description"; this happens because we are missing the package rrbot_description, which is included in the gazebo_ros_demos project14. This project can be installed as follows:

$ cd $HOME && mkdir -p catkin_ws/src
$ cd catkin_ws/src
$ git clone https://github.com/ros-simulation/gazebo_ros_demos.git
$ cd ..
$ catkin_make

If everything is configured correctly, Rviz will look something like Figure 27. On the ROS page15, you can find detailed documentation and tutorials on how to use Rviz.

Figure 27: Rviz with TERESA environment

The lines added to my bashrc to make everything work are as follows:

source /opt/ros/indigo/setup.bash
source ~/catkin_ws/devel/setup.bash
source ~/teresa_catkin_pkgs/devel/setup.bash --extend
export GAZEBO_MODEL_PATH=~/teresa_catkin_pkgs/src/teresa_gazebo/models:$GAZEBO_MODEL_PATH
export LC_NUMERIC=C

14 https://github.com/ros-simulation/gazebo_ros_demos
15 http://wiki.ros.org/rviz

The robot can now be controlled by publishing messages of type geometry_msgs/Twist to the topic /cmd_vel, e.g.:

$ rostopic pub -1 /cmd_vel geometry_msgs/Twist -- '[2.0, 0.0, 0.0]' '[0.0, 0.0, 0.0]'

This command will move the robot forward, as can be seen in the simulator. There are also scripts available in the source code of the TERESA project that allow full control of the robot using a joystick.
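
The same can also be done programmatically; the following is a minimal rospy sketch (the node name is an arbitrary choice) that publishes the same forward command on /cmd_vel:

#!/usr/bin/env python
import rospy
from geometry_msgs.msg import Twist

# Drive the simulated robot forward by publishing Twist messages on /cmd_vel.
rospy.init_node('teresa_drive_forward')
pub = rospy.Publisher('/cmd_vel', Twist, queue_size=10)
rate = rospy.Rate(10)                  # publish at 10 Hz
msg = Twist()
msg.linear.x = 2.0                     # forward velocity (m/s)
while not rospy.is_shutdown():
    pub.publish(msg)
    rate.sleep()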


B Value tables

Table 1: Transition function POMDP

State \ Action   left fast   left slow   idle     right slow   right fast
St − 4           0.0000      0.0000      0.0038   0.0072       0.0000
St − 3           0.0102      0.0031      0.0057   0.0096       0.0213
St − 2           0.0000      0.0093      0.0132   0.0193       0.0319
St − 1           0.0816      0.0679      0.0660   0.2000       0.5000
St               0.4082      0.6636      0.7981   0.6723       0.3830
St + 1           0.3673      0.2222      0.0755   0.0627       0.0532
St + 2           0.0612      0.0278      0.0113   0.0072       0.0000
St + 3           0.0510      0.0031      0.0113   0.0072       0.0106
St + 4           0.0102      0.0039      0.0000   0.0048       0.0000
St + 5           0.0102      0.0000      0.0152   0.0096       0.0000

Table 2: Transition function MDP

State \ Action   left fast   left slow   idle     right slow   right fast
St − 4           0.0000      0.0000      0.0000   0.0000       0.0000
St − 3           0.0000      0.0000      0.0000   0.0000       0.0000
St − 2           0.0000      0.0000      0.0000   0.0000       0.2500
St − 1           0.0000      0.0000      0.0000   1.0000       0.7500
St               0.0000      0.0000      1.0000   0.0000       0.0000
St + 1           0.7500      1.0000      0.0000   0.0000       0.0000
St + 2           0.2500      0.0000      0.0000   0.0000       0.0000
St + 3           0.0000      0.0000      0.0000   0.0000       0.0000
St + 4           0.0000      0.0000      0.0000   0.0000       0.0000
St + 5           0.0000      0.0000      0.0000   0.0000       0.0000


References

[1] R. Bellman. Dynamic programming. Princeton Univ. Press, 1957.

[2] Richard Bellman. A Markovian decision process. Technical report, DTIC Document, 1957.

[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[4] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[5] Joost Broekens, Marcel Heerink, and Henk Rosendal. Assistive social robots in elderly care: a review. Gerontechnology, 8(2):94–103, 2009.

[6] Anthony Rocco Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, Brown University, 1998.

[7] Hsien-Te Cheng. Algorithms for partially observable Markov decision processes. PhD thesis, University of British Columbia, 1988.

[8] Francois Chollet. Keras. https://github.com/fchollet/keras, 2015.

[9] Marco Cristani, Loris Bazzani, Giulia Paggetti, Andrea Fossati, Alessio Del Bue, Gloria Menegaz, and Vittorio Murino. Social interaction discovery by statistical analysis of F-formations.

[10] Joao Vicente Teixeira de Sousa Messias. Decision-Making under Uncertainty for Real Robot Teams. PhD thesis, Instituto Superior Tecnico, 2014.

[11] Nestoras Foustanas. Helping elderly users control a telepresence robot with a touch screen, August 2015.

[12] Edward Twitchell Hall. The hidden dimension. 1966.

[13] John A. Hartigan and Manchek A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.

[14] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


[15] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.

[16] Adam Kendon. Conducting interaction: Patterns of behavior in focused encounters, volume 7. CUP Archive, 1990.

[17] Rachel Kirby, Reid Simmons, and Jodi Forlizzi. Companion: A constraint-optimizing method for person-acceptable navigation. In The 18th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2009), pages 607–612. IEEE, 2009.

[18] S. Kohlbrecher, J. Meyer, O. von Stryk, and U. Klingauf. A flexible and scalable SLAM system with full 3D motion estimation. In Proc. IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR). IEEE, November 2011.

[19] Timm Linder and Kai O. Arras. People detection, tracking and visualization using ROS on a mobile service robot. In Robot Operating System (ROS), pages 187–213. Springer, 2016.

[20] Timm Linder, Stefan Breuers, Bastian Leibe, and Kai O. Arras. On multi-modal people tracking from mobile platforms in very crowded and dynamic environments.

[21] Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1):5–34, 2003.

[22] George E. Monahan. State of the art - a survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1):1–16, 1982.

[23] Noe Perez-Higueras, Rafael Ramon-Vigo, Fernando Caballero, and Luis Merino. Robot local navigation with learned social cost functions. In Informatics in Control, Automation and Robotics (ICINCO), 2014 11th International Conference on, volume 2, pages 618–625. IEEE, 2014.

[24] Joelle Pineau, Geoff Gordon, Sebastian Thrun, et al. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.

[25] Morgan Quigley, Brian Gerkey, Ken Conley, Josh Faust, Tully Foote, Jeremy Leibs, Eric Berger, Rob Wheeler, and Andrew Ng. ROS: an open-source Robot Operating System.


[26] Diederik M. Roijers. Multi-objective decision-theoretic planning. PhD thesis, University of Amsterdam, 2016.

[27] Selma Sabanovic, Marek P. Michalowski, and Reid Simmons. Robots in the wild: Observing human-robot social interaction outside the lab. In Advanced Motion Control, 2006, 9th IEEE International Workshop on, pages 596–601. IEEE, 2006.

[28] Tamie Salter, Kerstin Dautenhahn, and R. Bockhorst. Robots moving out of the laboratory - detecting interaction levels and human contact in noisy school environments. In Robot and Human Interactive Communication, 2004 (ROMAN 2004), 13th IEEE International Workshop on, pages 563–568. IEEE, 2004.

[29] Satoru Satake, Takayuki Kanda, Dylan F. Glas, Michita Imai, Hiroshi Ishiguro, and Norihiro Hagita. How to approach humans? Strategies for social robots to initiate interaction. In Human-Robot Interaction (HRI), 2009 4th ACM/IEEE International Conference on, pages 109–116. IEEE, 2009.

[30] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.

[31] Cory-Ann Smarr, Akanksha Prakash, Jenay M. Beer, Tracy L. Mitzner, Charles C. Kemp, and Wendy A. Rogers. Older adults' preferences for and acceptance of robot assistance for everyday living tasks. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 56, pages 153–157. SAGE Publications, 2012.

[32] Matthijs T. J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.

[33] Leila Takayama and Caroline Pantofaru. Influences on proxemic behaviors in human-robot interaction. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5495–5502. IEEE, 2009.

[34] TERESA. Deliverable D3.1: Normative Behavior Report. Project funded by the EU through grant ref. FP7-ICT-2013-10-611153. Available online at http://teresaproject.eu/wp-content/uploads/2015/03/D3_1_NormativeBehaviourReport_TERESA.pdf.


[35] TERESA. Deliverable D5.2: Action Library. Project funded by the EU through grant ref. FP7-ICT-2013-10-611153. Available online at http://teresaproject.eu/wp-content/uploads/2015/08/D5_2_Action_Library_TERESA.pdf.

[36] Sebastian Thrun. Monte Carlo POMDPs. In NIPS, volume 12, pages 1064–1070, 1999.

[37] Robert Walecki, Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Variable-state latent conditional random fields for facial expression recognition and action unit detection. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–8. IEEE, 2015.

[38] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[39] Daan Wierstra, Alexander Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory POMDPs with recurrent policy gradients. In International Conference on Artificial Neural Networks, pages 697–706. Springer, 2007.

[40] James E. Young, Richard Hawkins, Ehud Sharlin, and Takeo Igarashi. Toward acceptable domestic robots: Applying insights from social psychology. International Journal of Social Robotics, 1(1):95–108, 2009.
