
Spoken Interaction with Synthetic Entities

Artur David Felix Ventura

Dissertação para obtenção do Grau de Mestre em Engenharia Informática e de Computadores

Júri

Presidente: Doutora Maria dos Remédios Vaz Pereira Lopes Cravo

Orientador: Doutor David Martins de Matos

Vogais: Doutora Ana Maria Severino de Almeida e Paiva

Outubro 2010

Acknowledgements

First of all, I would like to express my thanks to my supervisor, Professor David Matos. Thank you for the guidance, the support, and for allowing me the freedom to research a field that is dear to me.

I would like to express my gratitude to Nuno Dieges for the help during this thesis. Thank you for going above and beyond what was asked of you.

I would also like to thank all my friends who helped me during my course, especially those who did not let me go crazy during this year: Carlos Domingos, without whom I would not be here today; Rita Pita, for being there when I needed her; and Leonor Pinto, for not letting me panic more than once. Thank you for all the support and company.

Last but not least, I would like to dedicate this dissertation to my family: my big sister, who helped me a lot during my course, and my parents, who supported me during this endeavor and without whom I could not have come so far. Your effort was not in vain.

Thanks for everything.

Lisboa, November 26, 2010
Artur David Felix Ventura

l’enfer, c’est les autres

-Jean-Paul Sartre

Resumo

A forma como os utilizadores interagem com as máquinas tem vindo a alterar-se ao longo dos anos, devido em parte aos avanços nas capacidades das máquinas. Uma das formas interessantes de comunicarmos com sistemas sintéticos é através da utilização de fala.

No entanto, a construção de sistemas interactivos depende de dois aspectos importantes. Primeiro, é necessário compreender de que forma a fala pode ser processada num determinado contexto. Por outro lado, estes sistemas existem no nosso mundo e devem poder desempenhar tarefas nele. Deste modo, para a criação de um sistema interactivo é necessário combinar competências de dois tipos de sistemas: sistemas de diálogo e sistemas de agentes.

Esta tese apresenta uma arquitectura para a criação de sistemas interactivos que combinam processamento de língua natural com sistemas de agentes. Apresentamos também um modelo mental que permite associar informação linguística com informação do mundo. Além disso, esta arquitectura está ligada ao ambiente físico através de um robot.

De forma a demonstrar a arquitectura apresentada, apresentamos um protótipo funcional, capaz de procurar objectos físicos numa cena.

Avaliámos este protótipo com um cenário simples, que permitiu compreender se o comportamento deste sistema é transmitido ao utilizador através de alterações emocionais.

Abstract

The way users interact with machines has been changing over the years, due in part to advances in the capabilities of the machines. One of the interesting ways to communicate with synthetic systems is through the use of speech.

However, the construction of interactive systems depends on two important aspects. First, it is necessary to understand how speech can be processed in a given context. Second, these systems exist in our world and must perform tasks in it. Thus, it is necessary to combine the competencies of two types of systems: Dialogue Systems and Agent Systems.

This thesis presents an architecture for building interactive systems that combine natural language processing with agent systems. We also present a mental model that allows the association between linguistic information and world information. Moreover, the architecture is linked to the physical environment through a robotic body.

In order to demonstrate the proposed architecture, we present a functional prototype which is able to interact through speech and is capable of searching for physical objects in a scene.

We evaluate this prototype with a simple scenario that allows us to understand whether the user can perceive emotional changes through the system's behavior.


Palavras Chave

Sistemas de Diálogo

Sistemas de Agentes

Interpretação

Interacção Verbal

Keywords

Dialog Systems

Agent Systems

Interpretation

Spoken Interaction

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Dialogue Systems
    1.1.2 Agent Systems
  1.2 The Problem
  1.3 Objectives
  1.4 Document Structure

2 Related Work
  2.1 Dialogue Systems
    2.1.1 TRIPS
    2.1.2 DIGA
    2.1.3 Galatea
    2.1.4 Olympus
  2.2 Agent Systems
    2.2.1 Greta
    2.2.2 CoSy
    2.2.3 FAtiMA
  2.3 Discussion

3 Interpretation and Generation
  3.1 Model
    3.1.1 Linguistic Information
    3.1.2 Knowledge Organization
    3.1.3 Language and Knowledge Bridge
  3.2 Interpretation
    3.2.1 Generating Meaning Combinations
    3.2.2 Matching Meaning with Valid Actions
  3.3 Generation

4 System Architecture
  4.1 Input
    4.1.1 Linguistic Input
    4.1.2 Visual Input
  4.2 Decision Making
  4.3 Memory
  4.4 Output
    4.4.1 Speaking
    4.4.2 Movement
  4.5 The Complete Architecture

5 Implementation
  5.1 Mind and Competence Manager
  5.2 Competences
    5.2.1 Input Competence
    5.2.2 Ontology Loading
    5.2.3 Movement and Robot Sensory Competence
    5.2.4 Speech Competence
  5.3 Comments

6 Evaluation
  6.1 Scenario
  6.2 Methodology
    6.2.1 Objective
    6.2.2 Population
    6.2.3 Materials
    6.2.4 Procedure
  6.3 Results
    6.3.1 Change on the emotional state of the agent
    6.3.2 Emotional expressiveness
    6.3.3 Final Remarks

7 Conclusions & Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 Advanced speech capabilities
    7.2.2 Agent to agent communication
    7.2.3 Adding new behavior by teaching the system
    7.2.4 Supporting Multi-Language
    7.2.5 Error Detection

List of Figures

1.1 Agent structure

2.1 TRIPS architecture (reproduced from [3])
2.2 DIGA Architecture (reproduced from [17])
2.3 Galatea architecture (reproduced from [13])
2.4 Olympus architecture (reproduced from [6])
2.5 Greta Architecture (reproduced from [20])
2.6 PlayMate architecture, one of the scenarios produced from the CoSy Project (reproduced from [7])
2.7 FAtiMA Architecture (this and subsequent image was adapted from [8])
2.8 Action Tendency

3.1 Relations between Verb, Sense, FrameSet, Frame and α-structure
3.2 Ontological and linguistic representation of the relationship between "Ball" and "Object"
3.3 Interpret algorithm
3.4 Combinations algorithm
3.5 Meanings algorithm

4.1 LIREC's Architecture
4.2 Input natural language processing chain
4.3 Frame Instantiation from a Tagged Syntactic Tree
4.4 Natural Language Generation processing chain
4.5 Our adaptation of LIREC's Architecture

5.1 Our implementation architecture
5.2 CMION main window, awaiting FAtiMA
5.3 FAtiMA main window, showing the knowledge base
5.4 A sample of our ontology description
5.5 Our physical body, Rovio
5.6 The circle detection output

6.1 The set used for evaluation
6.2 Emotional Detection by Age on the First Situation
6.3 Emotional Detection by Age on the Second Situation
6.4 Emotional Detection by Gender on the First Situation
6.5 Emotional Detection by Gender on the Second Situation
6.6 Emotional Detection by Age and Gender on the First Situation
6.7 Emotional Detection by Age and Gender on the Second Situation
6.8 Overall Emotional Detection on both scenarios

List of Tables

2.1 Dialogue Systems
2.2 Comparison between components of Agent Systems and Dialogue Systems

6.1 Age distribution for the first scenario inquiry
6.2 Age distribution for the second scenario inquiry

Chapter 1

Introduction

1.1 Motivation

With advances in robotics, we now foresee a future in which we will have robots as companions in our daily life. Some of these advances include Unmanned Aerial Vehicles (UAVs) deployed in Iraq and Afghanistan; Boston Dynamics' dynamically stable quadruped robot BigDog, which can mimic animal movement and is used in the field to carry equipment; and iRobot's products, which range from Roomba, an autonomous cleaning robot, to PackBot, a drone designed for hostile situations.

However, despite all these advances, we still cannot interact with such machines in the natural way we deal with animals or humans. As we come closer to creating robots that can assist us in our daily life, we will need to find ways of interacting with them verbally.

The focus of this thesis is on creating an interactive system connected to a physical entity (a robot), so that it is possible to interact with it verbally and have it perform tasks in our world.

Using voice to control machines is not new: even the first personal computers had sufficient computing power to recognize simple dialogue and synthesize rudimentary speech. Nonetheless, we increasingly see this form of interaction being used with cars (allowing the driver to concentrate on the road), homes, telephony services, and even news casting.

1.1.1 Dialogue Systems

Dialogue systems are systems controlled by voice interaction. They provide a way to recognize, process, act upon, and respond to human spoken interaction. These systems have been evolving for some time, but the majority of them share the same basic structure. Normally, a dialogue system architecture is composed of three modules (a minimal sketch follows the list):

• Input: Composed of speech recognition and natural language processing components. It receives a stream of audio and generates a representation that can be interpreted by the dialogue manager.

• Dialogue Manager: Composed of components that maintain the dialogue state, plan the dialogue progression, maintain a domain representation, etc. It receives a structure representing what was said and generates a structure representing what it intends to say.

• Output: Composed of natural language generators and speech synthesizers, it is responsible for generating language that expresses what the system wants to say.
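To make this division concrete, the following minimal Python sketch wires the three modules together. All class and method names, as well as the stubbed values, are our own illustrative assumptions rather than part of any of the systems discussed later.

    class InputModule:
        """Speech recognition plus natural language processing (stubbed)."""
        def process(self, audio_stream):
            text = "find the blue ball"                     # stand-in for the ASR result
            return {"verb": "find", "object": "blue ball"}  # stand-in for the parsed meaning

    class DialogueManager:
        """Maintains the dialogue state and decides what to say next."""
        def __init__(self):
            self.history = []
        def react(self, meaning):
            self.history.append(meaning)
            return {"act": "confirm", "content": meaning["object"]}

    class OutputModule:
        """Natural language generation plus speech synthesis (stubbed to text)."""
        def render(self, dialogue_act):
            return "Ok, I will look for the " + dialogue_act["content"] + "."

    # One turn through the pipeline: audio -> meaning -> dialogue act -> surface form.
    reply = OutputModule().render(DialogueManager().react(InputModule().process(audio_stream=None)))
    print(reply)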


Dialogue systems typically work on specific domains with a set of restricted tasks. This simplifies both natural language processing (NLP) and dialogue planning. However, small, restricted domains become a problem when interacting with complex environments. A robot roving an environment needs to maintain a world model of where it is; that model needs to store a complex set of entities, and it may be in constant change. This type of model is called an Open Domain model: a set of entities and relations that spans several themes. Due to the interaction with an external world, these domains can grow and change over time. An entity that interacts intimately with our world requires a knowledge base capable of representing this kind of domain. It is also important to notice that to roam a world as complex as ours it is necessary to know how to interact and act in non-verbal ways (e.g. motion). Unfortunately, the main focus of current dialogue systems is to advance the state of the art by attacking problems directly related to natural language processing and generation, discourse context, and so on. Even with state-of-the-art dialogue systems, the difficulty is therefore not how to recognize or generate speech, but how to perceive the meaning of what is said and to plan what is going to be said.

1.1.2 Agent Systems

There is another class of systems that focuses on mimicking human behavior. Agent systems [22] provide the software backbone for synthetic characters. These systems react to stimuli from the environment, normally called perceptions, producing actions that affect that environment, and handle tasks like planning, reasoning, and acting. They are normally used to design agents that live in a world (real or virtual). They also maintain complex representations of world knowledge.


Figure 1.1: Agent structure

Agent systems are normally divided according to how they calculate the next action and what information they use. Reactive agent systems [26] use external information and execute the rule that matches the current world state. This kind of agent normally contains simple rules. In spite of this simplicity, complex behavior may arise through emergence. Because these systems do not have world models, they need to receive enough information from the world to perform their tasks. Moreover, it is difficult to design reactive agents that can learn from experience.

Another kind of agent system is the Deliberative Agent System [27, 24]. These systems contain a representation of the world, over which decisions are made. It is also common to find agent systems that combine both reactive and deliberative aspects; these are called Hybrid Agent Systems. They attempt to take the best of the two approaches: the speed of finding a valid action, from the reactive side, and the use of complex plans, from deliberative systems.
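The difference between the two layers of a hybrid agent can be sketched as follows; the rules, the toy planner and all names are illustrative assumptions, not taken from any of the cited systems.

    def reactive_rules():
        # Each rule pairs a condition on the perceived world state with an action.
        return [
            (lambda world: world.get("obstacle_ahead"), "turn_left"),
            (lambda world: world.get("battery_low"), "return_to_dock"),
        ]

    def deliberate(world, goal):
        # Placeholder for a planner that reasons over an internal world model.
        return ["locate(" + goal + ")", "approach(" + goal + ")"]

    def next_action(world, goal):
        for condition, action in reactive_rules():
            if condition(world):                 # reactive layer: fast rule matching
                return action
        return deliberate(world, goal)[0]        # deliberative layer: first step of a plan

    print(next_action({"obstacle_ahead": True}, "blue_ball"))   # -> turn_left
    print(next_action({"obstacle_ahead": False}, "blue_ball"))  # -> locate(blue_ball)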

Most of these systems are normally concerned with acting on a given environment and lack the advanced linguistic knowledge and processing abilities of a dialogue system.

Imagine a system composed of a robot that can detect colored balls in a scene and is able to receive commands verbally. If we were to ask such a system "Rose, find a blue ball", a typical dialogue system could parse the phrase, detect the verb and its arguments, create a syntactic structure representing the phrase and even associate semantic knowledge with each of the words.

Yet it is unlikely that a dialogue system would have the path planning or object recognition capabilities required to act on such a command. On the other hand, an agent system might have those capabilities; however, without some kind of natural language interactivity, we would not be able to tell it verbally what to do.
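As a hedged illustration of this gap, the dialogue side can produce a parsed command, but acting on it requires an agent-side capability. The structures, the capability table and all names below are purely illustrative.

    parsed = {"verb": "find", "args": {"object": "ball", "color": "blue"}}   # dialogue side output

    capabilities = {   # agent side: verbs the robot could actually execute (assumed, for illustration)
        "find": lambda args: "searching the scene for a " + args["color"] + " " + args["object"],
    }

    handler = capabilities.get(parsed["verb"])
    print(handler(parsed["args"]) if handler else "I do not know how to do that.")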

Therefore, in order to achieve more natural communication with the agents that surround us, it is important to develop a system that combines both dialogue and agent features.

1.2 The Problem

Creating interactive systems that combine both dialogue and agent system features leads to complex and interesting problems. Dialogue systems are designed to tackle problems related to discourse, and typically lack emotional states. Agent systems react to stimuli from the environment and produce actions that affect that environment; furthermore, these systems live in a place, virtual or real.

First, how can we reconcile an agent's perception with its linguistic input? Moreover, how can the agent's knowledge base aid and enhance the interpretation of a speech act? It is also important to find out whether the perception input and the dialogue system's natural language processing chain have common components.

The two kinds of system hold different knowledge: a dialogue system has linguistic knowledge that needs to be associated with the agent's world representation. It is important to understand how we can consolidate both kinds of information.

1.3 Objectives

This thesis presents an integrated architecture for an interactive system combining features of both dialogue and agent systems. This architecture combines dialogue abilities with agent systems' expressive models of the world. This system attempts to fulfill the following objectives:

Integration The cognitive structures of dialogue systems are designed to handle essential discourse elements, such as the dialogue state, what is going to be said next, and which entities have been mentioned. Structures from agent systems handle other kinds of elements, such as actions, intentions, plans and goals. One objective of this work is to present an architecture that combines functionality from both systems in an integrated interactive system.

Knowledge Sharing Dialogue systems have components that mirror components of agent systems (e.g. knowledge bases), although they handle different information. It is important to understand how information from one system can be linked to the other. Unifying both kinds of information allows the system to understand, or speak about, the context of its current understanding of the world. Another objective of this work is to develop a memory model that supports storing and retrieving information annotated with linguistic information. Such a model must also be integrated with natural language processing and generation.

Convey Emotional Changes Some of the agent architectures presented in Chapter 2 contain emotional states. These states allow the agent to simulate an affective stance towards the world that surrounds it. Such information can be used to convey to the user the status of the current task. The last objective of this work is to develop an interactive system that has an emotional state affected by external stimuli, dialogue being one of these stimuli.

1.4 Document Structure

This thesis proposes an architecture that combines agent and dialogue features in a single interactive system. We use an existing architecture as the basis for our work, and we have extended its functionality by adding natural language processing/generation and linguistically annotated knowledge.

This document describes the design of the interactive system architecture and its subsequent instantiation in a prototype, and is organized as follows:

• Chapter 2 (Related Work) gives an overview of the current state of the art in dialogue and agent-based systems. We also discuss how both kinds of systems can be integrated.

• Chapter 4 (System Architecture) presents our proposed architecture and explains how it can be extended to support natural language processing.

• Chapter 3 (Conceptual Model) introduces our cognition system and how we can integrate both ontological and linguistic information.

• Chapter 5 (Implementation) presents the implementation choices made to achieve our proposed goals.

• Chapter 6 (Evaluation) describes the evaluation performed to assess whether the proposed solution achieves our goals.

• Chapter 7 (Conclusions) presents an overview of what has been done and discusses future directions.


Chapter 2

Related Work

We first present an overview of related work on both dialogue and agent systems. It is important to understand how the two kinds of systems can be integrated by analyzing the current state of the art. Work related to this thesis is scattered across fields like natural language processing, cognitive systems, agent systems and synthetic characters.

2.1 Dialogue Systems

2.1.1 TRIPS

TRIPS (The Rochester Interactive Planning System) [4, 3, 2, 11] is considered a classic architecture in the dialogue systems field. The architecture is divided into three sections, Interpretation, Generation and Behavior, as can be seen in Figure 2.1.

The Interpretation section is responsible for transforming speech into a structure that contains the meaning of the speech act. The Interpretation Manager controls this section. It produces text from the Automatic Speech Recognition output (with possible help from the Task Manager module) and updates the Discourse Context module with information from the Task Manager module. The Discourse Context module is responsible for maintaining a representation of the discourse at each moment and is consulted by both the Interpretation and Generation sections. It maintains a model of the mentioned entities, a structure and interpretation of the last utterance, information about which party is going to talk next, the discourse history and the discourse obligations. Another module, the Reference module, handles knowledge and stores the named entities in the system.

The Task Manager module maintains a model of the domain objects. This model can answer questions about domain objects and works as an interface between the plans from the Behavior Agent and the Planner/Scheduler. The Generation section transforms dialogue intentions into speech and is managed by the Generation Manager. This module receives discourse intentions from the Behavior Agent and discourse obligations from the Discourse Context, and produces the content planning, which is then given to the speech synthesizer.

The last section, Behavior, is responsible for problem solving. It is controlled by the Behavior Agent and reacts to user utterances and events, besides managing objectives and current obligations.

2.1.2 DIGA

DIGA (DialoGue Assistance) [17, 18, 12] is a dialogue system based on TRIPS and, like it, is domain independent and task oriented. The two systems therefore have similar architectures.


Figure 2.1: TRIPS architecture (reproduced from [3])

DIGA is composed of three modules, the Input and Output Manager, the Dialogue Manager and the Service Manager, as seen in Figure 2.2.

The Input and Output Manager is composed of a set of modules that manage the transformation of speech into text and vice-versa. This manager can be seen as a concatenation of TRIPS's Interpretation and Generation sections. In addition, it has the ability to synthesize facial expressions on a virtual face.

In the Dialogue Manager, a set of modules interprets the text given as input, generates the text to be synthesized and maintains the current dialogue state.

The dialogue received by the Input and Output Manager is sent to the Parser, which attempts to identify relevant objects in the utterance. The Parser uses the Interpretation Manager, which constructs an interpretation with information from the Discourse Context and the domain. The Discourse Context is a module that maintains the discourse history and the active discourse obligations.

If the system already has all the necessary information, the correct service is called through the Task Manager. If not, discourse obligations are sent to the Behavioral Agent. This module is responsible for sending the obligations to the Generation Manager, which activates them and then sends them to the Surface Generator. The Surface Generator attempts to find a generation rule in the domain and sends the resulting text to be synthesized by the Input and Output Manager.

2.1.3 Galatea

Galatea [13] is a toolkit for developing dialogue systems that mimic human behavior. Its architecture is composed of the following modules: Speech Recognition, Speech Synthesizer, Facial Animation Synthesizer, Agent Manager and Task Manager. A graphical representation of the architecture is presented in Figure 2.3.


Figure 2.2: DIGA Architecture (reproduced from [17])

The Speech Recognizer module was designed to allow grammar processing to be restarted while the system is running; it is also capable of producing the N-best results for a given utterance. The Speech Synthesizer module receives text with pronunciation tags and produces speech. This module also has to be in sync with the Facial Animation module; to achieve this, information about the phoneme being uttered at each instant is shared between the two modules. The Task Manager module manages the communication tasks. Besides storing the dialogue description, the "Tasks" also store facial expressions for the Facial Animation module.

2.1.4 Olympus

Olympus [6] is a framework for the creation of dialogue systems. Its architecture is very similar to the ones previously shown and is composed of six modules, as presented in Figure 2.4.

The speech recognizer receives a stream of audio, which is passed to a set of recognition engines, each of which selects its best hypothesis. These hypotheses are passed to the natural language understanding component, which produces for each one a structure containing the concepts extracted from the last utterance.

Next, each hypothesis is passed to a confidence tagger, which associates a confidence level with it. This value reflects the probability of the utterance having been correctly understood.

Using the hypothesis with the best confidence level, the Dialog Manager interprets it in the current dialogue context and chooses a task to perform next. Tasks are represented in a tree, where each leaf is an action.

The discourse obligations are sent to the Natural Language Generation component. This uses a template-based system that produces a surface form, which is finally sent to the Speech Synthesizer.


Figure 2.3: Galatea architecture (reproduced from [13])


Figure 2.4: Olympus architecture (reproduced from [6])


2.2 Agent Systems

2.2.1 Greta

Greta [20, 21, 12] is a conversational agent capable of showing human-like behavior. It was used as a medical assistant capable of talking with both patients and doctors, using two different dictionaries. It was also designed with a personality, emotional reactions and social intelligence.

The system simulates dialogue in a question/answering style and uses an architecture based on a natural language generation pipeline composed of four components: the agent's Mind, the Dialogue Manager, a plan enricher (Midas) and a generator of the agent's body, as described in Figure 2.5.


Figure 2.5: Greta Architecture (reproduced from [20])

When the dialogue starts, a dialogue goal in a particular domain is set and passed to the Dialogue Manager. From this goal, a discourse plan is produced for the agent. This plan represents the way the agent will try to communicate. The Mind also influences how the goal is achieved.

Once the Dialogue Manager selects the output, the Mind is asked for emotional states to enrich the dialogue. The plan enricher then adds tags that allow the synchronization of the audio and video streams. Finally, the result is passed to the body generator, which presents the dialogue.

2.2.2 CoSy

CoSy (Cognitive Systems for Cognitive Assistants) [7] was a European project that aimed to construct systems that could process, understand and interact with their environment. It was a complex project that produced advances in planning, vision, spatial cognition and learning.

The architecture is composed of distributed subcomponents (also called subarchitectures, SAs). Each of these is composed of a working memory and a task manager. A component's working memory is connected to, and exchanges information with, the working memories of other sub-architectures. The task manager controls each component of its sub-architecture. Task managers are also connected to other sub-architectures, allowing control strategies to be coordinated across the entire architecture.

Figure 2.6 shows the seven modules of the PlayMate robot, one of the scenarios of the project. These are concerned with visual processing (Visual SA), communication (ComSys SA), spatial representation and reasoning (Spatial SA), manipulation (Manipulation SA, which includes relevant visual processing), ontological representations and reasoning (CoMa SA), binding of information between modalities (Binding SA), and the control of motivation and planning (Motivation and Planning SA).


Figure 2.6: PlayMate architecture, one of the scenarios produced from the CoSy Project (reproducedfrom [7])

When a new utterance arrives, a speech recognizer transforms it into text. Next, an incremental parser associates semantics with it in a modal logic. The result is a forest of trees, each representing a logical form of a single utterance. This information is then passed to the motivational sub-architecture, which creates indexical content (information about the entities referred to) and intentional content (information about the purpose of the utterance). This content is then used by the planning SA to raise goals when needed.

2.2.3 FAtiMA

The FAtiMA architecture (FearNot Affective Mind Architecture) [8, 9] is an emotion-based agent architecture designed for the development of the agents in the FearNot! project.

The architecture is composed of four modules: Emotional State, Appraisal, Coping and Memory (see Figure 2.7).

The Appraisal component receives events caused by the environment. These events are then handled by the Reactive or the Deliberative Appraisal layer.

The Reactive Appraisal layer consists of a set of predefined emotional reactions. When a new event is detected, the reactive appraisal tries to match the event against one of its rules. If a rule matches, the values of the relevant appraisal variables are calculated and used to determine the emotions that are created.


Figure 2.7: FAtiMA Architecture (this and subsequent image was adapted from [8])

The Deliberative Appraisal layer is used to predict the possible effects of actions. Each time the agent receives a new perception, it checks whether any inactive goals have become active. If so, these are added to the intentions, also known as the agent's pending goals. These goals use emotions to describe the probability of success or failure. Also, when a goal succeeds or fails, it may give rise to emotions such as despair, satisfaction and relief.

Emotions are kept in the Emotional State module. This module receives the emotions created by the appraisal components and processes them based on rules. Since emotions decay over time, they need to be processed continuously.

The Coping module generates actions that respond to the emotional state. This module is also composed of two layers, one reactive and one deliberative. The reactive layer is responsible for selecting Action Tendencies, which are unintentional reactions to emotional states. An example of an Action Tendency is starting to cry in a desperate situation. Figure 2.8 describes this example.

Action Rule
    Action: Cry
    Preconditions: ---
    ElicitingEmotion:
        Type: Distress
        MinIntensity: 5
        CauseEvent:
            Subject: --
            Action: Push
            Target: SELF

Figure 2.8: Action Tendency
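As a hedged illustration, a rule like the one in Figure 2.8 could be matched against the current emotional state roughly as in the following Python sketch; the dictionary layout and threshold handling are our own simplification, not FAtiMA's actual implementation.

    action_tendency = {                  # mirrors the fields shown in Figure 2.8
        "action": "Cry",
        "eliciting_emotion": {"type": "Distress", "min_intensity": 5},
    }

    def select_action_tendency(rules, emotional_state):
        # emotional_state maps emotion types to their current intensities.
        for rule in rules:
            trigger = rule["eliciting_emotion"]
            if emotional_state.get(trigger["type"], 0) >= trigger["min_intensity"]:
                return rule["action"]    # the unintentional reaction fires
        return None

    print(select_action_tendency([action_tendency], {"Distress": 7}))   # -> Cry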

The deliberative component consists of a planner that creates plans to be followed by the agent. The plans are generated from the strongest intentions of the agent. This component can also affect the Emotional State, through a reappraisal phenomenon.


The Memory module comprises two components: a Knowledge Base, which holds the semantic knowledge about the world, and an Autobiographical Memory, which represents a set of independent episodes. Each episode consists of a set of actions or events that happened in the same place at the same time.
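The two memory components can be pictured with a small sketch; the concrete representation below (tuples and dictionaries) is our own assumption for illustration, not FAtiMA's actual data format.

    # Semantic knowledge about the world: property/relation facts.
    knowledge_base = {
        ("ball1", "is_a"): "Ball",
        ("ball1", "color"): "blue",
    }

    # Autobiographical memory: independent episodes, each grouping events
    # that happened in the same place at the same time.
    autobiographical_memory = [
        {"place": "living_room", "time": "t1",
         "events": [("user", "ask", "find the blue ball"),
                    ("agent", "move", "forward")]},
    ]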

2.3 Discussion

Although the internal division of components differs across dialogue systems, they can all be divided into three common component types: input components, output components and dialogue management.

• The input components are all the components that process the speech and/or text and return a structure that can be interpreted by the dialogue system.

• The output components are all the components in the path from a structure representing what will be said down to the speech synthesizer.

• The dialogue management components are all the components that maintain the state of the dialogue, interpret the meaning of speech acts, plan the dialogue, etc.

Table 2.1 presents a division of the previously described dialogue systems into these components.

         Input                               Output                                   Dialogue Management
TRIPS    Interpretation Section              Generation Section                       Behavioral Agent, Task Manager, Planner, Discourse Context
DIGA     Input/Output Management             Input/Output Management                  Dialogue Management (Interaction Manager, Dialogue Context, Task Manager, Behaviour Agent)
Galatea  Speech Recognition                  Facial Synthesizer, Speech Synthesizer   Agent Management, Task Management
Olympus  Speech Recognition, Parser, Tagger  Generation, Speech Synthesizer           Dialogue Management

Table 2.1: Dialogue Systems

As seen above, agent systems are divided into sensors, actuators and a module that makes decisions. Relationships between agent systems and dialogue systems can be found. If we consider that speech can be seen as a stimulus to the system, we can see the input components as sensors and the output components as actuators. Although decision making and dialogue management share similarities, they need to be examined in detail, particularly with regard to the information that each component uses.

Dialogue systems have a world model to deal with. In the case of TRIPS, this mapping is divided between the Discourse Context and the Reference. These components store dialogue information about entities (which entities were mentioned, who is going to speak next). In contrast, agent systems have richer mappings that are continuously updated with the arrival of new stimuli. In FAtiMA's case, there are two types of memory: the Knowledge Base has information about the semantics of the world and the Autobiographical Memory contains the set of actions performed. There are, however, some similarities between the memory models of the two types of systems. The Dialogue Context and the Autobiographical Memory both represent the present time (the Autobiographical Memory goes further and maintains a history of action sets which were run at similar times and places). The Reference only keeps the entities mentioned in the dialogue. Although simple, it is similar to the Knowledge Base, since it keeps long-term information.

Moreover, both kinds of system have to make decisions. These tasks are carried out by planners and by the Task Manager. In dialogue systems, what is planned is the progression of the discourse, while in agent systems it is the actions. It is also important to note that all speech can be described through two actions, "Say" and "Hear". With this equivalence, we obtain a direct correspondence between dialogue planning and action planning.
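A minimal sketch of this equivalence, with operator names and fields that are our own illustration: once "Say" and "Hear" are written as ordinary planning operators, the same planner that sequences physical actions can also sequence the dialogue.

    # Speech modeled as two planner operators alongside a physical action.
    operators = {
        "Hear": {"preconditions": [],                         "effects": ["has_obligation(answer)"]},
        "Say":  {"preconditions": ["has_obligation(answer)"], "effects": ["informed(user)"]},
        "Move": {"preconditions": ["knows_target"],           "effects": ["at(target)"]},
    }

    def applicable(operator, state):
        return all(p in state for p in operators[operator]["preconditions"])

    print(applicable("Say", {"has_obligation(answer)"}))   # -> True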

Finally, it remains to interpret the input data. This is carried out by the Behavioral Agent and in part by the Interpretation layer. The agent system uses the deliberative and reactive layers to gain an understanding of a stimulus. In the reactive layers, understanding is done through rule matching, while in the deliberative layer it is done by inspecting the changes to the internal world model caused by that stimulus. Table 2.2 shows a comparison of the components of both kinds of system.

                   Dialogue Systems                      Agent Systems
Input              Recognizer + NLU                      Sensors
Output             Generators and Synthesizers           Actuators
Short-term Memory  Dialogue Context                      Autobiographical Memory (FAtiMA)
Long-term Memory   Reference                             Knowledge Base
Planning           Task Manager + Planner + Scheduler    Planner (if it has a deliberative layer)
Introspection      Interpretation and Behavioral Agent   Reactive and Deliberative Layers

Table 2.2: Comparison between components of Agent Systems and Dialogue Systems

With this information, it is possible to create an interactive system that combines both types of systems. This architecture is presented in Chapter 4.


Chapter 3

Interpretation and Generation

We start the presentation of our system by describing its memory. One of the key aspects of our work is how we combine the agent's world knowledge with linguistic information. As described in Section 2.2.3, FAtiMA's original memory model comprises two components, the Knowledge Base and the Autobiographical Memory. The Knowledge Base contains semantic knowledge about the world in a form reminiscent of first-order logic. It also contains a grounding mechanism that returns a set of possible bindings for the unbound variables of a given formula. This makes it possible to store properties, facts, relations, etc. in memory. Our system's memory extends this component by creating a two-layer system, one layer comprising a world representation described as an ontology and the other the linguistic annotations. This representation is inspired by EuroWordNet [25]. We present our model in Section 3.1. We also show how to obtain ontological concepts from dialogue utterances in Section 3.2.

3.1 Model

Our model leverages a simple ontological model combined with linguistic concepts, by annotating memory entities and relationships with language senses.

Our system is designed to operate in an open domain. As such, even with a small set of tasks, polysemy can be a problem. Suppose that an open-domain dialogue system has the capacity to find spherical objects in a given space and to buy items in on-line stores. Asking such a system to "find a blue ball" can have multiple senses (e.g. locate a spherical object of blue color, or acquire a Union musket ball from the American Civil War). Therefore our model must facilitate the differentiation of concepts with similar words but different senses.

3.1.1 Linguistic Information

The linguistic information in the system's memory can be multi-lingual. In each language, word senses can have semantic relations with other senses, such as synonyms, hypernyms, among others. These relations make each language a linguistic ontology. For this we used WordNet [19].

Also, each language possesses a grammar description: each verb in the language has an associated set of structural (NP VP NP) and semantic (Agent V Object) relations between itself and its arguments. We call this structure a Frame. Information for the Frames was obtained from [15]. Given that a verb can have multiple structural representations and senses, there must be an association between senses and frames, which we call a FrameSet. This association allows a semantic separation, for instance, between "to find" (acquiring) and "to find" (discovering). An example of these relations is presented in Figure 3.1.
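As a minimal sketch of how these relations could be represented, consider the following data model. The class and field names are illustrative assumptions and do not correspond to the prototype's actual code.

import java.util.List;

// Illustrative data model for the Verb/Sense/FrameSet/Frame relations of Figure 3.1.
// Names are hypothetical; the prototype's actual classes may differ.
class Sense {
    String senseKey;          // e.g. a WordNet-style sense key
    String language;          // "en", "pt", ...
}

class Frame {
    List<String> structure;   // structural pattern, e.g. ["NP", "V", "NP"]
    List<String> roles;       // semantic roles, e.g. ["Agent", "V", "Theme"]
}

class FrameSet {
    List<Sense> senses;       // verb senses covered by this FrameSet
    List<Frame> frames;       // structural realisations shared by those senses
}

class Verb {
    String lemma;             // e.g. "find"
    List<FrameSet> frameSets; // one FrameSet per group of senses and frames
}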

Each utterance given by the user is processed by a natural language processing chain. The result of this chain is a syntactic structure. Finally, this structure can also support associations between senses and its parts.

3.1.2 Knowledge Organization

In addition to linguistic information, the dialogue system contains information about the concepts it can reason about. Information in the system's memory may cover themes like physical objects, chromatic properties, location, geometric dimensions, etc. These entities will be matched against what the user said, in the process of obtaining a meaning for the sentence.

Furthermore, the memory contains a description of the system's components, competences, and the devices connected to it. As mentioned before, the system can interact through sensors and actuators in the agent's physical body (a robot). Each of these devices is described in the agent's mind, as well as the competencies that operate them, including a description of what each one is, what it does, and what it affects. This allows grounding expressions with the competences (and thereby the devices) that can interact with a given entity. It is even possible to speak about concepts for which the system does not have a formal definition. Consider, for example, a user who asks for an object not described in the agent's memory. In this scenario, the system would not be able to ground the linguistic concepts to entities in the described world. It would require a definition which, if provided by the user, would be evaluated and matched against known concepts in memory, probably grounding with some competences. After acquiring all the needed information, the system would be capable of combining a set of competencies that act cooperatively, based on the individual properties of the new definition, for that specific purpose.

Verb senses found during linguistic processing that are meaningful for the system (i.e. for which it is possible to create a plan) are also associated with execution strategies and restrictions. We call this association an α-structure, as detailed in Figure 3.1.

Figure 3.1: Relations between Verb, Sense, FrameSet, Frame and α-structure (each α-structure links senses to execution strategies, each with its restrictions).

The α-structure allows the separation of identical senses through the restrictions of the different execution strategies: even though "find something" may represent a single sense of searching for something, finding a physical object in a space and finding a person in a building are essentially two different tasks requiring different plans.

An execution strategy is an abstract plan that can serve different purposes, depending on the arguments it is called with. When the interpretation algorithm is executed, the α-structure will be associated with the utterance's verb. The restrictions allow choosing an execution strategy based on the verb arguments and the context. As an example, consider searching for either a ball or a rubber duck: it may be essentially the same task, but the sensors and actuators required might be different. The instantiation of an execution strategy is an executable plan.

3.1.3 Language and Knowledge Bridge

To enable the use of a frame system in an open domain, it was necessary to develop an ontology-based memory model that merges the agent's world knowledge with linguistic information.

In this domain representation, all ontology components are represented by OL nodes. These nodes contain two layers:

• The O layer contains a non-descriptive representation of a component. This layer maintains an entity's typification and its relations with other entities.

• The L layer contains a list of senses. Each sense is associated with one language, as described in Section 3.1.1.

The O-ontology represents all the concepts that the system knows about and is able to handle. In this layer, the concept of a ball would be connected to the notion of physical object, as shown in Figure 3.2. This object may have other properties, such as color or size. The L layer allows associations between each concept and its description in a given language. This way, different concepts can share the same words without the danger of ambiguity.
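A minimal sketch of how such a two-layer node could be represented is shown below. The class name and fields are illustrative assumptions, not the prototype's actual code.

import java.util.List;
import java.util.Map;

// Illustrative sketch of a two-layer OL node: an ontological identity (O layer)
// plus per-language sense annotations (L layer). Names are hypothetical.
class OLNode {
    String oId;                           // O layer: opaque identifier, e.g. "x01"
    Map<String, List<String>> senses;     // L layer: language code -> sense keys
    Map<String, List<OLNode>> relations;  // typed relations to other nodes, e.g. "is" -> [object]
}

In this representation, the "ball" node of Figure 3.2 would carry its English, Portuguese, and Spanish senses in the L layer and an "is" relation to the "object" node in the O layer.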

Figure 3.2: Ontological and linguistic representation of the relationship between "Ball" and "Object" (each OL node carries an O-layer identifier and English, Portuguese, and Spanish senses).

3.2 Interpretation

When a new utterance is detected by the system, it is initially processed by the natural language processing chain. The resulting tree, containing the verb and its arguments, is passed, along with the language in which it was processed, to the Interpret algorithm (Figure 3.3). This algorithm matches what was said to a meaningful structure in the system's memory. In this algorithm, a list of FrameSets is obtained from the sentence's verb. For each member of this list, Sound determines whether the sentence structure matches one of the Frames in the FrameSet. If it does, all the possible meanings obtained by the combination of word senses are generated. We denote by t and l the utterance's syntactic tree and language, and by Strategies(m, f) the set of execution strategies for a given frame and meaning. Valid(es, m) returns whether that execution strategy can be executed with that meaning.

Interpret(t, l):
  lFrameSet ← FrameSets(Verb(t), l)
  r ← []
  for fs in lFrameSet do
    if Sound(fs, t) then
      lMeaning ← Meanings(fs, t, l)
      f ← Frame(fs, t)
      for m in lMeaning do
        for es in Strategies(m, f) do
          if Valid(es, m) then
            Push(r, Instantiate(es, m))
          end if
        end for
      end for
    end if
  end for
  return r

Figure 3.3: Interpret algorithm

3.2.1 Generating Meaning Combinations

Combinations (Figure 3.4) creates a list of tree copies with all possible combinations of senses for the verb arguments. Ask queries the memory for all the senses of each argument. If an argument is a compound word (e.g. "the blue ball") and it is not represented in the memory, a structure is created in memory containing the combination of all the words. In our example, this structure would be associated with the concepts "blue" and "ball".

Combinations(t, l):
  r ← [t]
  for arg in Args(t) do
    temp ← []
    for ti in r do
      known ← Ask(arg, l)
      if Length(known) = 0 then
        known ← Inquiry(arg, l)
      end if
      for s in known do
        t′i ← Copy(ti)
        Set-Sense(arg, t′i, s)
        Push(temp, t′i)
      end for
    end for
    r ← temp
  end for
  return r

Figure 3.4: Combinations algorithm

If no sense is found for an argument, meaning that the concept is not represented in memory or that a mapping between this language and the concept does not exist, Inquiry is called for the argument. This call suspends the current computation and probes for a sense: this can be achieved by querying the user for the sense, or by executing a knowledge augmentation algorithm over the memory. Once a valid sense has been obtained for the argument, the computation resumes.

Combinations returns a list of trees annotated with senses. This list must then be combined with all the α-structures provided by the current FrameSet senses. This is done by Meanings. The final list contains all possible meanings that the memory can provide for that sentence. A description of this function is provided in Figure 3.5.

Meanings(fs, t, l):
  r ← []
  for s in Senses(fs) do
    α ← Find-α(s)
    for ti in Combinations(t, l) do
      Set-Sense(Verb(ti), ti, α)
      Push(r, ti)
    end for
  end for
  return r

Figure 3.5: Meanings algorithm

3.2.2 Matching Meaning with Valid Actions

After generating frame candidates in the previous steps, each element in the list returned by Meanings is validated. For every execution strategy in an element, the restrictions associated with it are matched against the argument senses of that element. If they match, the execution plan is instantiated and the result is collected. If not, it is discarded.

If the result list of Interpret has a single element, then there is only one possible interpretation. If it contains more than one element, then we have an ambiguity; the system can choose one of them or ask the user what to do. If the list is empty, the system understood all the concepts, but no action could be taken.
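A minimal sketch of how a caller might handle these three outcomes is given below; ExecutablePlan and the helper methods are hypothetical placeholders, not the prototype's API.

import java.util.List;

// Illustrative handling of Interpret's three possible outcomes.
class InterpretationDispatcher {
    void handle(List<ExecutablePlan> plans) {
        if (plans.size() == 1) {
            execute(plans.get(0));                 // a single interpretation: run it
        } else if (plans.size() > 1) {
            execute(askUserToDisambiguate(plans)); // ambiguity: defer the choice to the user
        } else {
            reportNoValidAction();                 // concepts understood, but no action applies
        }
    }
    void execute(ExecutablePlan p) { /* hand the plan to the mind */ }
    ExecutablePlan askUserToDisambiguate(List<ExecutablePlan> ps) { return ps.get(0); }
    void reportNoValidAction() { /* raise an error to the mind */ }
}
class ExecutablePlan { }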

3.3 Generation

When the system needs to express itself, it has to generate an utterance. There are two ways of producing one. The first is a template system that creates utterances from a pre-fabricated, generic sentence with empty slots to be filled with appropriate content. This type of system is the simplest to implement but leads to stilted dialogue and is very restrictive, making it hard to handle unexpected situations: either the agent has a template to express something, or it simply cannot say it.
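As a minimal sketch of the template approach, assume a hypothetical slot syntax of the form {slot}; the class and method names are illustrative.

import java.util.Map;

// Illustrative template filler: replaces {slot} markers with values chosen
// from the knowledge base. The template syntax is an assumption.
class TemplateRealizer {
    String fill(String template, Map<String, String> slots) {
        String result = template;
        for (Map.Entry<String, String> e : slots.entrySet()) {
            result = result.replace("{" + e.getKey() + "}", e.getValue());
        }
        return result;
    }
}
// Example: fill("Found a {color} {object}.", Map.of("color", "red", "object", "ball"))
// yields "Found a red ball."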

The other way is to use a generative process. First, a Discourse Planner receives the communication goal and uses it to select entities from the knowledge base and the words to be used, which can be gathered from the senses associated with those entities and from the discourse history. With this information it structures the content, resulting in a discourse plan. The next step is the Surface Realizer, which receives this plan and generates the sentences.

We can find similarities between these two processes and the reactive and deliberative layers of agent systems. The first is simple and fast but inflexible. The second is more complex and harder to build, but it is adaptable and its output is richer.


Chapter 4

System Architecture

In this chapter we present an overview of our architecture. The main idea of this architecture is that dialogue can use the agent system's semantic world knowledge and emotional state. As such, the Dialogue Manager must be embedded inside the agent system's Decision Engine. What we want to achieve is to integrate elements of the Dialogue Manager inside the Agent System.

Our architecture is influenced by Project LIREC1. This project aims to create a new generation of interactive, emotionally intelligent companions capable of long-term relationships with humans. The architecture of this project is composed of three levels. We chose this project because it uses FAtiMA as the system's mind and, from the overview of Chapter 2, it was one of the architectures that implements an emotional state. In Figure 4.1 we present the LIREC architecture2.

Level 1 is composed of the hardware components or, more abstractly, their APIs. Level 2 is composed of competencies. These are an abstraction level between the mind's perceptions and actions and the hardware calls. They implement higher-level hardware capabilities (or low-level mind actions and sensors) like facial recognition, sound processing, location awareness, etc. Competencies can combine several sensors and actuators to create advanced competencies. Also, competencies can interact with the mind's knowledge base: for instance, when a ball is detected on a camera, the competence responsible for detecting balls can add new knowledge like color, size, or distance. Level 3 is an instance of FAtiMA.

Based on the study performed in Section 2.3 and the capabilities available in each system, we can present a system that extends the behavior of Levels 2 and 3. In particular, we extend the agent's memory, adding an ontology with linguistic tags as described in Chapter 3, and give Level 2 direct access to that memory.

4.1 Input

We designed the system to be interacted with in various ways, but the preferred way for a user to interact with it is by speaking. The physical agent, while roving a set, might also find objects that change its view of the world and therefore its actions. We start by explaining how utterances from the user are processed in our system.

4.1.1 Linguistic Input

As previously mentioned, Level 2 competences work as an implementation of sensors and actuators, and as an indirection between the mind and the hardware. It is in this component that we implement the linguistic input and output of our system, as competences.

1 European Union Project, under grant agreement no. 215554
2 http://trac.lirec.eu/wiki/AgentArchitecture

Figure 4.1: LIREC's Architecture

A user can interact with our system by using a terminal or a speech recognition component. Either one will yield a string representing an utterance. This string is then passed to our natural language processing (NLP) chain. We chose to adopt a well-known NLP architecture for our linguistic processing, modifying it to benefit from the system's memory. A representation of the modules in our processing chain is shown in Figure 4.2.

String → Syntactic Analyser → Sense Tagger → Frame Instantiation → Frame Arguments Interpretation → Frame Validation

Figure 4.2: Input natural language processing chain

The syntactic analyzer receives a string and produces a tree representing the phrase's syntactic structure. This tree is passed to the Sense Tagger, which finds senses for each leaf of the tree. These senses are obtained from WordNet. All senses found for a given word and part of speech (POS) are stored in the tree.

The next three modules (Frame Instantiation, Frame Arguments Interpretation, and Frame Validation) execute the algorithms presented in Section 3.2. The tagged tree is then passed to Frame Instantiation.


This component will attempt to find a frame matching the verb of the utterance. Frames are described in Section 3.1.1.

Each verb may have several associated frames, varying in argument types and number. Only the frames that are structurally sound with respect to the tree (i.e. whose structural representation matches the tree) are instantiated. In Figure 4.3 we see the instantiation of two frames from a tree generated for the sentence "Rose find a ball". It is important to notice that more than one frame is instantiated, because "find" can mean both "discover" and "obtain".

Figure 4.3: Frame Instantiation from a Tagged Syntactic Tree (the VerbNet frames get-13.5.1 and discover-84 are instantiated for the sentence "Rose find a ball").

The next step is to associate a memory entity with each of the arguments, and for that we need to interpret them. Frame Argument Interpretation analyses each argument and detects what it is referencing. Arguments may be composed of single words (e.g. "Rose") or composite phrases ("a ball"). In the first case we are explicitly referencing a memory entity, while in the second we are referring to an unknown object. For composite phrases, a parser attempts to detect what is being referenced. If the parser detects an indefinite article, then the entity being mentioned is unknown to the system, so a new entity is created that inherits its properties from the noun mentioned next in the phrase. If it is a definite article, then the referenced entity should already exist in memory and may have been mentioned before.

The earlier sense tagging may collect several senses for each word, and each of those senses might have an associated entity in memory. Consequently, we have to create all possible combinations of entities.

There is always the possibility that no entity is found for an argument. This means that the entity we are trying to reference is not known to the system, or that the reference cannot be inferred from the system's knowledge. In this case an error is sent to the mind so that the user can provide further clarification. It is also important to understand that entities are not added to the knowledge base in this phase. At this point, only "promises" of new entities are added; if this line of interpretation is proven correct, using the restrictions, the entities are then added.

When the interpretation phase ends, we have a (likely large) set of possible interpretations for the utterance. Many of them may be nonsense, while others cannot be carried out by the system. What determines which action can be performed is the utterance's verb. During interpretation, an α-structure is appended to the utterance's verb. This structure contains several possible actions associated with restrictions. The restrictions are logical assertions over the frame arguments. Suppose the verb "find" has an action capable of finding balls in a given space. The restriction for this action might be:

∃x : Is(Robot, {Agent}) ∧ Contains({Agent}, x) ∧ Able(x, Detect, Ball) ∧ Is(Ball, {Theme})

When validating a frame, we first replace {Agent} and {Theme} with the entities inferred before, and then evaluate this expression against our knowledge base. If the assertion yields true, then the action is a valid match for that frame's interpretation of the user's utterance.
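A minimal sketch of this validation step is shown below, assuming a hypothetical KnowledgeBase with an assertion-checking query; this is an illustration, not FAtiMA's actual API.

import java.util.Map;

// Illustrative restriction check: substitute frame roles into the restriction
// template, then ask the knowledge base whether the grounded assertion holds.
// KnowledgeBase and its holds(...) method are hypothetical placeholders.
class RestrictionValidator {
    boolean isValid(String restriction, Map<String, String> roleBindings, KnowledgeBase kb) {
        String grounded = restriction;
        for (Map.Entry<String, String> b : roleBindings.entrySet()) {
            grounded = grounded.replace("{" + b.getKey() + "}", b.getValue());
        }
        // e.g. "Exists x: Is(Robot, ROVIO) and Contains(ROVIO, x) and Able(x, Detect, Ball) and Is(Ball, BALL01)"
        return kb.holds(grounded);
    }
}
interface KnowledgeBase { boolean holds(String assertion); }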

All the frames instantiated previously are evaluated against the restrictions of all possible actions for that verb. After this is done, only one action is supposed to remain valid. If the system ends up with more than one valid action for an utterance, we are in a situation of ambiguity; if no valid action is found, the system does not know what the utterance means. In both cases an error is raised to the mind and the user is asked what he wishes to do. Once the action is found, we can obtain the goal to activate.

The speech acts that can be interpreted by the system are only limited by the linguistic information added to the agent's memory. Also, as information is added to the system, it can handle more complex dialogue.

4.1.2 Visual Input

Our architecture is designed to be embodied in physical agents, and because of that we have competences associated with the physical agent's sensors, such as visual detectors capable of detecting objects (balls, shapes, faces, etc.). These competences are responsible for gathering information about the external world for the agent and for expanding its knowledge. This makes it possible to talk with physical agents about objects in the world using verbal communication, even when referring to information unknown to the system.

4.2 Decision Making

All decisions in our system are made by an instance of FAtiMA. As mentioned before, there are two concepts used in FAtiMA's mind: Goals and Actions. Goals are plans, sets of actions with unbound variables, that can be instantiated. Actions are things that can be performed by the mind. In our architecture, actions can be connected to competences that perform jobs on Level 1 hardware.

When a goal is found for an utterance, the interpretation competence activates it. This selection is made, as seen in Section 4.1, by evaluating the restrictions against the frame entities and the knowledge base.

When the goal is activated, FAtiMA's emotional continuous planner uses the goal's preconditions to ground the variables that can be used to instantiate the goal. It is important to notice that, although similar, preconditions and restrictions are used differently in our system. Restrictions are used to select a goal, while preconditions are used to select valid variables with which to instantiate it. In a typical deployment of FAtiMA, preconditions might not hold when goals are active; in fact, when variables change or are created due to interaction with the outside world, goals can become instantiated. In our system this is not supposed to happen: preconditions only select the variables that will instantiate the goal.

After the goal is instantiated, all declared actions are executed. When the competency manager detects the event of an action being performed, and a competence is associated with it, that competence is invoked.

We also make use of FAtiMA's emotional model. FAtiMA's goals can produce emotional changes in the agent's internal state. We make use of this in two ways. First, world effects, including user utterances, can change the agent's emotional state either positively or negatively; examples are reinforcing the agent's actions by complimenting it, or kicking it and expecting a negative emotional response. Second, the agent can react to these changes: after being complimented, it can produce some motion to show its satisfaction, like a dog wagging its tail when it is being petted.

4.3 Memory

The memory in our system is not changed from what FAtiMA already provides; however, we extend its functionality. As explained in Chapter 3, we use a two-layer system that combines linguistic concepts and ontological knowledge.

When the system starts, this ontology is populated with knowledge that allows the system to interact with the user and relate to the world. This knowledge must contain basic statements about the world, but the agent itself must also be described in it. The reason is the system's need to reason about itself when evaluating the α-structure restrictions and when grounding variables to instantiate a goal. This allows, for instance, the goal of finding objects to be used to detect both balls and QR codes, by selecting the appropriate detector.

Another important change is the capacity of Level 2 to access, change and ground expressions on the knowledge base. We added a channel that allows Level 2 Competencies to interact with the mind.

4.4 Output

This system is also designed to express itself in the physical world. It can do so in two ways: the first is by verbally addressing the user, the other is by executing some action in the world. Both are implemented as Level 2 Competences.

4.4.1 Speaking

When the system needs to address the user about some question, it does so via speech. Our architecture provides a two-way processing system in a Level 2 Competence, dealing with reactive and deliberative processes, as can be seen in Figure 4.4.

Communicative Goal → Discourse Planning → Surface Realizer → Text To Speech

Figure 4.4: Natural Language Generation processing chain

The Discourse Planner receives a communication goal and queries the knowledge base for the relevant information, producing a discourse plan that is passed to the Surface Realizer. This component uses the plan to produce an utterance; this can be achieved by transforming the plan into a functional description and unifying it with a functional grammar. It is possible that the Discourse Planner has a template for the communication goal. In this case it takes the template and fills the empty slots with the selected entities, and the Surface Realizer works as an identity function.

The result is a string representing the text to be spoken. This text is then sent to the Text-To-Speech component and audio output is produced. The Text-To-Speech can be affected by the emotional state of the agent and produce speech with emotional inflection.

4.4.2 Movement

Since the system is connected to a physical agent, it needs the ability to move. For this to be possible, the agent's movement abilities need to be described in memory. This also requires the agent to contain actions and goals related to those abilities. The actions are connected to Level 2 Competences that translate the mind's requests into physical movements of the robot.

4.5 The complete Architecture

In the previous sections we saw the components that compose our architecture. First, we described how input, either natural language or images, is processed. Next, we presented how decision-making is performed by the system's mind and how memory interacts with the system. Finally, we described how the system can express itself, either by sound or by movement. In Figure 4.5 we present the complete architecture, with all the previously presented components combined.

Figure 4.5: Our adaptation of LIREC’s Architecture


Chapter 5

Implementation

This chapter presents a prototype implementing the architecture proposed in the previous chapter and its integration with a physical robot. We start by presenting the internal workings of both our Level 2 and Level 3 implementations, followed by a description of the competencies used. In Figure 5.1 we present our implementation architecture, with the Level 2 competencies and Level 1 devices.

Figure 5.1: Our implementation architecture (Level 1: Rovio, Julius, MARY TTS; Level 2: Competency Manager with the Movement and Robot Sensory, Input, and Speech Competencies; Level 3: Mind with Ontology)

5.1 Mind and Competence Manager

As previously stated, our system is influenced by project LIREC. We chose to keep our work close to that of the project so we could use its resources. Therefore, we chose FAtiMA as our Level 3 mind. For Level 2, we used CMION, a competence manager that connects to FAtiMA and manages all the competences, providing background and action-driven processing. It works as a middleware layer between the mind and the hardware. Both systems are implemented in Java.


CMION competences are Java classes that extend the class Competence and implement the following methods:

runsInBackground: Returns a Boolean value that declares whether the competence is supposed to be a background competence or is supposed to be called when a given action happens.

initialization: Implements any custom initialization required for the competence to function.

competenceCode: The main code of the competence is implemented in this method. It receives a set of parameters and their values, which are defined in the competence rule file, and returns a Boolean value representing whether the action was a success or a failure.

Each competence runs on a separate thread, so it is possible for two competences to be running at the same time.
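A minimal sketch of what such a competence might look like is given below, using the method names listed above. The base class and method signatures are simplified assumptions, not CMION's exact API.

import java.util.HashMap;

// Illustrative competence following the structure described above.
class SayCompetence extends Competence {
    @Override
    public boolean runsInBackground() {
        return false; // invoked only when the "Say" mind action occurs
    }

    @Override
    public void initialization() {
        // e.g. open the connection to the speech synthesizer
    }

    @Override
    public boolean competenceCode(HashMap<String, String> parameters) {
        String utterance = parameters.get("Entity"); // bound by the competence rule file
        return speak(utterance);                     // success or failure of the action
    }

    private boolean speak(String utterance) {
        System.out.println("TTS: " + utterance);     // placeholder for the real TTS call
        return true;
    }
}

// Simplified stand-in for CMION's base class, so the sketch is self-contained.
abstract class Competence {
    public abstract boolean runsInBackground();
    public abstract void initialization();
    public abstract boolean competenceCode(HashMap<String, String> parameters);
}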

CMION is configured with two XML files. The first is the competence library, which declares the competences and the Java classes that implement them. The second file contains the competence manager rules; it is in this file that mind actions are associated with a competence. An example of a competence rule is presented next.

<Rule>
  <MindAction Name="Say">
    <Parameter No="1" Value="*"/>
  </MindAction>
  <CompetencyExecutionPlan>
    <Competency ID="1" Type="say">
      <CompetencyParameter Name="Entity" Value="$parameter1"/>
    </Competency>
  </CompetencyExecutionPlan>
</Rule>

The first node, MindAction, declares the mind action with which the competence is going to be associated. In this case the action is Say([what]); since this action takes a parameter, we need to declare it. This is important because, for FAtiMA, Say() and Say([what]) are two different actions.

The second node represents the competence plan. It is possible to use more than one competence for a given action. In this case we are only using one, and we are associating the action parameter ([what]) with the parameter Entity in the competence.

The system starts when CMION is launched; it then waits for FAtiMA to connect. It spawns a window presenting some debug information about each competence and the events being received from FAtiMA. Communication between the two software modules uses FAtiMA's own communication protocol.

Each installation of FAtiMA contains a directory called data with configurations for each possible character that the mind may implement. Each character represents a set of traits, default goals, interpersonal relation values, and default emotional responses. For our system we needed to define a character. Because our system was designed to test the input stream, we left all the emotional traits with the same value and removed all default goals; the Level 2 competences can add them when needed.

Goals are defined in the Goal library file. This XML file contains all the goals FAtiMA loads when it starts. A goal is presented next:


Figure 5.2: CMION main window, waiting for FAtiMA.

<ActivePursuitGoal name="Sense$1([agent],[theme])">
  <PreConditions>
    <Property name="HAS([agent],[x])" operator="=" value="True"/>
    <Property name="IS([x],SENSOR)" operator="=" value="True"/>
    <Property name="ABLE([x],DETECT,[y])" operator="=" value="True"/>
    <Property name="IS([theme],[y])" operator="=" value="True"/>
  </PreConditions>
  <SucessConditions>
    <RecentEvent occurred="True"
                 subject="[SELF]"
                 action="FindBall" target="[theme]"/>
  </SucessConditions>
  <FailureConditions>
  </FailureConditions>
  <ExpectedEffects>
  </ExpectedEffects>
</ActivePursuitGoal>

A goal is referenced by its name. The expressions within brackets ([agent] and [theme]) represent the unbound variables. A goal is instantiated when these variables are matched with entities in the knowledge base, and that instantiation may only occur once during the execution of FAtiMA.

The first node, PreConditions, represents the logical assertion that must be valid for the goal to be activated. In this part two other unbound variables appear ([x] and [y]). All of these variables must be proven bindable to some entity for the goal to be instantiated. For the goal presented above, the precondition can be represented by the following logic statement:

∃ agent, theme, x, y ∈ KB : Has(agent, x) ∧ Is(x, Sensor) ∧ Able(x, Detect, y) ∧ Is(theme, y)

The next node, SucessConditions, expresses the changes or effects that the goal is intended to produce in the world state. In this case, we state that we intend to execute the action FindBall(theme), with theme being the value bound by the goal.

If any of the conditions present in FailureConditions becomes true while the plan is being executed, then the goal fails. Finally, ExpectedEffects represents the changes to the agent's emotional state.

FAtiMA's actions are defined by three elements: Preconditions, Effects, and EffectsOnDrives. These represent, respectively, a logic statement similar to the goal's that determines whether the action can be performed, the expected results of executing the action, and the emotional changes produced by its execution. In our implementation, actions normally contain none of these elements, simply because most of them are implemented using competences.

After FAtiMA is launched, a window is presented. In this window it is possible to view information about the mind (knowledge base contents, emotional state, goals and intentions, etc.).

Figure 5.3: FAtiMA main window, showing the knowledge base

By default, FAtiMA informs CMION about actions being executed, and CMION returns the result of the competence execution. An action is only finished when the competence is done, and the success or failure of the action depends on the competence's result. We extended this behavior by adding a few more capabilities. CMION is capable of reading and writing information in the agent's mind, but we need full access to the agent's mental capacities (e.g. grounding unbound expressions) to perform the linguistic analysis. Therefore we created a small Remote Procedure Call mechanism between Level 2 and Level 3 so we could fully access the mind's memory.

After CMION and FAtiMA are connected, the system enters a waiting state. Only when competences detect new input and changes are made to the mind does it start working again.

Since no goal is present when the system starts, when a competence needs to activate some mind action it starts by adding the corresponding goal to the active goal set. By then, all the entities necessary for the goal's instantiation should be in memory, and the goal is automatically instantiated.

When the goal is instantiated, the described actions are executed. CMION is informed of each execution and, if a competence is associated with it, the relevant code is run.

5.2 Competences

Next we present the competences we created for our system. These allow the system to interact with the external environment, be it through a microphone, a speaker, or a body.


5.2.1 Input Competence

When we started developing our system we decided to use only English as the input language. Although our system is designed to handle several languages, as described in Chapter 3, the purpose of this thesis is not to create a multilanguage system, but to merge two independent systems. Also, there are more Natural Language Processing tools available for English than for any other language.

In order to communicate with the system, we used Julius [1], an automatic speech recognizer (ASR). We chose this ASR because it is open source and can be configured easily. However, the results we obtained with large grammars and vocabularies proved disappointing. Because of that, we decided to train this ASR with a simple grammar and vocabulary. We also connected it to a small client that chooses a pre-defined phrase using word-spotting analysis on the input, allowing us to use more complex phrases. Speech recognition is done at Level 1 and communicates with the rest of the architecture over a data stream. After speech recognition is done, the input is sent to the Input Competence, in which we implemented the Natural Language Processing chain described in Section 4.1.1.

The first step of this chain is to pass the user input to a syntactic analyzer. This analyzer was written for this project and works as a Web Service; we chose this approach because the analysis is CPU-intensive and was being done on a separate machine. The analyzer is composed of two software pieces: the Stanford NLP Parser [16] and Dan Bikel's Multilingual Statistical Parsing Engine [5]. The first is a morphological parser and the second a syntactic parser. The reason for using these two is that they are simple to use, ready to use for English, and widely used. The output is a tree representing the verb and its arguments, with words as leaves.

The next step is to annotate the utterance with senses. To do that, we loaded WordNet version 3 into our system. To query the linguistic ontology we used the Java WordNet Library1. For each leaf in the tree we store all possible senses for that word when used in that part of speech.
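A minimal sketch of this tagging step is shown below. The SenseLookup interface stands in for the actual Java WordNet Library calls, and the tree types are illustrative assumptions.

import java.util.List;

// Illustrative sense tagging: for each leaf, query a WordNet-backed lookup for
// the senses of (word, POS) and attach them to the node. Types are hypothetical
// stand-ins, not the actual Java WordNet Library API.
interface SenseLookup {
    List<String> senses(String word, String pos); // e.g. ("ball", "NN") -> sense keys
}

class SenseTagger {
    private final SenseLookup wordnet;
    SenseTagger(SenseLookup wordnet) { this.wordnet = wordnet; }

    void tag(TreeNode node) {
        if (node.isLeaf()) {
            node.setSenses(wordnet.senses(node.word(), node.pos()));
        } else {
            for (TreeNode child : node.children()) {
                tag(child);
            }
        }
    }
}

interface TreeNode {
    boolean isLeaf();
    String word();
    String pos();
    List<TreeNode> children();
    void setSenses(List<String> senses);
}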

At this point, the system must find a valid phrase structure. For that we used VerbNet 3.1 and built a querying system on top of that data. Once the system finds a matching frame for the input, it instantiates a Frame object. This object contains a list of FrameElements, which are populated with the verb and its arguments. The matching process is accomplished by comparing the structure of the utterance with the one declared by VerbNet for the sense associated with the verb.
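A minimal sketch of this structural comparison (the Sound test of Section 3.2) follows, assuming the utterance has been reduced to a sequence of top-level constituents; names are illustrative.

import java.util.List;

// Illustrative structural match: an utterance is "sound" with respect to a frame
// if its top-level constituent sequence matches the frame's declared pattern.
// A real implementation would also allow optional elements and wildcards.
class StructureMatcher {
    boolean sound(List<String> framePattern, List<String> utteranceConstituents) {
        // e.g. framePattern = ["NP", "V", "NP"], utteranceConstituents = ["NP", "VBP", "NP"]
        if (framePattern.size() != utteranceConstituents.size()) {
            return false;
        }
        for (int i = 0; i < framePattern.size(); i++) {
            if (!compatible(framePattern.get(i), utteranceConstituents.get(i))) {
                return false;
            }
        }
        return true;
    }

    private boolean compatible(String expected, String actual) {
        // treat any verbal tag (VBP, VBD, ...) as matching the frame's "V" slot
        return expected.equals("V") ? actual.startsWith("V") : actual.equals(expected);
    }
}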

At this point we end up with a list of Frames. The next step is to find entities that match the arguments of the verb. For this we implemented a small interpreter that parses each of the arguments and attempts to match it with a valid entity in the Knowledge Base. The interpreter works as follows (a sketch of this dispatch is shown after the list):

• If the FrameElement is of type NP (Noun Phrase):

– If it contains only one word, we first search for the word itself in order to find the entities. This is done because mentioned entities have no associated senses; they are referenced by their name. If no object is found with that name, we next search for the senses related to that word. If any is found, it is associated with the FrameElement and we continue to the next FrameElement.

– If it contains more than one word, then it is a compound phrase. In this case we analyze each word from left to right. If the first word is a DET (determiner) and is a known determiner, the appropriate action is selected:

∗ If it is "a", we are attempting to find an unknown entity. A knowledge base object is created and associated with the FrameElement. While analyzing the following words we copy their properties into this newly created entity, thereby creating an instance of what the user said.

1http://sourceforge.net/projects/jwordnet/


∗ If it is "the", then we mark this entity as a specific knowledge base entity that must already be known. The following words must specify the entity being searched for.

• If the FrameElement is of type V* (Any kind of verb), do nothing.
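The following is a minimal sketch of this dispatch, under the assumption of simplified FrameElement, Entity, and KnowledgeBase types; it illustrates the rules above and is not the prototype's code.

import java.util.List;

// Illustrative argument interpreter following the rules above.
class ArgumentInterpreter {
    private final KnowledgeBase kb;
    ArgumentInterpreter(KnowledgeBase kb) { this.kb = kb; }

    void interpret(FrameElement element) {
        if (element.pos().startsWith("V")) {
            return; // verbs are handled later, via their restrictions
        }
        List<String> words = element.words();
        if (words.size() == 1) {
            // single word: try the name first, then the word's senses
            Entity byName = kb.findByName(words.get(0));
            element.setEntity(byName != null ? byName : kb.findBySenses(words.get(0)));
        } else if (words.get(0).equalsIgnoreCase("a")) {
            // indefinite article: create a new entity and copy properties of the rest
            Entity fresh = kb.createEntity();
            for (String w : words.subList(1, words.size())) {
                kb.copyProperties(w, fresh);
            }
            element.setEntity(fresh);
        } else if (words.get(0).equalsIgnoreCase("the")) {
            // definite article: the entity must already exist in memory
            element.setEntity(kb.findDescribedBy(words.subList(1, words.size())));
        }
    }
}

interface FrameElement { String pos(); List<String> words(); void setEntity(Entity e); }
interface Entity { }
interface KnowledgeBase {
    Entity findByName(String name);
    Entity findBySenses(String word);
    Entity findDescribedBy(List<String> words);
    Entity createEntity();
    void copyProperties(String word, Entity target);
}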

After this point the Frame is expected to be fully annotated with all the entities. The next step is to take the verb and find the restrictions associated with it.

In our system, the restrictions and their related goals are stored in the mind, associated with the verb's senses. For the sense of the verb for which the frame was selected, we check whether there are any associated restrictions. If none is found, then this frame is not executable by the mind. If one is found, an evaluator is called with the logic expression, with substitutions made to reflect the entities selected by the interpreter.

The evaluator is implemented by grounding each of the sub-expressions of the restriction against the knowledge base. After that, they are all unified using a unifier implemented from the description in [14]. The end result is either a failure or a structure with all possible associations between the unbound variables and their possible values. If such an association is found, then the frame is valid, and the goal is retrieved from the knowledge base and activated.

5.2.2 Ontology Loading

In order to introduce the memory model presented in Chapter 3 into FAtiMA's memory, we created a language that expresses the ontology we intend the system to know as soon as it starts. This language is a small domain-specific language built with Lisp.

The language allows the description of the ontology as sets of entities and relations. Each OL node is represented as:

(def-ol-node <entity-id> (<sense-id1> <sense-id2> ...) <relations>)

As previously stated, all information in the mind is supposed to be annotated with linguistic information. Therefore, relations and entities are required to be OL nodes. Relations are described as an association of two or more entities. For instance, if we want to say that the agent is able to move forward, we would add (able move front) and define the concepts able, move, and front. This yields the association Able(Move, Front). A small sample of our ontology is presented in Figure 5.4.

When the ontology is compiled, the OL nodes and relations are transformed into knowledge base statements. When the system starts, these statements are added to the Knowledge Base.

5.2.3 Movement and Robot Sensory Competence

This system is connected to a physical body, a robot called Rovio2. This body is a 3-wheel roving robot with a 640 × 480 webcam, a microphone, a speaker, and an infrared barrier detector. An image of the robot can be seen in Figure 5.5. The robot is controlled using an internal Web Service that receives commands over HTTP.

A process that receives requests from our system directly controls Rovio. This process is responsible for moving the robot, raising and lowering its head and receiving information from the camera. In the system there is a competence responsible for the motion of the agent that communicates with this process.

In this process, we also used object detectors [10] based on Rovio's camera. This is made possible using OpenCV, a computer vision library. Our initial objective was to have an always-on detector that, when a new ball was found, contacted the mind to inform it. However, the results were unsatisfactory due to a large number of false positives. An example of our detection system's output is shown in Figure 5.6.

2http://www.wowwee.com/en/products/tech/telepresence/rovio/rovio


(def-ol-node is ("be%2:42:03::") nil)
(def-ol-node contains ("contain%2:42:00::") nil)
(def-ol-node object ("object%1:03:00::") nil)
(def-ol-node property ("property%1:07:00::") nil)
(def-ol-node element ("element%1:09:00::") nil)
(def-ol-node mind ("mind%1:09:00:") nil)
(def-ol-node entity ("entity%1:03:00::")
  ((is object)))

(def-ol-node actuator ("actuator%1:06:00::")
  ((is element)))

(def-ol-node sensor ("sensor%1:06:00::")
  ((is element)))

(def-ol-node ball ("ball%1:06:03::")
  ((is object)))

(def-ol-node robot ("robot%1:06:00::")
  ((is entity)
   (contains actuator)
   (contains sensor)
   (contains mind)))

Figure 5.4: A sample of our ontology description

Figure 5.5: Our physical body, Rovio


Figure 5.6: The circle detection output

The approach we chose was to make searching for objects one of the body's capabilities, which calls a specific naïve search algorithm. This algorithm simply rotates the robot in small increments and detects whether any circle is in sight. If one is, the average color inside the circle's radius is calculated, and if it has the same value as the requested one, the algorithm considers it a ball.
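A minimal sketch of this naïve search loop is shown below; the robot and detector interfaces are hypothetical stand-ins for the Rovio control process and the OpenCV-based detector.

import java.util.Optional;

// Illustrative naive ball search: rotate in small steps, look for a circle,
// and accept it if its average color matches the requested one.
// Robot, CircleDetector, Circle, Color and Frame are hypothetical interfaces.
class NaiveBallSearch {
    private static final int STEP_DEGREES = 15;

    Optional<Circle> search(Robot robot, CircleDetector detector, Color requested) {
        for (int turned = 0; turned < 360; turned += STEP_DEGREES) {
            Optional<Circle> circle = detector.detect(robot.captureFrame());
            if (circle.isPresent()
                    && circle.get().averageColor().closeTo(requested)) {
                return circle;          // a ball of the requested color is in sight
            }
            robot.rotate(STEP_DEGREES); // keep scanning the set
        }
        return Optional.empty();        // no matching ball found after a full turn
    }
}

interface Robot { Frame captureFrame(); void rotate(int degrees); }
interface CircleDetector { Optional<Circle> detect(Frame frame); }
interface Circle { Color averageColor(); }
interface Color { boolean closeTo(Color other); }
interface Frame { }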

5.2.4 Speech Competence

In our prototype we decided to use only a template system. Although limited, it is powerful enough to show the interaction capacities of the system.

When a part of the system needs to speak, it adds some information to the memory (what is going to be said, who is being addressed, etc.) and activates the speaking goal. This goal is instantiated with the utterance that the system needs to speak. The plan for this goal may contain movement actions, such as positioning the physical agent near the person or object being addressed.

This goal has an action specific to verbal interaction with the user, which is associated with the speech competence. When this competence is executed, the utterance selected to be spoken is retrieved from the mind and passed to our Text-To-Speech system, MARY TTS [23]. We chose this system because it is simple to use and freely available.

The output of the TTS is then sent to the computer's speaker. It is possible to send it to Rovio's speaker, but the volume produced by the robot's speaker is very low and the results obtained in experiments were unsatisfactory.

5.3 Comments

In this chapter we described the choices made in the implementation of a prototype, instantiated from the architecture presented in Chapter 4. This implementation is composed of three Competences that handle linguistic input, output, and vision. Although our implementation is simple, the objective of this work is to prove the feasibility of such an architecture.


Chapter 6

Evaluation

So far, we have presented an architecture that combines aspects of both dialogue systems and agent systems. Now we need to verify whether our system is capable of expressing emotional changes. In order to evaluate our system, we devised a scenario in which a user asks a robot to search for balls. The robot starts searching and, when it finds one, alerts the user by making a sound and verbally addressing him. Our objective with this scenario is to test the entire linguistic pipeline, while using the situated agent on the set and making use of the emotional feedback from the agent system.

We simulate emotional expressiveness in Rovio by using motion or sounds. When it wants to address the user, it turns to him and raises its head. Each time the agent experiences an emotional change it plays a short sound. The sounds chosen are from the robot R2D2 of the Star Wars saga. The reason behind this choice is that R2D2 is a well-known robot about which most people have a positive opinion.

Each time our robot finds a different ball, the system receives a positive reinforcement. We also added the interjection "Great!" for when a user wishes to compliment the system. Although FAtiMA has an emotional state, we chose to implement the emotional feedback using variables in the system's memory; the reason was the need for CMION to access this information. The implementation may be different, but the outcomes are similar.

6.1 Scenario

In our scenario, a Rovio robot is placed in a set with three balls of different colors. A user is placed next to the robot and addresses it in order to have it perform some tasks.

This robot is connected to an instance of our system and contains knowledge about searching for balls. It is also connected to a competency capable of detecting balls. The user has a microphone connected to the system. An external judge views the entire scene without interacting in it, and answers some questions about the scenario.

6.2 Methodology

6.2.1 Objective

This evaluation was designed to determine whether the emotional changes taking place in the mind could be translated into reactions of the body that observers can perceive.


6.2.2 Population

The inquiries were made on the Internet. This caused some discrepancies in population size between scenarios, because the evaluation of each scenario was made independently.

For the first scenario forty-five inquiries were made, with sixteen females (36% of the population) and twenty-nine males (64% of the population). Table 6.1 shows the age distribution for this population.

Age range | Freq.
18 to 23  | 5
23 to 28  | 14
28 to 33  | 9
33 to 38  | 9
38 to 43  | 5
Over 43   | 3

Table 6.1: Age distribution for the first scenario inquiry

For the second scenario forty inquiries were made, with sixteen females (40% of the population) and twenty-four males (60% of the population). Table 6.2 shows the age distribution for this population.

Age range | Freq.
18 to 23  | 4
23 to 28  | 14
28 to 33  | 8
33 to 38  | 6
38 to 43  | 3
Over 43   | 5

Table 6.2: Age distribution for the second scenario inquiry

6.2.3 Materials

The system was installed on a computer connected to a microphone and to another computer controlling the robot. The scenario was recorded using two cameras, one viewing the scene and another showing the user.

6.2.4 Procedure

In this scenario there are two actors: one is the user, who interacts with the system, and the other is the observer, who does not participate directly in the actions. The scenario starts with the robot standing right in front of the user and turned towards him. The user then interacts with the system using the microphone.

We performed two interaction experiments. In the first situation, we asked the robot to search for balls. When the robot finds a ball, it turns to the user asking for feedback. In this situation, however, the user only talks with the system to awaken it and to give it a task. Below, we can see the dialogue established between the user and the system.

User Rose!

(Rovio raises its head)

Robot Yes?

User Find the balls.


Figure 6.1: The set used for evaluation

(Rovio lowers its head and starts searching for a ball, turning. When it finds a new ball, it beeps, stops and says:)

Robot Found a red ball.

(Turns to the user and raises the head. The user does not respond. After some time, it turns back to where it was and continues the search. When it finds a new ball, it beeps, stops and says:)

Robot Found a blue ball.

(Turns to the user and raises the head. The user once again does not respond. After some time, it turns back to where it was and continues the search. When it finds a new ball, it beeps, stops and says:)

Robot Found a yellow ball.

(And the robot stops.)

In the second situation, the interaction starts as in the first. However, after finding the first ball, the user congratulates the system on its work. The system then gains confidence in its work and continues without addressing the user. What we are attempting to convey here is the increase in the robot's confidence due to the user's interaction.

User Sarah!

(Rovio raises its head)

Robot Yes?

User Find the balls.

(Rovio lowers the head and starts searching for a ball, turning. When it finds a new ball, it beeps, stops and says:)

Robot Found a red ball.


(Turns to the user and raises the head.)

User Great!

(The robot turns back to where it was and continues the search. When it finds a new ball, it beeps, stops and says:)

Robot Found a blue ball.

(This time it continues searching without turning to the user. After finding the next ball:)

Robot Found a yellow ball.

(And the robot stops.)

In order to simplify the inquiry to the observers, we decided to record both situations and show the observers a movie of each. After seeing each movie, the following questionnaire was presented.

1. Did the emotional state of the robot change during the experiment? (Answers: Yes / No)

2. How did the robot express the emotional change? (Answers: Behavioral Change / Audio Feedback / Speech / Did not change)

3. Did the robot change its behavior after finding the red ball? (Answers: Yes / No)

4. Did the robot change its behavior after finding the blue ball? (Answers: Yes / No)

5. Did the robot change its behavior after the user spoke with it? (Answers: Yes / No)

After all inquiries were made, the results were collected and statistical analysis was performed.

6.3 Results

6.3.1 Change on the emotional state of the agent

The first question asks whether the observer felt that the emotional state of the system changed during the movie. Comparing the number of "Yes" answers between the first and the second situation, we can see that there is an increase. This is easy to see in the chart plotted from this data (shown in Figure 6.8).

By Age

Separating the data by age, we obtain the charts in Figure 6.2 for the first situation and Figure 6.3 for the second.

It is easy to notice that from the first situation to the second there is an increase in "Yes" answers in every case except the ranges 28 to 33 and Over 43. In the range 28 to 33, the discrepancy is due to a male observer who answered the first inquiry with "Yes" but did not answer the second inquiry. The other discrepancy is caused by a male observer who answered "Yes" in the first situation and "No" in the second.


Figure 6.2: Emotional Detection by Age on the First Situation

Figure 6.3: Emotional Detection by Age on the Second Situation


By Gender

After separating the data according to gender, we obtain the charts shown in Figures 6.4 and 6.5, for the first and second situations respectively.

From this data, it is also easy to see that the number of "Yes" answers increases from the first to the second situation, for both genders.

Figure 6.4: Emotional Detection by Gender on the First Situation

Figure 6.5: Emotional Detection by Gender on the Second Situation

By Age and Gender

We also decided to drill down into the data and separate it by both age and gender (charts presented in Figures 6.6 and 6.7).


This approach also shows an increase in the number of "Yes" answers across the ranges for both females and males, except for the ranges 28 to 33 and Over 43 in the male gender. The reason for both these discrepancies is the same as discussed in Section 6.3.1.

Figure 6.6: Emotional Detection by Age and Gender on the First Situation

Figure 6.7: Emotional Detection by Age and Gender on the Second Situation

The conclusion we can draw from the data shown above is that the observers did detect a change of behavior from the first to the second situation. Separating the data by age and gender leads to interesting observations, even though our population is not large enough to reach a solid conclusion. We can say that females tended to detect an emotional change in higher numbers than males. Age does not seem to be a differentiating factor in detecting emotional changes.


Figure 6.8: Overall Emotional Detection on both scenarios

6.3.2 Emotional expressiveness

The next question in the inquiry concerns how the agent expressed its emotional state.

• First situation: Of those who answered "Yes" to the first question in the first situation, 48% said it was due to Audio Feedback from the robot, and 24% to Behavioral Changes.

• Second situation: In the second situation, "Yes" to the first question has a much higher frequency. Also, the number of people answering that the change was due to Behavioral Change (60%) or Audio Feedback (50%) increased.

Combining this data with that shown above leads us to believe that, in the first situation, people thought that the sounds the robot made when it detected a ball were an emotional response. Also in this situation, Behavioral Change was given as the expression of an emotional change by 24% of people. Of those, in the last three questions, the same observers consistently reported that the change was provoked by user interaction. It may be that those observers were considering the initial greeting or the command as the reason for the emotional change.

In the second situation, however, the answers inverted, and people saw the change in behavior as the expression of the emotional change. They did, however, report that audio feedback also played a relevant part as an emotional response.

6.3.3 Final Remarks

The results attained, although very limited, lead us to believe that people can detect the changes in emotional behavior produced by our system, and therefore suggest that our system can be used to create Synthetic Entities. Although the population size is small, we can detect changes from the first to the second situation.

Even so, there were some problems. In the second situation, almost everyone who answered "Yes" to the first question also indicated that a behavior change was caused by human input. But in the first situation, people also indicated a change due to user interaction when answering "Yes" to the first question. In both situations they indicated that changes in behavior were detected when the balls were found, but the results are not consistent. This may be because the last three questions were not posed correctly and led the observers to ambiguous answers.


Chapter 7

Conclusions & Future Work

7.1 Conclusions

The objectives we set out to accomplish with this thesis were:

• The integration of two systems (Dialogue Systems and Agent Systems) into a single interactive system.

• Merging the information from both systems.

For the second point, we presented in Chapter 3 a cognitive model that allows the information from the two systems to be accessed from either side. In order to interact with such a system, we also created algorithms to interpret user utterances and generate dialogue using this memory.

For the first point, we outlined an architecture in Chapter 4 that combines both agent and dialogue systems. The user can interact with this system using their own voice, speak with it in natural language, and interact with the physical world through a robotic agent. We proposed a natural language processing chain that maps utterances onto the memory model described in Chapter 3.

In Chapter 5, we presented the implementation of a prototype based on the architecture presented in the previous chapter. This system was composed of a mind and four competencies that implemented the tasks needed for a user to interact with it. We implemented a domain-specific language that allows the description of our memory model. We also implemented the natural language processing chain described in Chapter 4, which uses information from both WordNet and VerbNet in order to map utterances to knowledge base entities. The system was connected to a robot, giving it a physical presence. We implemented the competencies necessary to allow the robot to move and used detectors that could find shapes and objects in a scene. We also implemented competencies allowing the system to generate speech and produce audio.
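As a rough approximation of this lexical lookup step (not the prototype's actual processing chain), the following sketch uses NLTK's WordNet and VerbNet corpus readers, assuming the corresponding corpora are installed:

from nltk.corpus import wordnet as wn
from nltk.corpus import verbnet

def annotate_token(lemma, pos):
    """Attach candidate WordNet synsets and, for verbs, VerbNet class ids to a lemma."""
    synsets = wn.synsets(lemma, pos=pos)                            # pos is wn.NOUN, wn.VERB, ...
    vn_classes = verbnet.classids(lemma=lemma) if pos == wn.VERB else []
    return {"lemma": lemma, "synsets": synsets, "verbnet_classes": vn_classes}

# For a command such as "find the ball", the verb and its object are annotated
# and can then be matched against entities in the knowledge base.
print(annotate_token("find", wn.VERB))
print(annotate_token("ball", wn.NOUN))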

With the intention of assessing the effectiveness and viability of our solution, we submitted our prototype to the evaluation described in Chapter 6. In our scenario we created two situations in which the user asks the system to find balls and, in one of them, congratulates it on its actions. We then evaluated the data gathered from people observing the experiment. Although limited, the results obtained indicate that this system can be used to create Synthetic Entities. Our evaluation also shows that users can detect emotional changes in the robot.

Thus, the overall results show that the proposed architecture succeeds in meeting the objectives proposed in this thesis.


7.2 Future Work

In this section, we present some lines of development that can extend the work presented in the previous chapters.

7.2.1 Advanced speech capabilities

Our speech generation uses a simple template system with predefined phrases. This leads to very simple and unconvincing dialogues between the user and the agent. In order to create believable dialogues, the agent must be capable of expressing itself about the world, or even about itself, in free form. This requires a sophisticated text generation system.
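For illustration, the kind of predefined-phrase generation described above can be reduced to something like the following sketch (the intents, templates, and slot names are invented examples, not the prototype's):

import random

TEMPLATES = {
    "found_object": ["I found the {obj}.", "There is a {obj} over here."],
    "not_found":    ["I could not find the {obj}."],
    "greeting":     ["Hello. How can I help you?"],
}

def generate(intent, **slots):
    """Pick a predefined phrase for the intent and fill in its slots."""
    return random.choice(TEMPLATES[intent]).format(**slots)

print(generate("found_object", obj="ball"))

Every wording in such a scheme has to be written by hand, which is precisely the limitation a generative approach addresses.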

One possible improvement is using a Text Generation system to produce more complex utterances, drawing on the information in the agent’s mind to generate the phrases. Since all entities are annotated with linguistic information, we can use that same information in a generative process. It would be necessary to select the information the agent wishes to talk about, transform it into a usable form (retrieving the linguistic form from the sense annotation and collecting the speech time), and call a text generation grammar (like SURGE) to produce an utterance.

Another interesting line of development is expanding the system's use of emotions. We used emotions in our experiments, but only expressed them through movement. MARY TTS, the Text-To-Speech system used in this work, is capable of modulating speech to express emotions. One possibility would be tagging the system's utterances with emotional moods so that the TTS could convey those emotions while synthesizing speech.
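One simplified way to tag utterances with a mood is to map each mood to prosody settings and wrap the text in standard SSML markup, as sketched below. The mood-to-prosody values are invented, and whether a given MARY TTS version and voice honours this exact markup (rather than its own expressive input formats) is an assumption that would need to be checked.

MOOD_PROSODY = {
    "happy":   {"rate": "fast",   "pitch": "+15%"},
    "sad":     {"rate": "slow",   "pitch": "-10%"},
    "neutral": {"rate": "medium", "pitch": "+0%"},
}

def to_ssml(text, mood="neutral"):
    """Wrap an utterance in SSML prosody markup according to its mood tag."""
    p = MOOD_PROSODY[mood]
    return ('<speak><prosody rate="%s" pitch="%s">%s</prosody></speak>'
            % (p["rate"], p["pitch"], text))

print(to_ssml("I found the ball!", mood="happy"))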

7.2.2 Agent to agent communication

The presented system is designed to interact with humans, but future interactive systems may need to communicate with one another. There are solutions for independent systems to communicate (such as Remote Procedure Calls, the Web Services Description Language, etc.), but these methods are designed to share information between computer systems. Communication between agents may also be a very hard task, because formats may not be compatible and one system may not know about some kind of information the other is sharing.

We, as humans, use language to communicate with other humans. When we need to share some kind of information the other person cannot yet understand, we can use language to expand their knowledge about the subject.

One line of development for our system would be the ability for agents to communicate among themselves. Some problems arise from this proposal. When one agent needs to interact with another, their world views may differ, requiring the agents to synchronize parts of their world models. In our system all knowledge must be annotated linguistically; it can therefore be expressed in language and consequently be shared.
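A very rough sketch of this idea, with invented helper names, is one agent turning a linguistically annotated entity into a natural-language description that another agent can store under the same sense tag:

def describe(entity):
    """Turn a linguistically annotated entity into a shareable description."""
    return "A %s (%s): %s" % (entity["lemma"], entity["sense"], entity["gloss"])

def learn_from(description, knowledge):
    """Naive receiver side: store the incoming description under its sense tag."""
    head, gloss = description.split(": ", 1)
    lemma = head.split(" ")[1]
    sense = head[head.index("(") + 1:head.index(")")]
    knowledge[sense] = {"lemma": lemma, "gloss": gloss}

kb_b = {}
learn_from(describe({"lemma": "ball", "sense": "ball.n.01",
                     "gloss": "a round object used in games"}), kb_b)
print(kb_b)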

7.2.3 Adding new behavior by teaching the system

Manually inserting all the behaviors a system can perform is tiresome and leads to very unsatisfactory results. Since our system has a physical presence in our world, it would be interesting to teach it new tasks through a watch-and-learn methodology or by describing the task to be performed. The system already contains information about itself and has an emotional state that can be used as reinforcement feedback. This could lead to a believable character that learns about the world it lives in and adapts itself to new situations.
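As a hypothetical sketch of using the emotional state as reinforcement feedback (the update rule and names are illustrative and not part of the prototype), demonstrated actions that leave the agent in a more positive emotional state could have their preference increased:

from collections import defaultdict

preferences = defaultdict(float)   # demonstrated action -> learned preference
ALPHA = 0.1                        # learning rate

def reinforce(action, valence_before, valence_after):
    """Shift the action's preference by the change in emotional valence."""
    reward = valence_after - valence_before
    preferences[action] += ALPHA * (reward - preferences[action])

# e.g. the user congratulates the robot after a demonstrated search task,
# raising its emotional valence from 0.0 to 0.8:
reinforce("search_for_ball", 0.0, 0.8)
print(preferences["search_for_ball"])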


7.2.4 Supporting Multi-Language

As described in Chapter 3, our model was designed to have multi-language capabilities. However, during the development of our solution, we decided to use only English (see Section 5.2.1). One possible development is adapting our Input Processing to another language. That requires a syntactic analyzer, a linguistic ontology similar to WordNet, a VerbNet-like resource, and automated speech recognition for the language we intend to support.
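One way to keep the pipeline language-independent would be to gather these per-language resources behind a single interface, so that supporting a new language means supplying a new resource bundle. The interface below is an assumption for illustration, not part of the prototype:

from typing import Protocol

class LanguageResources(Protocol):
    def recognize(self, audio: bytes) -> str: ...     # automated speech recognition
    def parse(self, sentence: str): ...               # syntactic analysis
    def noun_senses(self, lemma: str) -> list: ...    # WordNet-like ontology lookup
    def verb_classes(self, lemma: str) -> list: ...   # VerbNet-like lexicon lookup

def process(audio: bytes, lang: LanguageResources):
    """Run the input-processing steps through whichever language bundle is supplied."""
    sentence = lang.recognize(audio)
    return lang.parse(sentence)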

7.2.5 Error Detection

Since our prototype was designed to support our thesis, it has little error handling and recovery capabilities. However, since our system is self-described, if an error occurs it can pinpoint the part of the system in which the error happened.

An interesting line of development would be the system informing the user about its internal faults in natural language. Moreover, since the interaction is verbal, the user could assist the system in overcoming the error by explaining what it should do.
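A minimal sketch of this idea, with hypothetical names, wraps each competency call so that a failure is reported back to the user in natural language, naming the component that failed:

def run_competency(name, func, *args):
    """Run a competency and, on failure, report which component failed and why."""
    try:
        return func(*args)
    except Exception as exc:   # broad catch only for the sake of reporting
        report = "I could not complete the task: my %s competency failed (%s)." % (name, exc)
        print(report)          # in the prototype this would be sent to the speech output
        return None

def move_to(x, y):
    raise RuntimeError("left wheel not responding")

run_competency("movement", move_to, 1.0, 2.0)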


Bibliography

[1] A. Lee, T. Kawahara, and K. Shikano. Julius – an open source real-time large vocabulary recognition engine. In European Conference on Speech Communication and Technology, 2001.

[2] J. Allen, D. Byron, M. Dzikovska, G. Ferguson, L. Galescu, and A. Stent. An architecture for a generic dialogue shell. Natural Language Engineering, pages 213–228, 2000.

[3] J. Allen, G. Ferguson, and A. Stent. An architecture for more realistic conversational systems. In Proceedings of Intelligent User Interfaces 2001 (IUI-01), pages 1–8, 2001.

[4] J. Allen, G. Ferguson, and A. Stent. Toward conversational human-computer interaction. AI Mag.,22(4):27–37, 2001.

[5] D. M. Bikel. A distributional analysis of a lexicalized statistical parsing model. In EMNLP, pages 182–189, 2004.

[6] D. Bohus, A. Raux, T. K. Harris, M. Eskenazi, and A. I. Rudnicky. Olympus: an open-source framework for conversational spoken language interface research. In NAACL-HLT ’07: Proceedings of the Workshop on Bridging the Gap, pages 32–39, Morristown, NJ, USA, 2007. Association for Computational Linguistics.

[7] H. I. Christensen, A. Sloman, G.-J. Kruijff, and J. Wyatt. Cognitive Systems. 2009.

[8] J. Dias. FearNot!: creating emotional autonomous synthetic characters for empathic interactions. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, March 2005.

[9] J. Dias, S. Ho, B. Azenha, R. Prada, S. Enz, C. Zoll, M. Kriegel, A. Paiva, K. Dautenhahn, R. Aylett, M. R. Gomes, and C. Ferreira. D4.1.1-d6.2.1 actor with attitude and component specification for computational autobiographic memory, May 2007.

[10] N. Diegues and D. M. de Matos. Abstracting a robotic platform for agent embodiment. 2010. Internal Report - Spoken Language Systems Laboratory, INESC-ID Lisboa, Portugal.

[11] G. Ferguson and J. F. Allen. TRIPS: An Integrated Intelligent Problem-Solving Assistant. In Proc. 15th Nat. Conf. AI, pages 567–572. AAAI Press, 1998.

[12] L. Franqueira. Luís de Camarões - Emotional Conversational Agent. Unpublished Technical Report, 2009.

[13] S.-I. K. Hiroshi, S. Nakamura, K. Itou, S. Morishima, T. Yotsukura, A. Kai, A. Lee, Y. Yamashita, T. Kobayashi, K. Tokuda, K. Hirose, N. Minematsu, A. Yamada, Y. Den, T. Utsuro, and S. Sagayama. Galatea: Open-source software for developing anthropomorphic spoken dialog agents. In Life-Like Characters: Tools, Affective Functions, and Applications. Springer-Verlag, 2004.

[14] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, second edition, 2008.

[15] K. Kipper, A. Korhonen, N. Ryant, and M. Palmer. A large-scale classification of English verbs. Springer-Verlag, 2008.

[16] D. Klein and C. D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3–10. MIT Press, 2003.


[17] F. Martins. DIGA - Desenvolvimento de uma Plataforma para a Criação de Sistemas de Diálogo. Master’s thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa, Lisboa, Portugal, September 2008.

[18] F. M. Martins, A. Mendes, M. Viveiros, J. P. Pardal, P. Arez, N. J. Mamede, and J. P. Neto. Reengineering a domain-independent framework for spoken dialogue systems. In SETQA-NLP ’08: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 68–76, Morristown, NJ, USA, 2008. Association for Computational Linguistics.

[19] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[20] C. Pelachaud. Embodied contextual agent in information delivering application. In First International Joint Conference on Autonomous Agents & Multi-Agent Systems (AAMAS), pages 758–765. Press, 2002.

[21] C. Pelachaud. Multimodal expressive embodied conversational agents. In Proceedings of ACM Multimedia, pages 683–689, 2005.

[22] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2003.

[23] M. Schröder. The German text-to-speech synthesis system MARY: A tool for research, development and teaching. International Journal of Speech Technology, pages 365–377, 2001.

[24] A. Staller, E. Staller, and P. Petta. Towards a tractable appraisal-based architecture for situated cognizers. In Workshop: Grounding Emotions in Adaptive Systems, pages 56–61, 1998.

[25] P. Vossen. EuroWordNet: A multilingual database with lexical semantic networks, 1998.

[26] M. Wooldridge. Introduction to MultiAgent Systems. John Wiley & Sons, June 2002.

[27] M. Wooldridge and N. R. Jennings. Intelligent agents: Theory and practice. Knowledge Engineering Review, 10(2):115–152, 1995.
