Proceedings of the SIGDIAL 2013 Conference, pages 284–293, Metz, France, 22-24 August 2013. © 2013 Association for Computational Linguistics

A Four-Participant Group Facilitation Framework for Conversational Robots

Yoichi Matsuyama, Iwao Akiba, Akihiro Saito, Tetsunori Kobayashi
Department of Computer Science, Waseda University

27 Waseda, Shinjuku-ku, Tokyo, Japan
{matsuyama,akiba,saito}@pcl.cs.waseda.ac.jp

[email protected]

Abstract

In this paper, we propose a framework for conversational robots that facilitates four-participant groups. In three-participant conversations, the minimum unit for multiparty conversations, social imbalance, in which a participant is left behind in the current conversation, sometimes occurs. In such scenarios, a conversational robot has the potential to facilitate situations as the fourth participant. Consequently, we present model procedures for obtaining conversational initiatives in incremental steps to engage such four-participant conversations. During the procedures, a facilitator must be aware of both the presence of dominant participants leading the current conversation and the status of any participant that is left behind. We model and optimize these situations and procedures as a partially observable Markov decision process. The results of experiments conducted to evaluate the proposed procedures show evidence of their acceptability and feeling of groupness.

1 Introduction

We present a framework for conversational robots that facilitates four-participant groups with proper procedures for obtaining initiatives. Figure 1 (a) depicts a two-participant conversation. In such situations, conversational contexts, including floor exchanges, are commonly grounded between the two interlocutors. Many dialogue systems have dealt with such two-participant situations (Raux and Eskenazi, 2009) (Chao and Thomaz, 2012). However, in three-participant conversations, as shown in Figure 1 (b), which is the minimum unit for multiparty conversation, floor exchanges cannot always be identified among the participants. Clark presented the participation structure model (Clark, 1996), drawing on Goffman's work (Goffman, 1981). In such three-participant situations, interactions primarily occur between two dominant participants out of the three (between participants A and B), and the other participant, who cannot properly get the floor to speak for a long while (cannot be promoted to either speaker or addressee), tends to get left behind, even though all of them are "ratified participants" considered by the current speaker.

In terms of engagement among conversational participants, Martin et al. (Martin and White, 2005) proposed the appraisal theory that encompasses three subcategories, namely Attitude, Engagement, and Graduation. Attitude deals with expressions of affect, judgement, and appreciation. Engagement focuses on language use by which speakers negotiate an interpersonal space for their positions, and the strategies they use to either acknowledge, ignore, or curtail other voices or points of view. Graduation focuses on the resources by which speakers regulate the impact of these resources. Sidner et al. defined engagement as "the process by which two (or more) participants establish, maintain and end their perceived connection during interactions they jointly undertake" (Sidner et al., 2004). Based on these previous studies, we define engagement as the process of establishing connections among participants using dialogue actions so that they can represent their own positions properly. The three-participant model thus dictates the need for one more participant who helps the participant who is left behind to engage him/her in the conversation. Conversational robots have the potential to participate in such conversations as the fourth participant, as illustrated in Figure 1 (c-1). Figure 1 (c-2) gives an example of the participants' speech activities over a certain duration. In this example, participant C's activity is relatively smaller than that of the others, and so he/she is likely to get left behind in the current conversational situation for a number of reasons. When a robot steps into the situation to coordinate, there should be proper procedures in place to obtain initiatives to control conversational contexts and to give them back to the others. If a robot naively starts to approach a participant who is left behind just after a left-behind situation is detected, it could break the current conversation. In order to coordinate situations, a facilitator (robot) must take the following procedural steps: (1) Be aware of both the presence of dominant participants leading the current conversation and the status of a participant who is left behind, (2) Obtain the initiative to control the situation and wait for approval from the others, either explicitly or implicitly, and (3) Give the floor to a suitable participant.

Research on specially situated facilitation agents in


Figure 1: Types of conversations according to number of participants (dashed arrows represent their gazes): (a) Two-participant conversation model, which conventional dialogue systems have focused on. (b) Three-participant conversation model, the minimum unit for a multiparty conversation. In such multiparty conversations, social imbalance occasionally occurs. (c-1) Four-participant conversation, with a robot that regulates the imbalance situation, and (c-2) chart showing the unequal speech activities of the participants. In this case, participant C appears to have less opportunity to take the floor to speak; hence, the robot is expected to help him.

multiparty conversations has been conducted by various researchers. Matsusaka et al. pioneered the act of a physical robot participating in multiparty conversations (Matsusaka et al., 2003). We previously developed a multiparty quiz-game type facilitation system for elderly care (Matsuyama et al., 2008) and reported on the effectiveness of the existence of a robot (Matsuyama et al., 2010). Dohsaka et al. developed a thought-evoking dialogue system for multiparty conversations with a quiz game task (Dohsaka et al., 2009). They reported that the existence of agents and empathic expressions are effective for user satisfaction and increase the number of user utterances. Bohus modeled engagement in multiparty conversations along Sidner's definition, namely open world dialogue (Bohus and Horvitz, 2009). In terms of facilitation, Benne et al. (Benne and Sheats, 1948) and Bales (Bales, 1950) pioneered investigations into small group dynamics, including functional facilitation roles. Kumar et al. designed a dialogue action selection model based on Bales's Socio-Emotional Interaction Categories for text-based character agents (Kumar et al., 2011).

In this paper, we propose a framework for a procedural facilitation process that increases the total engagement of a group while taking care of the side effects of its behaviors. The situations and procedures are modeled and optimized as a partially observable Markov decision process (POMDP), which is suitable for real-world sequential decision processes, including dialogue systems (Williams and Young, 2007). We begin by reviewing the facilitation of small groups and summarize requirement specifications for facilitation robots in the next section. In Section 3, we first describe representations of small group situations and procedures for maintaining small groups, then discuss how to model them as a POMDP. In Section 4, we give an overview of the architecture of our proposed system. We then discuss two experiments conducted to verify the efficacy of the small group maintenance procedures. Finally, we summarize our work and conclude this paper.

2 Facilitating Small Groups

2.1 Maintaining Small Groups

Benne et al. analyzed functional roles in small groups to understand the activities of individuals in small groups (Benne and Sheats, 1948). They categorized functional roles in small groups into three classes: Group task roles, Group building and maintenance roles, and Individual roles. The Group task roles are defined as "related to the task which the group is deciding to undertake or has undertaken." Those roles address concerns about the facilitation and coordination activities for task accomplishment. The Group building and maintenance roles are defined as "oriented toward the functioning of the group as a group." They contribute to social structures and interpersonal relations. Finally, the Individual roles are directed toward the satisfaction of each participant's individual needs. They deal with individual goals that are not relevant either to the group task or to group maintenance. Drawing on Benne's work, Bales proposed interaction process analysis (IPA), a framework for the classification of individual behavior in a two-dimensional role space consisting of a Task area and a Socio-emotional area (Bales, 1950). The roles related to the Task area concern behavioral manifestations that impact the management and solution of problems that a group is addressing. Examples of task-oriented activities include initiating the floor, giving information, and providing suggestions regarding a task. The roles related to the Socio-emotional area affect interpersonal relationships either by supporting, enforcing, or weakening them. For instance, complimenting another person to increase group cohesion and mutual trust among


members is one example of positive socio-emotional behavior. Benne's typology of functional roles has been evaluated as valuable and remarkably accurate. In this paper, we employ Benne's Group building and maintenance roles,1 which are related to Bales's Socio-emotional area, in order to arrange the following three abstract functional roles of group maintenance:

1. Topic Maintenance Role: Managing conflicts, ideas, and topics. This person mediates differences between other members, attempts to reconcile disagreements, and relieves tension in conflict situations. This role inherits Compromiser, Harmonizer, and Standard setter.

2. Floor Maintenance Role: Maintaining the chance for the floor in the group in a direct/indirect way. This person encourages or asks questions of the person who is not or could not get engaged in conversations, and attempts to keep the communication channel open. This role inherits Gatekeeper, Expediter, and Encourager.

3. Observation Role: Overlooking the conversation situation by finding appropriate topics, observing the motivations and moods of the participants, and comprehending the relations between participants in conversations. This person follows the conversation and comments on and interprets the group's internal process. This role inherits Observer and commentator and Encourager.

2.2 Procedures for Small Group Maintenance

In order for a participant who wants to claim an initiative (we call this participant a "claimant") to be transferred the initiative by the participant leading the current conversation (we call this participant a "leader"), the claimant must take procedural steps. First, the claimant must participate in the current dominant conversation the leader is leading, try to claim an initiative, and then wait for either explicit or implicit approval from the leader. Let us take the example shown in Figure 2. In the figure, participants A and B are primarily leading the current conversation. Participant C cannot get the floor to speak, and so the robot desires to give the floor to C. If the robot speaks to C directly, without being aware of A and B, the conversation might be broken, or separated into two (A-B and C-Robot), at best. In order not to break the situation, the robot should participate in the dominant conversation between A and B first, and set the stage such that the robot is approved to initiate the next situation. In this paper, we define such a state in which a person is participating in a dominant conversation as an "Engaged" state, and the opposite state as "Unengaged". Thus, in Clark's

1Benne’s Group building and maintenance roles are Com-promiser, Harmonizer, Standard setter, Gatekeeper and ex-pediter, Encourager, Observer and commentator, and Fol-lower.


Figure 2: Four-participant conversational group. Four participants, including a robot, are talking about a certain topic. Participants A and B are leading the conversation, and mainly keep the floor. The robot also engages with A and B in line with the topic. C is an unengaged participant, who does not have many chances to take the floor for a while. The dashed arrows indicate the direction they are facing, assuming their gazes.

participation structure, a speaker and an addressee are automatically Engaged participants. Side-participants are divided into Engaged and Unengaged participants based on their situations. In this paper, we assume that an Unengaged participant needs to respond to an Engaged participant's adjacency pair part to become engaged. Adjacency pairs are minimal units of conversation that are composed of two utterances by different speakers (Schegloff and Sacks, 1973). The speaking of the first utterance (the first part) provokes a responding utterance (the second part), and sometimes a third response (the third part). Understanding adjacency pairs is, therefore, essential to detecting cut-in timing.

On the basis of our discussion above, we define the following constraints for both Engaged and Unengaged participants when they address others and shift current topics:

1. Constraint of addressing: An unengaged participant must not address the other unengaged participants directly.

2. Constraint of topic shifting: An engaged participant must not shift the current topic when he/she addresses the other unengaged participants.

The relationships between subjective and objective participants that are permitted under the two constraints are shown in Tables 1 and 2. In the following sections, we describe a computational model that has the group maintenance functions discussed above.
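
To make the two constraints concrete, the following minimal Python sketch (ours, not part of the original system) implements the permission checks summarized in Tables 1 and 2 below; participant states are the strings "Engaged" and "Unengaged":

    def may_address(subject_state, object_state):
        # Constraint of addressing (Table 1): everything is permitted except
        # an Unengaged participant addressing another Unengaged participant.
        return not (subject_state == "Unengaged" and object_state == "Unengaged")

    def may_shift_topic(subject_state, object_state):
        # Constraint of topic shifting (Table 2): only an Engaged participant
        # addressing another Engaged participant may shift the current topic.
        return subject_state == "Engaged" and object_state == "Engaged"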

3 Procedure Optimization

3.1 Representation of Engagement State

We assume only one speaker and one addressee exist at each time-step, and one or two side-participants may


Table 1: Permission relationship between subjective and objective participants for the constraint of addressing. "Engaged" means a participant is assigned as a speaker, an addressee, or a side-participant who engages with the conversational group. "Unengaged" means a participant is assigned as an unengaged side-participant.

  Subject \ Object   Engaged     Unengaged
  Engaged            permitted   permitted
  Unengaged          permitted   NOT permitted

Table 2: Permission relationship between subjective and objective participants for the constraint of topic shifting.

  Subject \ Object   Engaged         Unengaged
  Engaged            permitted       NOT permitted
  Unengaged          NOT permitted   NOT permitted

exist in four-participant conversations. We define side-participants as having two states: "Engaged" and "Unengaged". In the scenario shown in Figure 2, participant C may not be able to take the floor for a while. The situation probably resolves itself when the current topic is shifted. Hence, we define the depth of a side-participant, $Depth_{SPT}$, as the proportion of the current topic's duration for which a participant has been assigned that role, which represents the level of engagement.

$$Depth_{SPT_i} = Duration_{SPT_i} / Duration_{topic_j} \quad (1)$$

$$Unengaged_{SPT} = \begin{cases} SPT_i & \text{if } Depth_{SPT_i} > Threshold \\ \text{none} & \text{otherwise} \end{cases} \quad (2)$$

The suffix $i$ represents a participant's ID. We also define an Unengaged participant's motivation to speak on the current topic. This state affects decision-making about topic maintenance. The amount of motivation of a participant is calculated as a linear sum of speech activity, smiling duration, and nodding duration. Further, the motivations in our current model are heuristically assumed to be binary variables.

$$Motivation_i = \begin{cases} 1 & \text{if } MotivAmount_i > Threshold \\ 0 & \text{otherwise} \end{cases} \quad (3)$$
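
As an illustration, the quantities in Eqs. (1)-(3) could be computed as follows; this is a minimal sketch of our own, and the threshold values and linear-sum weights are illustrative assumptions, not the paper's tuned parameters:

    DEPTH_THRESHOLD = 0.8   # hypothetical value
    MOTIV_THRESHOLD = 0.5   # hypothetical value

    def unengaged_spt(spt_durations, topic_duration):
        """Return the ID of a side-participant whose depth (Eq. (1))
        exceeds the threshold (Eq. (2)), or None."""
        for pid, duration in spt_durations.items():
            if duration / topic_duration > DEPTH_THRESHOLD:
                return pid
        return None

    def motivation(speech_activity, smile_dur, nod_dur, w=(0.4, 0.3, 0.3)):
        """Binary motivation from a linear sum of speech activity, smiling
        duration, and nodding duration (Eq. (3))."""
        amount = w[0] * speech_activity + w[1] * smile_dur + w[2] * nod_dur
        return 1 if amount > MOTIV_THRESHOLD else 0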

3.2 Procedure Optimization using POMDP

Figure 3: Influence diagram representation of the POMDP model. Circles represent random variables, squares represent decision nodes, and diamonds represent utility nodes. Shaded circles indicate unobserved random variables, while unshaded circles represent observed variables. Solid directed arcs indicate causal effects, while dashed directed arcs indicate that a distribution is used.

To optimize the procedures discussed above, we model the task as a partially observable Markov decision process (POMDP) (Williams and Young, 2007). Formally, a POMDP is defined as a tuple $\beta = \{S, A, T, R, O, Z, \gamma, b_0\}$, where $S$ is a set of states describing the agent's world, $A$ is a set of actions that the agent may take, $T$ defines a transition probability $P(s'|s,a)$, $R$ defines the expected reward $r(s,a)$, $O$ is a set of observations the agent can receive about the world, $Z$ defines an observation probability $P(o'|s',a)$, $\gamma$ is a geometric discount factor $0 < \gamma < 1$, and $b_0$ is an initial belief state $b_0(s)$. At each time-step, the belief state distribution $b$ is updated as follows:

$$b'(s') = \gamma \cdot P(o'|s',a) \sum_{s} P(s'|s,a)\, b(s) \quad (4)$$
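
For concreteness, Eq. (4) can be implemented directly when the state set is small enough to enumerate. The sketch below is our own illustration with distributions stored as nested dictionaries; we renormalize the result, which the equation leaves implicit:

    def belief_update(b, a, o, states, T, Z, gamma=0.95):
        """b: dict state -> prob; T[s][a][s2]: transition prob P(s2|s,a);
        Z[s2][a][o]: observation prob P(o|s2,a)."""
        b_next = {}
        for s2 in states:
            predicted = sum(T[s][a][s2] * b[s] for s in states)
            b_next[s2] = gamma * Z[s2][a][o] * predicted
        total = sum(b_next.values())
        return {s: p / total for s, p in b_next.items()} if total else b_next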

In this paper, we assume $S$ can be factored into three components: the participants' engagement states $S_e$, the participants' motivation states $S_m$, and the participants' actions $A_p$. Hence, the factored POMDP state $s$ is defined as

$$s = (s_e, s_m, a_p) \quad (5)$$

and the belief state $b$ becomes

$$b = b(s_e, s_m, a_p) \quad (6)$$

To compute the transition function and observation function, a few intuitive assumptions are made:

$$P(s'|s,a) = P(s'_e, s'_m, a'_p \mid s_e, s_m, a_p, a_s) = P(s'_e \mid s_e, s_m, a_p, a_s) \cdot P(s'_m \mid s'_e, s_e, s_m, a_p, a_s) \cdot P(a'_p \mid s'_m, s'_e, s_e, s_m, a_p, a_s) \quad (7)$$

Figure 3 shows the influence diagram depiction of our proposed model. We assume conditional independence as follows: The first term in (7), which we call the participants' engagement model $T_{S_e}$, indicates how the robot engages in the current dominant conversation at


each time-step. We assume that the participants' engagement state at each time-step depends only on the previous engagement state, the participants' action, and the system action. In this paper, the participants' engagement model only contains the robot's engagement states because this is sufficient for the procedures for obtaining initiatives. Table 3 shows the states of engagement.

$$T_{S_e} = P(s'_e \mid s_e, a_p, a_s) \quad (8)$$

In this paper, the probabilities of (8) were handcrafted, based on the considerations in Section 2.2 and our experience. When the engagement state is Un-Engaged and the robot is asked something by the current speaker, the state changes to Pre-Engaged, in which the robot awaits the speaker's approval to become Engaged. We assume that any dialogue act from the speaker addressing the robot in the Pre-Engaged state is an approval; otherwise, the state goes back to Un-Engaged. The Engaged state gradually decays back to the Un-Engaged state over time-steps unless the robot selects further dialogue acts.
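
The dominant transitions of this handcrafted model can be read as a simple state machine; the deterministic sketch below is our own simplification of what the paper encodes as probabilities:

    def next_engagement(state, addressed_by_speaker, robot_acted):
        if state == "Un-Engaged":
            # Being asked by the current speaker starts a claim.
            return "Pre-Engaged" if addressed_by_speaker else "Un-Engaged"
        if state == "Pre-Engaged":
            # Any dialogue act from the speaker addressing the robot
            # counts as approval; otherwise the claim is abandoned.
            return "Engaged" if addressed_by_speaker else "Un-Engaged"
        # Engaged: decays back to Un-Engaged unless the robot keeps acting.
        return "Engaged" if robot_acted else "Un-Engaged"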

We call the second term the participants' motivation model $T_{S_m}$. It indicates whether an Un-Engaged participant has the motivation to take the floor at each time-step. This state implies that the participant who is left behind (the target person) has a motivation to speak on the current topic. Thus, this state affects decision-making about topic shift. We assume that a participant's motivation at each time-step depends only on the previous system action. The motivations are defined as an unengaged participant's ID and a binary (true/false) variable, which is calculated by (3).

$$T_{S_m} = P(s'_m \mid a_s) \quad (9)$$

We call the third term the participants' action model $T_{A_p}$. It indicates what actions the participants are likely to take at each time-step. We assume the participants' actions at each time-step depend on the previous participants' action, the previous system action, and the current robot's engagement state. As shown in Table 5, participants' actions include adjacency pair types. Understanding adjacency pairs is essential to detecting cut-in timing. In this paper, we recognize adjacency pairs only by keyword matching using the results of speech recognition.

$$T_{A_p} = P(a'_p \mid s'_e, a_p, a_s) \quad (10)$$

The transition probabilities of adjacency pair types are based on a corpus we collected. We recorded two four-participant conversational groups (all participants were human subjects) in which the participants talked about movies. The total duration was around 60 minutes. Each utterance was segmented automatically by our speech recognizer. After the recording, adjacency pair types were manually annotated for all speech segments.

We define the observation probability $Z$ as follows:

$$Z = P(o'|s',a) = P(o' \mid s'_m, a'_p, a_s) \quad (11)$$

Table 3: Engagement states $S_e$
  Un-Engaged    The robot is not engaging with the current conversation.
  Pre-Engaged   The robot is waiting for approval to engage with the current conversation.
  Engaged       The robot is engaging with the current conversation.

Table 4: Motivation states $S_m$
  Motivated       The participant who is left behind has a motivation to speak on the current topic (interested in the current topic).
  Not-Motivated   The participant who is left behind does not have any motivation to speak (not interested in the current topic).

Given the definitions above, the belief state can be updated at each time-step by substituting (8), (9), and (10) into (4):

$$b'(s'_m, a'_p) = \gamma \cdot \underbrace{P(o' \mid s'_m, a'_p, a_s)}_{\text{observation model}} \cdot \underbrace{P(s'_m \mid a_s)}_{\text{motivation model}} \cdot \underbrace{P(a'_p \mid s'_e, a_p, a_s)}_{\text{participants' action model}} \cdot \sum_{s_e} \underbrace{P(s'_e \mid s_e, a_p, a_s)}_{\text{engagement model}} \cdot b(s_m, a_p) \quad (12)$$
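
A minimal sketch (ours, not the authors' implementation) of this factored update over the full state of Eq. (6), following the decomposition of Eqs. (7)-(11); the model objects are plain dictionary lookups, and the final renormalization is our addition:

    def factored_belief_update(b, a_s, o, S_e, S_m, A_p, mdl, gamma=0.95):
        """b: dict (s_e, s_m, a_p) -> prob; mdl holds the four factor models."""
        b_next = {}
        for se2 in S_e:
            for sm2 in S_m:
                for ap2 in A_p:
                    pred = sum(mdl["engagement"][(se2, se, ap, a_s)]
                               * mdl["motivation"][(sm2, a_s)]
                               * mdl["action"][(ap2, se2, ap, a_s)]
                               * p
                               for (se, sm, ap), p in b.items())
                    b_next[(se2, sm2, ap2)] = (
                        gamma * mdl["observation"][(o, sm2, ap2, a_s)] * pred)
        total = sum(b_next.values())
        return {k: v / total for k, v in b_next.items()} if total else b_next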

Table 6 shows the system actions. The system has seven actions available.

On the basis of the considerations of the constraints in Section 2.2, the reward measure includes components for both the appropriateness and inappropriateness of the robot's behaviors.

As an optimization algorithm, we employed the heuristic search value iteration (HSVI) algorithm proposed by Smith et al., which is one of the point-based algorithms (Smith and Simmons, 2012).

4 System Architecture

Based on the studies on small group maintenance, we propose an architecture for conversational robots that has the capability to facilitate small groups, as shown in Figure 4. The framework primarily comprises Situation Analysis, Dialogue Management, and Sentence Generation processes.

4.1 Situation Analysis and Dialogue Management

Each time the system detects an endpoint of speech from the automatic speech recognition (ASR) module, it interprets the current situation. The Situation Analysis process includes participation role recognition, adjacency pair part recognition, and question analysis.

Participation roles, including a speaker, an addressee, and side-participants, are recognized from the results of voice activity detection (VAD) and face direction recognition. The face directions are captured by depth-RGB cameras (Microsoft Kinect). In this paper, we use a hand-crafted role classifier. The speaker classification accuracy is 75.1% and the addressee



Figure 4: The architecture of the system primarily comprises the Situation Analysis, the Dialogue Management, and the Sentence Generation processes. The Situation Analysis process receives sensory information from RGBD cameras (Microsoft Kinect) and speech recognizers for each participant. The Dialogue Management process is described in Section 3. The Answer Generation process has the capability of doing additional phrasing with the robot's own opinions.

Table 5: Participants' actions $A_p$
  first-part    A participant made a first adjacency pair part (question)
  second-part   A participant made a second adjacency pair part (answer)
  third-part    A participant made a third adjacency pair part
  other         A participant asked or answered the other participant
  call          A participant called the robot's name

Table 6: System actions $A_s$
  answer                   Answering the current speaker's question
  question-new-topic       Asking someone a question related to a new topic
  question-current-topic   Asking someone a question related to the current topic
  trivia                   Giving a trivia item
  simple-reaction          Reacting simply
  nod                      Nodding to the current speaker
  none                     Doing nothing

classification accuracy is 67.2%. Adjacency pairs are recognized from the results of the participation role recognition and speech recognition. We use a hand-crafted adjacency pair classifier. Its classification accuracy is around 60%, which mostly depends on the classification accuracy of addressing for the robot. In the question analysis process, a speech utterance is interpreted with question types (5W1H interrogatives: e.g., "who," "what," "how," etc.) and predicates (verbs and adjectives). Questions are classified into two categories: Factoid type questions and Non-factoid type questions.
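
As an illustration of this keyword-matching approach, the sketch below types an ASR hypothesis; the English keyword lists are our stand-ins for the system's actual (Japanese) lexicon:

    INTERROGATIVES = {"who", "what", "when", "where", "why", "how"}
    FACTOID_CUES = {"who", "when", "where"}   # assumed split of question types

    def classify_utterance(tokens, prev_was_first_part):
        """Return (adjacency pair type, question type or None)."""
        if any(t in INTERROGATIVES for t in tokens):
            qtype = ("factoid" if any(t in FACTOID_CUES for t in tokens)
                     else "non-factoid")
            return ("first-part", qtype)
        if prev_was_first_part:
            return ("second-part", None)
        return ("other", None)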

In the Dialogue Management process, a dialogue action is selected based on the abstracted conversational situation in order to maintain a small group, as described in Section 3.

4.2 Sentence Generation

The Sentence Generation process consists of two components: Answer Generation and Question Generation. Based on the results of the Question Analysis process, answers are classified into two types: Factoid type answers and Non-factoid type answers (opinions). Factoid answers are generated from a structured database. In this research, we use Semantic Web technologies. After a question is analyzed, it is interpreted as a SPARQL query; SPARQL is a query language for searching resource description framework (RDF) databases. We use DBpedia as an RDF database.2
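
For illustration, a factoid question such as "Who directed X?" could be answered as in the sketch below; we use the SPARQLWrapper package and the Japanese DBpedia endpoint, but the query shape and property choice are our assumptions, not the paper's actual implementation:

    from SPARQLWrapper import SPARQLWrapper, JSON

    def lookup_director(film_label):
        sparql = SPARQLWrapper("http://ja.dbpedia.org/sparql")  # assumed endpoint
        sparql.setQuery("""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX dbo:  <http://dbpedia.org/ontology/>
            SELECT ?director WHERE {
              ?film rdfs:label "%s"@ja ;
                    dbo:director ?director .
            }""" % film_label)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [r["director"]["value"] for r in results["results"]["bindings"]]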

The opinion (non-factoid type answer) generation process refers to opinion data automatically collected from a large number of reviews on the Web. The opinion generation consists of four processes: document collection, opinion extraction, sentence style conversion, and sentence ranking. As an example task, we collected review documents from the Yahoo! Japan Movie site.3

The opinion extraction consists of two processes: extraction of evaluative expressions and classification of their sentiment polarities (positive/negative). We eliminate opinions with negative sentiments because the system is expected to talk about positive content in our conversational task. Nakagawa et al. (Nakagawa et al., 2008) used both a subjective evaluative dictionary (Higashiyama et al., 2008) and an evaluative noun dictionary (Kobayashi et al., 2007). We use an evaluative word dictionary we prepared based on their work. In order to extract evaluative expressions, which can appear at any position in a sentence, we use the IOB encoding method, which has been commonly used for extent-identification tasks (Breck et al., 2007). Using IOB, each word is tagged as either (B)eginning an entity, being (I)n an entity, or being (O)utside of an entity. Based on the method proposed by Nakagawa et al., we use linear-chain conditional random fields (CRFs) for the IOB encoding.
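
The sketch below shows the shape of such an IOB tagger, using the sklearn-crfsuite package as a stand-in for the authors' CRF; the features and the toy training pair are illustrative only:

    import sklearn_crfsuite

    def featurize(sent):
        # Word identity plus the previous word; a real system adds POS, etc.
        return [{"word": w, "prev": sent[i - 1] if i else "BOS"}
                for i, w in enumerate(sent)]

    X_train = [featurize(["the", "scenery", "was", "really", "beautiful"])]
    y_train = [["O", "O", "O", "B", "I"]]   # "really beautiful" is evaluative

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X_train, y_train)
    print(crf.predict([featurize(["simply", "beautiful"])]))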

In order to preserve the consistency of the system's character, sentence styles are converted based on hand-crafted rules we prepared. After Japanese morphological analysis, punctuation marks and particular symbols are eliminated. Then the last morpheme is converted.

We propose three ranking algorithms in terms of length and novelty: Short, Standard, and Diverse. Short is a shortest-length-first algorithm. In this algorithm,

2 http://ja.dbpedia.org/
3 http://movies.yahoo.co.jp

289

Page 7: A Four-Participant Group Facilitation Framework for Conversational

the top 30% of sentences by TF-IDF score that consist of seven to ten morphemes are extracted first. We assume the top 30% of candidates are reasonably associated with the current topic. For the Standard and Diverse algorithms, the top 30% of sentences by TF-IDF score that consist of fifteen to twenty morphemes are extracted first. The Standard algorithm is expected to produce substantial opinions or reasons, which can appeal to users about a certain topic. In this algorithm, the list is sorted by adjective term frequency. The Diverse algorithm is expected to express opinions or reasons with novel styles, which can be unpredictable or sometimes serendipitous to users about a certain topic. In this algorithm, the list is sorted in the inverse order of adjective term frequency.
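
A minimal sketch of the three variants, under our reading of the description above; sentence length is counted in tokens as a proxy for morphemes, and the scoring functions are assumed callables:

    def rank(cands, tfidf, adj_freq, mode="Standard"):
        """cands: list of token lists; tfidf/adj_freq: token list -> score."""
        lo, hi = (7, 10) if mode == "Short" else (15, 20)
        pool = [c for c in cands if lo <= len(c) <= hi]
        pool.sort(key=tfidf, reverse=True)
        pool = pool[:max(1, int(0.3 * len(pool)))]   # top 30% by TF-IDF
        if mode == "Short":
            return sorted(pool, key=len)             # shortest first
        # Standard: by adjective term frequency; Diverse: inverse order.
        return sorted(pool, key=adj_freq, reverse=(mode == "Standard"))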

4.3 Question Generation and User Model

The Question Generation module has two main functions: giving someone the floor, and collecting users' preferences and experiences for the User Model.

The User Model is referred to for topic maintenance. A preferred new topic is decided using the cosine similarity of TF-IDF scores. The topic scores (TopicScore) of all topics are calculated from the cosine similarities between each topic and the current topic (CurrentTopic), the user's topic preferences (PreferenceTopic), and the user's experiences (ExperienceTopic).

$$TopicScore_i = \alpha \cos(Topic_i \cdot CurrentTopic) + \beta \left( \sum_m \cos(Topic_i \cdot PreferenceTopic_m) \right) + \gamma \left( \sum_n \cos(Topic_i \cdot ExperienceTopic_n) \right) \quad (13)$$
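
With topics represented as TF-IDF vectors, Eq. (13) amounts to the following sketch; the weights are illustrative assumptions, and numpy arrays stand in for the system's internal representation:

    import numpy as np

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def topic_score(topic, current, prefs, exps,
                    alpha=0.5, beta=0.3, gamma=0.2):   # hypothetical weights
        return (alpha * cos(topic, current)
                + beta * sum(cos(topic, p) for p in prefs)
                + gamma * sum(cos(topic, e) for e in exps))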

4.4 Experimental Platform

For our experimental platform, we used the multimodal conversation robot "SCHEMA" ([ʃe:ma]) (Matsuyama et al., 2009), shown in Figure 2. SCHEMA is approximately 1.2 m in height, which is the same as the eye level of an adult male sitting in a chair. It has 10 degrees of freedom for the right-left eyebrows, eyelids, right-left eyes (roll and pitch), and neck (pitch and yaw). It can express anxiousness and surprise using its eyelids, and control its gaze using its eyes, neck, and an autonomous turret. In addition, it has six degrees of freedom for each arm, which can express gestures. One degree of freedom is assigned to the mouth to indicate explicitly whether the robot is speaking or not. A computer inside the belly controls the robot's actions, and an external computer sends commands to execute various behaviors through a WiFi network. All modules, including the ASRs and a speech synthesizer, are connected to each other through a middleware called the Message-Oriented NEtworked-robot Architecture (MONEA), which we developed earlier (Nakano et al., 2006).

5 Experiments

We designed the following two experiments to evaluate the appropriateness and feeling of groupness of our proposed procedures for multiparty conversations (experiment 1), and the appropriateness of the timing for initiating the procedures (experiment 2). In order to cancel the effects of recognition errors, we prepared video recordings of four-participant situations (human participants A, B, and C, and a robot), just like Figure 2. We created the conditions described below, all of which were optimized as a POMDP. All subjects were native Japanese speakers recruited from the Waseda University campus. They were first given a brief description of the purpose and the procedure of the conversation. They were instructed that A and B have a friendly relationship with each other, that C is coming in for the first time and is feeling nervous and is therefore left behind in the conversation, and that the robot is trying to maximize the total engagement of this situation. We also explained that "an engaged situation" meant "a situation in which all participants are given fair opportunities to speak."

5.1 Experiment 1: Appropriateness and Groupness by Usage of Procedures

A total of 35 subjects (23 males and 12 females) participated in this experiment. The ages of the subjects ranged between 20 and 25 years, with an average age of 20.5 years. After they watched the videos, they were asked to complete questionnaires about their feeling of groupness ("For which condition did you feel a sense of groupness?") and free-form questionnaires. The following four conditions were videotaped, and the videos were edited to around 30 s. All videos contained the same topic ("Princess Mononoke"). The spatial arrangement was the same as shown in Figure 2.

Condition 1: Without procedures (without topic shifting). The robot directly asks a participant left behind, without procedures claiming an initiative. As shown in Figure 8, just after a sequence of interactions between A and B, which is segmented by a third adjacency pair part, the robot directly asks C. The topic is still maintained ("Princess Mononoke").

Condition 2: With procedures (without topic shifting). The robot asks a participant left behind with a procedure. As shown in Figure 9, just after a sequence of interactions between A and B, the robot asks A with the first pair part and waits for A's response (the second part). It then finishes the interaction with A, and asks C to give him the floor. In this case, the current topic is still maintained ("Princess Mononoke").

Condition 3: Without procedures + topic shifting. In question #6 of Condition 1 (Figure 8), the robot initiates a new topic ("From Up On Poppy Hill") instead.

Condition 4: With procedures + topic shifting. In question #7 of Condition 2 (Figure 9), the robot initiates a new topic ("From Up On Poppy Hill") instead.

After watching the videos, the subjects were requested to answer 7-point Likert-scale questionnaires about (a) the


Figure 5: Result of experiment 1-a (appropriateness): panels (a) w/o topic shifting, conditions (1) w/o procedure vs. (2) w/ procedure, and (b) w/ topic shifting, conditions (3) w/o procedure vs. (4) w/ procedure; p < 0.01 in both panels.

Figure 6: Result of experiment 1-b (groupness): panels (a) w/o topic shifting, conditions (1) w/o procedure vs. (2) w/ procedure, and (b) w/ topic shifting, conditions (3) w/o procedure vs. (4) w/ procedure; p < 0.01 in both panels.

Figure 7: Result of experiment 2: panels (a) w/o topic shifting and (b) w/ topic shifting; conditions (1) first part, (2) second part, (3) No AP; * : p < 0.01, ns: not significant.

appropriateness of the procedures, and (b) the feeling of groupness.

5.2 Experiment 2: Appropriateness of Timing of Initiating Procedures

A total of 32 subjects (21 males and 11 females) participated in this experiment. The ages of the subjects ranged between 20 and 25 years, with an average age of 20.5 years. After they watched the videos, they were asked to complete questionnaires about the timing of the initiating procedures ("Which video did you feel was the most appropriate?"). The following three conditions were videotaped, and the videos were edited to around 30 s. All videos contained the same topic ("Princess Mononoke"). The spatial arrangement was the same as shown in Figure 2. We created three conditions:

Condition 1 (first part): Initiating a procedure just after the first adjacency pair part.

Condition 2 (second part): Initiating a procedure just after the second adjacency pair part.

Condition 3 (No AP): Without consideration of adjacency pairs.

In conditions 1 and 2, the robot initiated its procedures just after the first and second parts, respectively. In condition 3, the robot initiated its procedure in the middle of the adjacency pairs, which is intended to show that the robot does not care about adjacency pairs. We did not consider the timing of the third part of the adjacency pair because we had already examined the appropriateness of that timing in experiment 1. After watching the videos, the subjects were requested to answer 7-point Likert-scale questionnaires about the appropriateness of the robot's behavior.

5.3 Results and Discussions

Figure 5 shows that using the procedures is felt to be appropriate when approaching a participant left behind, either with or without topic shifting. The t-test results show a significant difference between conditions 1 and 2, as well as between conditions 3 and 4 (p < 0.01). Figure 6 shows that using the procedures generates feelings of groupness. The t-test results also show a significant difference between conditions 1 and 2, as well as between conditions 3 and 4 (p < 0.01).

Figure 7 (a) shows that initiating procedures without topic shifting just after the second pair part is felt to be more appropriate than in the other conditions. The result of an analysis of variance (ANOVA) shows significant differences among the conditions (F[2,26] = 34.46, p < 0.01). The result of multiple comparisons with the Tukey HSD method shows a significant difference between conditions 1 and 2, as well as between conditions 2 and 3 (p < 0.01). Figure 7 (b) shows that initiating procedures with topic shifting just after the second pair part is also felt to be more appropriate than in the other conditions. The ANOVA result shows significant differences among the conditions (F[2,26] = 42.52, p < 0.01), and multiple comparisons with the Tukey HSD method show a significant difference between conditions 1 and 2, as well as between conditions 2 and 3 (p < 0.01).

From these results, using the procedures to obtain initiatives before approaching a participant left behind showed evidence of acceptability as a participant's behavior, and of a feeling of groupness in the group. As for timing, initiating the procedures just after the second or third adjacency pair part is felt by participants to be more appropriate than after the first part.

6 Conclusions

We proposed a framework for conversational robots facilitating four-participant groups. Based on a representation of conversational situations, we presented a model of procedures for obtaining conversational initiatives in incremental steps to maximize the total engagement of such four-participant conversations. These situations and procedures were modeled and optimized as a partially observable Markov decision process. As the results of two experiments show, using the procedures for obtaining initiatives showed evidence of acceptability as a participant's behavior, and of a feeling of groupness. As for timing, initiating the procedures just after the second or third adjacency pair part is felt by participants to be more appropriate than after the first part.

7 Acknowledgements

This research was supported by the Grant-in-Aid for Scientific Research WAKATE-B (23700239). TOSHIBA Corporation provided the speech synthesizer customized for our spoken dialogue system.


References

Robert F. Bales. 1950. Interaction Process Analysis. Cambridge, Mass.

Kenneth D. Benne and Paul Sheats. 1948. Functional roles of group members. Journal of Social Issues, 4(2):41–49.

Dan Bohus and Eric Horvitz. 2009. Models for multiparty engagement in open-world dialog. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 225–234. Association for Computational Linguistics.

Eric Breck, Yejin Choi, and Claire Cardie. 2007. Identifying expressions of opinion in context. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2683–2688. Morgan Kaufmann Publishers Inc.

Crystal Chao and Andrea Lockerd Thomaz. 2012. Timing in multimodal turn-taking interactions: Control and analysis using timed Petri nets. Journal of Human-Robot Interaction, 1(1).

Herbert H. Clark. 1996. Using Language. Cambridge University Press, Cambridge.

Kohji Dohsaka, Ryota Asai, Ryuichiro Higashinaka, Yasuhiro Minami, and Eisaku Maeda. 2009. Effects of conversational agents on human communication in thought-evoking multi-party dialogues. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 217–224. Association for Computational Linguistics.

Erving Goffman. 1981. Forms of Talk. University of Pennsylvania Press.

Masahiko Higashiyama, Kentaro Inui, and Yuji Matsumoto. 2008. Acquiring noun polarity knowledge using selectional preferences. In Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, pages 584–587.

Nozomi Kobayashi, Kentaro Inui, and Yuji Matsumoto. 2007. Opinion mining from web documents: Extraction and structurization. Information and Media Technologies, 2(1):326–337.

Rohit Kumar, Jack L. Beuth, and Carolyn P. Rosé. 2011. Conversational strategies that support idea generation productivity in groups. In Proceedings of the 9th International Conference on Computer Supported Collaborative Learning, Hong Kong.

James R. Martin and Peter R. R. White. 2005. The Language of Evaluation. Palgrave Macmillan, Basingstoke and New York.

Yosuke Matsusaka, Tsuyoshi Tojo, and Tetsunori Kobayashi. 2003. Conversation robot participating in group conversation. IEICE Transactions on Information and Systems, 86(1):26–36.

Yoichi Matsuyama, Hikaru Taniyama, Shinya Fujie, and Tetsunori Kobayashi. 2008. Designing communication activation system in group communication. In Humanoid Robots, 2008. Humanoids 2008. 8th IEEE-RAS International Conference on, pages 629–634. IEEE.

Yoichi Matsuyama, Kosuke Hosoya, Hikaru Taniyama, Hiroki Tsuboi, Shinya Fujie, and Tetsunori Kobayashi. 2009. SCHEMA: multi-party interaction-oriented humanoid robot. In ACM SIGGRAPH ASIA 2009 Art Gallery & Emerging Technologies: Adaptation, pages 82–82. ACM.

Yoichi Matsuyama, Shinya Fujie, Hikaru Taniyama, and Tetsunori Kobayashi. 2010. Psychological evaluation of a group communication activation robot in a party game. In Eleventh Annual Conference of the International Speech Communication Association.

Tetsuji Nakagawa, Takuya Kawada, Kentaro Inui, and Sadao Kurohashi. 2008. Extracting subjective and objective evaluative expressions from the web. In Universal Communication, 2008. ISUC'08. Second International Symposium on, pages 251–258. IEEE.

Teppei Nakano, Shinya Fujie, and Tetsunori Kobayashi. 2006. MONEA: message-oriented networked-robot architecture. In Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on, pages 194–199. IEEE.

Antoine Raux and Maxine Eskenazi. 2009. A finite-state turn-taking model for spoken dialog systems. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 629–637. Association for Computational Linguistics.

Emanuel A. Schegloff and Harvey Sacks. 1973. Opening up closings. Semiotica, 8(4):289–327.

Candace L. Sidner, Cory D. Kidd, Christopher Lee, and Neal Lesh. 2004. Where to look: a study of human-robot engagement. In Proceedings of the 9th International Conference on Intelligent User Interfaces, pages 78–84. ACM.

Trey Smith and Reid Simmons. 2012. Point-based POMDP algorithms: Improved analysis and implementation. arXiv preprint arXiv:1207.1412.

Jason Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422.


  #  SPK→ADD  AP      Sentence
  1  A→B      First   Have you ever watched "Princess Mononoke"?
  2  B→A      Second  Yes, I have.
  3  A→B      First   Oh, you have?
  4  B→A      Second  Yeah.
  5  A→B      Third   I see.
  6  R→C      First   Have you ever watched "Princess Mononoke"?
  7  C→R      Second  Yes, I have.

Figure 8: Transcript of condition 1 (experiment 2)

  #  SPK→ADD  AP      Sentence
  1  A→B      First   Have you ever watched "Princess Mononoke"?
  2  B→A      Second  Yes, I have.
  3  A→B      Third   I see.
  4  R→A      First   It is one of my favorite movies among Ghibli's.
  5  A→B      Second  Really?
  6  B→A      Third   Yes.
  7  R→C      First   Have you ever watched "Princess Mononoke"?
  8  C→R      Second  Yes, I have.

Figure 9: Transcript of condition 2 (experiment 2)

  #   SPK→ADD  AP   Se   Sentence   (Topic: "007 Skyfall")
  1   A→B      1st  Un   Let's talk about the "Skyfall."
  2   A→B      1st  Un   Have you ever seen the latest one?
  3   B→A      2nd  Un   Well, I've not seen that.
  4   A→B      3rd  Un   Oh, really.
  5   R→A      1st  Pre  Well, I like the Bond Girl.
  6   A→R      2nd  Pre  I see.
  7   R→A      1st  Pre  I think that movie is good because of the setting of the "old age" for the 44-year-old James Bond.
  8   A→R      2nd  En   Uh-huh.
  (R is approved to obtain an initiative)
  9   R→A      3rd  En   Yes.
  10  R→C      1st  En   Have you ever seen the "Skyfall"?
  11  C→R      2nd  En   No, I haven't.
  12  A→C      1st  En   Oh, you haven't seen it?
  13  C→A      2nd  En   I never seen that before.

Figure 10: Interaction scenes. The "AP" column signifies adjacency pair types; "Se" is the robot's engagement state. At #4, the system recognized A's third adjacency pair part and then generated a spontaneous opinion addressed to A (#5) as a first part. At that point, the system assumed the state of engagement (Se) had changed from Un-Engaged to Pre-Engaged. After the system observed A's second part at #8, it assumed it had gotten approval to obtain an initiative to control the context (Engaged). At #10, the robot asked C a question in order to give him the floor.
