Designing a reconfigurable Multimodal and Collaborative Supervisor for Virtual Environment
Pierre Martin Patrick Bourdot
V&AR VENISE team, CNRS/LIMSI, B.P 133, 91403 Orsay (France)
Virtual Reality (VR) systems cannot be promoted for complex applications (involving the interpretation of massive and intricate databases) without natural and transparent user interfaces: intuitive interfaces are required to bring non-expert users to VR technologies. Many studies have been carried out on multimodal and collaborative systems in VR. Although these two aspects are usually studied separately, they share interesting similarities. Our work focuses on how to manage multimodal and collaborative interactions within a single process. We present here the similarities between these two processes and the main features of a reconfigurable multimodal and collaborative supervisor for Virtual Environments (VEs). The aim of such a system is to merge pieces of information coming from VR devices (tracking, gestures, speech, haptics, etc.) in order to control immersive multi-user applications using the main communication and sensorimotor channels of humans. The framework architecture of this supervisor is intended to be generic, modular and reconfigurable (via an XML configuration file), so that it can be applied to many different contexts.
Index Terms: H.1.2 [Models and Principles]: User/Machine Systems - Human information processing; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Artificial, augmented, and virtual realities; H.5.2 [Information Interfaces and Presentation]: User Interfaces - User interface management systems; I.3.6 [Computer Graphics]: Methodology and Techniques - Interaction techniques
1 INTRODUCTION
Efforts are still needed to study the use of Virtual Environments (VEs) for complex applications (product design, data exploration, etc.). But VEs cannot be promoted for such applications without natural and transparent user interfaces: intuitive interfaces could bring non-expert users to Virtual Reality (VR) technologies by exploiting human modalities. Moreover, these modalities have to be perceived and interpreted simultaneously, hence the need for multimodal interfaces in VEs. Multimodality has been a growing field of Computer Science since the eighties and important concepts have been addressed, especially by Bolt, Oviatt et al., Martin, Bellik and Latoschik. Additionally, managing groups of users involved in collaborative tasks in Collaborative VEs (CVEs) is increasingly required (e.g. Margery et al. and Salzmann et al.). Many studies focus on multimodal or collaborative interactions in Virtual Environments, but to our knowledge none covers these topics in conjunction. This is what motivated the present work, further described in P. Martin et al.
2 SIMILARITIES BETWEEN MULTIMODAL AND COLLABORATIVE INTERACTIONS
Several critical points of the multimodal process can be inferred from studies on multimodal interaction: disambiguation (comparing users' inputs), decision (generating the actions desired by a user) and dialog (decision management, identifying additional treatments to be applied to incoming actions). Previous multimodal approaches have addressed single-user, non-cooperative scenarios. But improvements in VR technology now allow, for instance, co-located users to cooperate in a shared virtual scene. The multi-user context is therefore a key point in the multimodal process. Such a system must be able to handle interactions from several users, so one must know when data fusion may occur, but also what data can be merged. What happens if pieces of information coming from several users have to be combined? At which level do the pieces of information have to be merged? These new questions about multi-user multimodal systems reveal similarities between collaborative and multimodal processes. We should not restrict the notion of collaboration to simultaneous activities only; we need a broader view that includes the notions of dialog and continuity. It is not solely a question of solving some constraints at a given time, but rather of ongoing management over time. Thus, to bring these two processes closer together, we have chosen to extend the classification of collaborative interactions proposed by Margery et al. That classification focused mainly on users' collaborative manipulations. Our extended classification is task-oriented, which is more appropriate to a multimodal approach:
Basic cooperation (level 1): users can perceive each other and can communicate. This level comprises two general situations:
(a) co-located interaction: users are immersed in the same place or in the same display. Depending on the technology used (full or limited cohabitation), they can communicate naturally. For instance, in a multi-user CAVE with a multi-stereoscopic visualization system, users have full cohabitation: they can see or touch each other and hold a natural dialog. With HMD systems, users have limited cohabitation: they can hold natural conversations, but they do not see each other physically.
(b) remote interaction: users are immersed in a virtual world but are distant (virtual presence). Multimedia communication and avatars are therefore required.
Parallel tasks (level 2): the notion of task includes the manipulation of scene objects (cf. Margery's classification) but also the command aspect used to control the application. Here, users can act on the scene individually: a given task is performed by only one user. This level has two subdivisions, (a) constrained tasks (by scene design) and (b) free tasks. Users' interactions are completely independent of each other, whatever the different processes.
Cooperative tasks (level 3): users can cooperate on the same object or within the same task. This level has three subdivisions:
(a) independent tasks: users generate tasks with a similar target (but independent properties) which can be performed independently of each other.
(b) synergistic task: users' interactions can be combined, at any level of treatment, to generate one single task. This concept includes the previous co-manipulations and is close to the complementarity concept of Martin.
(c) co-dependent task: users generate dependent tasks (i.e. similar or competing tasks on the same target). These co-dependent tasks cannot be performed immediately because ambiguities have to be resolved. With similar tasks comes a concept of redundancy, also close to that of Martin.
Figure 1: The framework architecture. The Multimodal and Collaborative Supervisor (MCS) is configured and initialized using XML files. The MCS processes users' inputs, taking into account rules and context, and transmits the results to the application.
IEEE Virtual Reality 2011, 19-23 March, Singapore. 978-1-4577-0038-5/11/$26.00 ©2011 IEEE
Level 1 of collaboration clearly depends on the technology rather than on the system used. But we have also noticed that levels 2 and 3 could be introduced in a multimodal process: merging the events of each user separately ensures level 2, while combining the events of all users ensures level 3. This is one of the key points of our system.
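The distinction between the two levels can be illustrated with a minimal Python sketch. It is not the MCS implementation: the `Event` type, the function names and the time-window clustering rule (events at most `window` seconds apart are merged) are all assumptions made for illustration. Level 2 applies the same fusion routine to each user's events separately, whereas level 3 applies it to the events of all users together.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """Hypothetical input event; field names are illustrative only."""
    user: str
    modality: str   # e.g. "speech" or "gesture"
    content: str
    time: float     # timestamp in seconds

def fuse(events, window=1.0):
    """Cluster events whose consecutive timestamps are at most `window` apart.

    The time window is an assumed fusion criterion, not one taken from the paper.
    """
    clusters = []
    for ev in sorted(events, key=lambda e: e.time):
        if clusters and ev.time - clusters[-1][-1].time <= window:
            clusters[-1].append(ev)
        else:
            clusters.append([ev])
    return clusters

def fuse_level2(events, window=1.0):
    """Level 2 (parallel tasks): fuse the events of each user separately."""
    by_user = {}
    for ev in events:
        by_user.setdefault(ev.user, []).append(ev)
    return {user: fuse(evs, window) for user, evs in by_user.items()}

def fuse_level3(events, window=1.0):
    """Level 3 (cooperative tasks): fuse the events of all users together."""
    return fuse(events, window)

events = [
    Event("alice", "speech", "move that", 0.0),
    Event("alice", "gesture", "point:chair", 0.4),
    Event("bob", "gesture", "point:table", 0.6),
    Event("bob", "speech", "there", 0.9),
]
level2 = fuse_level2(events)   # one cluster per user, never mixing users
level3 = fuse_level3(events)   # clusters may mix events from several users
```

With these sample events, the per-user fusion keeps the inputs of the two users apart, while the cross-user fusion places all four events in a single cluster that a later stage could interpret as one cooperative task.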
3 OVERVIEW OF THE MAIN FEATURES OF THE MCS ARCHITECTURE
Our work focuses on the design of a reconfigurable multimodal and collaborative supervisor (MCS) for VE applications (see Fig. 1). It performs late fusion to integrate, at a semantic level, information coming from several users. The MCS provides the components required for its complete integration with a VR application/platform: an input interface (interpreters), a processing core (interpretation fusion, argument fusion, command manager) and an output interface (command manager). The input interface processes users' inputs, the MCS core handles multimodal and collaborative treatments, and the output interface packages the final commands. Thanks to its division into four stages, the MCS covers the disambiguation, decision and dialog phases (three critical points of collaborative and multimodal processes) and addresses equivalence, redundancy and complementarity.
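As a rough illustration of how four such stages might chain together, the following Python sketch passes a few raw inputs through an interpreter, an interpretation fusion step, an argument fusion step and a command manager. Only the stage names come from the text. The behavior given to each stage, the data shapes and every function name are assumptions made for this example, not the actual MCS design.

```python
# Hypothetical sketch: stage names follow the text, their behavior is assumed.

def interpret(raw):
    """Input interface: turn a raw (user, modality, value) tuple into a semantic dict."""
    user, modality, value = raw
    if modality == "speech":
        return {"user": user, "action": value}
    return {"user": user, "target": value}   # e.g. a pointing gesture

def fuse_interpretations(interps):
    """Interpretation fusion (assumed): drop redundant, identical interpretations."""
    unique = []
    for it in interps:
        if it not in unique:
            unique.append(it)
    return unique

def fuse_arguments(interps):
    """Argument fusion (assumed): combine complementary slots into one candidate command."""
    command = {"users": []}
    for it in interps:
        for key, value in it.items():
            if key == "user":
                if value not in command["users"]:
                    command["users"].append(value)
            else:
                command.setdefault(key, value)   # first value wins in this sketch
    return command

def manage_command(command):
    """Command manager (assumed): emit a command only when its slots are filled."""
    if "action" in command and "target" in command:
        return f'{command["action"]}({command["target"]})'
    return None   # incomplete: a dialog phase would wait for more input

def supervise(raw_inputs):
    """Run the four stages in sequence on a batch of raw inputs."""
    interps = [interpret(r) for r in raw_inputs]
    return manage_command(fuse_arguments(fuse_interpretations(interps)))

raw_inputs = [
    ("alice", "speech", "delete"),
    ("alice", "speech", "delete"),   # redundant repeat of the same input
    ("bob", "gesture", "chair"),     # complementary argument from another user
]
result = supervise(raw_inputs)       # "delete(chair)"
```

In this toy run the repeated speech input is treated as redundant, and the gesture of a second user supplies the complementary argument, so a single command is produced. A lone speech input without a target yields no command.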
We are conducting research on VR interaction with industrial partners in order to evaluate the potential use of the proposed approach. We applied our supervisor to a collaborative situation, entitled MalCoMIICs, for multimodal and co-localized multi-user interactions for immersive collaborations. This application uses the EVE system, a new multi-user and multi-sensorimotor CAVE-like set-up.
REFERENCES
Y. Bellik. Media integration in multimodal interfaces. In Proc. of the IEEE Workshop on Multimedia Signal Processing, pages 31-36, 1997.
R. A. Bolt. "Put-that-there": Voice and gesture at the graphics interface. In SIGGRAPH '80: Proc. of the 7th annual conference on Computer graphics and interactive techniques, pages 262-270, New York, NY, USA, 1980. ACM.
M. E. Latoschik. A user interface framework for multimodal VR interactions. In ICMI '05: Proc. of the 7th international conference on Multimodal interfaces, pages 76-83, New York, NY, USA, 2005. ACM.
D. Margery, B. Arnaldi, and N. Plouzeau. A general framework for cooperative manipulation in virtual environments. In Virtual Environments, volume 99, pages 169-178, 1999.
J.-C. Martin. Tycoon: Theoretical framework and software tools for multimodal interfaces. In John Lee (Ed.), Intelligence and Multimodality in Multimedia Interfaces. AAAI Press, 1998.
P. Martin, P. Bourdot, and D. Touraine. A reconfigurable architecture for