
William Steptoe*
Department of Computer Science, University College London, London WC1E 6BT, UK

Jean-Marie Normand
Event-Lab, University of Barcelona

Oyewole Oyekoya
Fabrizio Pece
Department of Computer Science, University College London, London WC1E 6BT, UK

Elias Giannopoulos
Event-Lab, University of Barcelona

Franco Tecchia
PERCRO, Scuola Superiore Sant’Anna

Anthony Steed
Tim Weyrich
Jan Kautz
Department of Computer Science, University College London, London WC1E 6BT, UK

Mel Slater
Department of Computer Science, University College London, London WC1E 6BT, UK, and ICREA, University of Barcelona

*Correspondence to [email protected].

Presence, Vol. 21, No. 4, Fall 2012, 406–422
© 2012 by the Massachusetts Institute of Technology

Acting Rehearsal in Collaborative Multimodal Mixed Reality Environments

Abstract

This paper presents the use of our multimodal mixed reality telecommunication system to support remote acting rehearsal. The rehearsals involved two actors, located in London and Barcelona, and a director in another location in London. This triadic audiovisual telecommunication was performed in a spatial and multimodal collaborative mixed reality environment based on the “destination-visitor” paradigm, which we define and put into use. We detail our heterogeneous system architecture, which spans the three distributed and technologically asymmetric sites, and features a range of capture, display, and transmission technologies. The actors’ and director’s experience of rehearsing a scene via the system is then discussed, exploring successes and failures of this heterogeneous form of telecollaboration. Overall, the common spatial frame of reference presented by the system to all parties was highly conducive to theatrical acting and directing, allowing blocking, gross gesture, and unambiguous instruction to be issued. The relative inexpressivity of the actors’ embodiments was identified as the central limitation of the telecommunication, meaning that moments relying on performing and reacting to consequential facial expression and subtle gesture were less successful.

1 Introduction

The Virtuality Continuum (Milgram & Kishino, 1994) describes virtual reality (VR)-related display technologies in terms of their relative extents of presenting real and virtual stimuli. The spectrum ranges from the display of a real environment (e.g., video) at one end, to a purely synthetic virtual environment (VE) at the other. Mixed reality (MR) occupies the range of the continuum between these extrema, merging both real and virtual objects. MR was originally considered on a per-display basis, and later broadened to consider the joining together of distributed physical locations to form MR environments (Benford, Greenhalgh, Reynard, Brown, & Koleva, 1998). When discussing MR displays, it is sufficient to do so with regard to Milgram and Kishino’s evolving taxonomy (Milgram & Kishino, 1994) that ranges from handheld devices through to immersive projection technologies (IPTs) such as the CAVE and head-mounted displays (HMDs). However, MR environments as outlined by Benford et al. (1998) bring together a range of technologies, including situated and mobile displays and capture devices, with the aim of supporting high-quality spatial telecommunication. The work presented in this paper concerns a system based on an emerging mode of telecommunication that we refer to as the “destination-visitor” paradigm. We assess our distributed heterogeneous MR environment (that also features MR displays) through a case study of remote acting rehearsal.

Telecommunication has long been a driving force behind the development of MR- and VR-related technologies. VR’s flexible, immersive, and spatial characteristics provide several opportunities over common modes of audiovisual remote interaction such as video conferencing. In the case of remote acting rehearsal, the use of a nonimmersive VE allowing actors to both control their own avatars and observe others within a shared virtual space was found to be able to form a basis for successful live performance that may not have been achievable using minimal video conferencing (Slater, Howell, Steed, Pertaub, & Garau, 2000). Another study investigating object-focused puzzle solving found that pairs of participants interacting between remote IPTs were able to perform the task almost as well as when they were collocated, and significantly better than when one participant used an IPT while the other used a nonimmersive desktop system (Schroeder et al., 2001). This latter arrangement in Schroeder et al.’s study is an example of asymmetric telecommunication, in the sense that the technological experience of the two interacting participants was quite different: one was immersed in a fully-surrounding IPT featuring body tracking and perspective-correct stereoscopic rendering, while the other used standard input devices on a 2D consumer display.

Asymmetry pertains both to the work presented in this paper, and to an ongoing tension between the technological asymmetries often intrinsic to media spaces and our natural desire for social symmetry between participants (A. Voida, S. Voida, Greenberg, & He, 2008). Traditionally, media space research has striven for technological symmetry: an aim that is likely borne from our daily experience of reciprocity in collocated face-to-face interaction, in which the same sensory cues and affordances are generally available to all participants (i.e., social symmetry). Studies in the VE literature illustrate that technological asymmetry affects the social dynamics of virtual interaction. Indeed, participants interacting over a technologically asymmetric system are unable to make discriminating judgments about joint experience (Axelsson et al., 1999). Immersing one participant to a greater degree than their interactional partner makes it significantly more likely for the more-immersed participant to be singled out as the social leader (Slater, Sadagic, Usoh, & Schroeder, 2000), and for the less-immersed participant to be evaluated as contributing less to cooperative tasks (Schroeder et al., 2001). The correlation between technological and social symmetry has significant implications for the conduct of both personal and business communication, and this is likely to account for the traditional desire to design for technological symmetry.

Opposing this tradition, recent innovations in MR telepresence research are embracing technological asymmetry. This trend has arisen naturally, due to increased distribution among teams that are mostly collocated except for one or two members. Acknowledging this geographical distribution in personnel, recent research has focused particularly on how to present remote participants in MR environments. This has been achieved through the use of both situated displays (Venolia et al., 2010) and mobile personal telepresence robots (Tsui, Desai, Yanco, & Uhlik, 2011). Technology used in such MR environments aims to augment a place with virtual content that provides communicational value to the connected group members. Thus, the central concept of this paradigm involves people at a destination (also termed a hub; Venolia et al., 2010; Tsui et al., 2011) that is augmented with technologies aiming to both represent and bestow visitors (also termed satellites, Venolia et al., 2010; or spokes, Tsui et al., 2011) with both physical and social presence to support high-quality remote interaction.

Throughout this paper, we refer to such systems as being examples of the destination-visitor paradigm, and the system that we detail in the following section is an archetype of this paradigm. Studies of systems adopting the destination-visitor paradigm have so far focused on optimizing the social presence of a visitor for the benefit of those at the destination, but have focused less on the experience of that visitor. Consequently, while the visitor is often represented at the destination with a high degree of physical presence (e.g., as a mobile telepresence robot or a situated display), the destination and its collocated people are not represented to the visitor with an analogous level of fidelity and spatiality. Hence, in systems such as those presented in Venolia et al. (2010) and Tsui et al. (2011), social asymmetry is likely to arise from this unequal consideration of locals’ and visitors’ sensory experiences that arise due to the differing technological quality over the two sites.

Our approach to the destination-visitor paradigm presented in this paper aims to provide a more consistent degree of social symmetry to all engaged parties interacting via the multimodal and highly technologically-asymmetric system. Hence, a visitor’s visual experience of both the destination and of its copresent locals is provided by a range of high-fidelity immersive MR display modes viewed from a first-person perspective. Meanwhile, at the destination site, the visitor is represented using a range of display modes that allow both full-body gesture and movement to be observed, and spatial attention to be naturally perceived. The system aims to provide an audiovisual telecommunication platform featuring both communicational and spatial consistency between all engaged parties, whether they are locals or visitors, thereby fostering social symmetry in an otherwise technologically asymmetric system.

The following section details our system topology, which grants a flexible approach to the destination-visitor paradigm and supports a variety of synchronous networked capture and display technologies. The acting rehearsals are then discussed, exploring the actors’ and director’s experience of performing together via the system. We decided to study acting, as opposed to pure social interaction and collaboration, which is the principal application of telecommunication systems, for several reasons. The aim of a theatrical rehearsal is to practice and refine the performance of a scene. We chose a scene consisting of varied spatial and interpersonal interplay between two characters. Thus, the actors engage in rich conversation, spatial action, directing attention, and handling objects. Such activity is commonplace in the type of unplanned social interaction and collaborative work that the system aspires to support. Hence, through repeated and evolving run-throughs of the scene with professional actors and a director, this structured activity forms an excellent basis for analytic knowledge regarding the successes and failures of the system. Throughout this paper, we are primarily concerned with the visual mode of telecommunication. While there are several other sensory cues we are actively working on, including spatialized audio and touch, these are scoped for future work.

This work is a continuation of a previous setup and study that investigated acting in an immersive collaborative VE (Normand et al., 2012). The current paper introduces the destination-visitor paradigm, which presents a highly asymmetric multimodal collaborative MR environment.

    2 System Architecture

This section presents a high-level overview of our flexible and heterogeneous system architecture based on the destination-visitor paradigm. In Section 3, we investigate how well the system is able to support geographically remote acting rehearsal between two actors and one director. Hence, the setup we present is tailored to suit this spatially and visually dependent application scenario. It should be stated that the technological arrangement is flexible, and different acquisition and display devices, and also visitor sites, may be introduced or removed. Figure 1 illustrates the distinct arrangements at each of the three sites in our particular studied setup: the physical destination, where one actor will be located, is equipped with a range of capture and display technologies; the visitor site, at which the second actor will be located, is composed of an immersive HMD-based VR system with full-body tracking; and the director’s setup is an immersive CAVE-like system, although it could be a standard machine located anywhere.

In this triadic interaction, there are two people (the visiting actor and the director) who may be classed, according to the paradigm’s specifications, as visitors.


Figure 1. Asymmetric technical arrangements at the three sites to acquire and display places and people. The destination and visitor sites feature varying acquisition and display technologies, while the director (spectator) site only requires display technology.

However, the director is more comfortably classified as a spectator, as he is not visually represented to either actor. This enables the director to navigate through the rehearsal space without causing visual distraction or occlusion to the two actors. Acquisition and display technologies at each site are now discussed, followed by an overview of the transmission protocols operating between sites for the various media streams. Acquisition at the destination site dictates the display characteristics at the visitor site and vice versa. We focus solely on the visual components of the system, as aural communication is supported using Skype.1 We identify spatialized 3D audio as an area of future work.

    1. http://www.skype.com

    2.1 Destination Site

The destination site is a physical meeting room, approximately 5 × 3 × 3 m, in which the destination actor is physically located, which the visiting actor will remotely perceive and be virtually represented within, and which the director will remotely perceive, including virtual representations of both actors. Hence, the destination must be equipped with technology able to acquire both the environment and the copresent actor in order to transmit to the visitor and director sites, and also with display technology to represent the visiting actor. The unique feature of the destination is that it should largely remain a standard meeting room to any collocated people, in the sense that they should not be encumbered by worn devices such as wired sensors or HMDs in order to partake in the interaction.

Figure 2. The three display modes available at the visitor site, from left to right: spherical video from the Ladybug 3, a hybrid approach featuring embedded 2.5D Kinect video of the destination actor within a VE, and a pure VE featuring Kinect-tracked embodied avatars.

2.1.1 Acquisition. We have implemented two methods of visual capture of the local environment, and three methods of visual capture of the local actor. Figure 2 shows how all of these modes appear at the visitor site, and these are discussed in Section 2.2. Spherical (∼288◦) video acquisition, which implicitly captures both actor and environment, is achieved with a Point Grey Research Ladybug 3 camera.2 The camera combines the views acquired by six 2 MP (1600 × 1200) Sony CCD sensors into a single panoramic image, which has a 12 MP (5400 × 2700) resolution. The process of de-mosaicing, conversion, and stitching together the different views is done entirely in software on the host machine. This solution, while giving control of the acquired image, creates an overhead in the computational time required to acquire a surround video. Hence, the Ladybug 3 is capable of recording surround video at the relatively low frame rate of 15 Hz at full resolution. This results in a potential bandwidth requirement of ∼90 MB/s (assuming that each pixel is described as an RGB, 8-bit unsigned character). However, this unmanageable wide-area bandwidth requirement is drastically reduced to ∼1 MB/s by reducing the sent resolution to 2048 × 1024 pixels and using compression algorithms that we cover in Section 2.4. In summary, the Ladybug 3 provides a simple means of visually capturing the destination and the people within. This surrounding acquisition is directly compatible with the immersive display characteristics at the visitor and director sites, both of which feature wide field-of-view displays.

2. http://www.ptgrey.com

The second method of environment capture is a hand-modeled premade 3D textured scale model of the destination room. The model’s wall textures are reprojected from a 50 MP (10,000 × 5,000) panorama, while the furniture textures are extracted from single photographs. During system use, the means of destination acquisition determines key aspects of the visitor’s and spectator’s experience, including dynamism, spatiality, and fidelity. These issues are outlined in the following sections describing display at the visitor and director sites, and are explored during system use in Section 3 covering the acting rehearsals.

As described above, the first method of real-time dynamic visual capture of the actor at the destination site is achieved with the spherical video acquired by the Ladybug 3. The other two modes of capturing the destination actor make use of the abilities of Microsoft’s Kinect (depth plus RGB) sensor.3 The Kinect features a depth sensor consisting of an infrared laser projector combined with a monochrome CMOS sensor. This allows the camera to estimate 3D information under most ambient lighting conditions. Depth is estimated by projecting a known pattern of speckles into the scene. The way that these speckles are distributed in the scene allows the sensor to reconstruct a depth map of the environment. The hardware consists of two cameras that output two videos at 30 Hz, with the RGB video stream at 8-bit VGA resolution (640 × 480 pixels) and the monochrome video stream used for depth sensing at 11-bit VGA resolution (640 × 480 pixels with 2,048 levels of sensitivity).

3. www.xbox.com/kinect

The first Kinect-based solution for capturing the destination actor makes use of the OpenNI and NITE middleware libraries.4 OpenNI (Open Natural Interaction) is a multi-language, cross-platform framework that defines APIs for writing applications utilizing natural interaction. Natural interaction is an HCI concept that refers to the interface between the physical and virtual domains based on detection of human action and behavior, in particular bodily movement and voice. To this end, PrimeSense’s NITE (Natural Interaction Technology for End-user) adds skeletal recognition functionality to the OpenNI framework. Making use of the depth-sensing abilities of the Kinect, NITE is able to track an individual’s bodily movement, which it represents as a series of joints within a skeletal hierarchy. While the tracking data are not as high-fidelity, high-frequency, or low-latency as a professional motion capture system (as discussed in Section 2.2 detailing the visitor site), NITE has the significant advantages of being markerless, and requiring minimal technical setup and calibration time. The calculated skeletal data are transmitted to the visitor and director sites at 30 Hz, and are used to animate a graphical avatar representing the destination actor. The avatar is positioned correctly within the VE representation mode of the destination, but would not be suited to the 2D spherical Ladybug 3 video mode. This highlights the interplay between capture at one site and display at another.
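The paper does not give implementation details for this skeleton-to-avatar path, but the described flow can be illustrated with a minimal sketch. The types, joint names, and publish hook below are assumptions introduced for illustration; they are not the authors’ code and not the OpenNI/NITE API.

```cpp
// Illustrative sketch only (not the authors' implementation or the NITE API):
// each tracked skeleton sample is stamped with a time from the central time
// server and pushed, joint by joint, toward the shared-object database that
// drives the avatar at the remote sites.
#include <array>
#include <cstdio>
#include <string>

struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };

struct JointSample {
    std::string name;      // must match a bone in the avatar rig, e.g., "left_elbow"
    Vec3        position;  // meters, in the shared destination coordinate frame
    Quat        rotation;  // NITE reports orientation for only a subset of joints
    double      timestamp; // seconds, stamped from the central time server
};

// Hypothetical hook: in the real system this would update the corresponding
// replicated shared object so that all interested clients receive the change.
void publishJointUpdate(const JointSample& s) {
    std::printf("%s %.3f %.3f %.3f @%.3f\n",
                s.name.c_str(), s.position.x, s.position.y, s.position.z, s.timestamp);
}

// Called at ~30 Hz with the latest markerless skeleton (15 joints assumed here).
void onSkeletonFrame(std::array<JointSample, 15> skeleton, double now) {
    for (JointSample& joint : skeleton) {
        joint.timestamp = now;
        publishJointUpdate(joint);
    }
}
```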

The final mode of capturing the destination actor, and the second Kinect-based mode, streams a 2.5D point-based textured video representation of the actor, independent of the environment, at 30 Hz. This representation may be considered a 2.5D video avatar, as we only employ one front-facing Kinect to record the actor, and thus do not provide coverage of the rear half of the body. It is likely possible to use several Kinects to ensure that the destination actor is fully captured, as shown by Maimone and Fuchs (2012), but there are potential interference problems and this is an area for future work. In order to segment the destination actor from his environment, we combine knowledge from the NITE-based skeletal tracking in order to only consider pixels within the depth range at which the tracked joints are currently positioned.

4. http://openni.org
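As a rough illustration of this depth-range segmentation, the sketch below keeps only those Kinect depth pixels that fall within a band around the tracked joints’ current depths. The margin, units, and function names are assumptions; the authors do not publish this routine.

```cpp
// Illustrative segmentation sketch: retain depth pixels near the tracked
// joints' depths and discard the rest as background. Kinect depth is assumed
// to be reported in millimetres, with 0 meaning "no reading".
#include <algorithm>
#include <cstdint>
#include <vector>

// depth:        640x480 Kinect depth map, row-major.
// jointDepths:  z-values (mm) of the NITE-tracked joints for the actor.
// Returns a mask with 1 for foreground (actor) pixels, 0 for background.
std::vector<uint8_t> segmentActor(const std::vector<uint16_t>& depth,
                                  const std::vector<uint16_t>& jointDepths,
                                  uint16_t marginMm = 300) {
    std::vector<uint8_t> mask(depth.size(), 0);
    if (jointDepths.empty()) return mask;

    auto [minIt, maxIt] = std::minmax_element(jointDepths.begin(), jointDepths.end());
    const uint16_t zNear = (*minIt > marginMm) ? static_cast<uint16_t>(*minIt - marginMm) : 0;
    const uint16_t zFar  = static_cast<uint16_t>(*maxIt + marginMm);

    for (std::size_t i = 0; i < depth.size(); ++i) {
        const uint16_t z = depth[i];
        mask[i] = (z != 0 && z >= zNear && z <= zFar) ? 1 : 0;
    }
    return mask;
}
```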

2.1.2 Display. Display technology located at the destination is responsible for representing the visiting actress in a manner that fosters a physical presence. We consider mobile telepresence robots (Tsui et al., 2011) and embodied social proxies (Venolia et al., 2010) to bestow the wearer with a high degree of physical presence, as they provide a mobile or situated totem by which they may be spatially located, looked toward, and referred to. To this end, our current system architecture offers two solutions: a large high-resolution avatar projection and a novel 360◦ spherical display showing an avatar head only. While both are not mobile (this is also a limitation of the current destination acquisition technology, and discussed in Section 3), and so restrict the visitor’s gross movement around the destination, each display type aims to provide a distinct benefit to the remote interaction.

Firstly, the projection avatar enables life-size and full-body embodiment of the visitor. Due to the corresponding full-body motion capture setup at the visitor site (detailed in the following section), the avatar is puppeteered in real time. Thus, nuances of the visiting actress’ body language are represented. Due to the large 3 × 2 m screen size, gross movement on the horizontal and vertical axes, and to a lesser extent on the forward-backward axis, is also supported. So, for instance, the actions of the visiting actress pacing from left to right, sitting down, and standing up will be shown clearly by the projection avatar. Furthermore, this movement will be consistent both with respect to where the visiting actress perceives herself to be in the destination and where the destination actor perceives her to be. However, forward and backward movement is limited, as the visual result of such movement is that the visitor’s representation moves closer to or farther away from the virtual camera, resulting either in increased or reduced size rather than being displayed at the physical position in the destination. Stereoscopic visualization could bestow the projection avatar with a more convincing sense of the depth at which the visitor is positioned, but (unlike a CAVE) the single-walled display featured at the destination would restrict this illusion.

The second mode of visitor display is the use of a Global Imagination Magic Planet.5 This spherical display aims to enhance the ease of determining the 360◦ directional attention of the visiting actress, and also to bestow her with a greater sense of physical presence at the destination. The avatar head as displayed on the Magic Planet rotates and animates in real time based on head tracking, eye tracking, and voice-detection data acquired at the visitor site. So, as the visiting actress rotates her head, directs her gaze, and talks, her sphere avatar appears to do the same. Due to the alignment between all participants’ perception of the shared space and the people within, the sphere avatar allows for accurate 360◦ gaze awareness. The nature of the head-only situated sphere avatar implies that it is unable to represent the visiting actress’ movement around the destination space. However, due to the system’s flexible architecture, the simultaneous dual representation of both projection and sphere avatars is supported, harnessing the benefits of both display types. Stereo speakers and a microphone support verbal communication. Implementation details, together with a user study, can be found in Oyekoya, Steptoe, and Steed (2012).

    2.2 Visitor Site

The technology at the visitor site is responsible for both capture of the visiting actress and the immersive display of the destination and its collocated actor. In our current setup, this comprises a VR facility at which the technologies for acquisition and display are a full-body motion capture system and an immersive HMD, respectively. The physical characteristics of the visitor site are largely inconsequential, as the visitor will be immersed in a virtual representation of the destination, and the acquisition that occurs here is solely that of the visitor herself. We advocate the use of state-of-the-art IPT through the desire to foster social symmetry between all remote interactants. To this end, the HMD stimulates the wearer’s near-complete field-of-view, thus displaying a spatial and surrounding representation of the destination and the locals within, while the motion-capture data transmit body movement and nonverbal subtleties from which the visitor’s avatar embodiment is animated at the destination site.

5. http://www.globalimagination.com

2.2.1 Acquisition. Capture of the visitor is achieved using a NaturalPoint Optitrack6 motion capture system consisting of twelve cameras. The motion capture volume is approximately 3 × 3 × 2.5 m, which allows a one-to-one mapping between the visitor’s movements in the perceived virtual destination and the position of her embodiment at the physical destination. By tracking the positions of reflective markers attached to the fitted motion capture suit, the system calculates near full-body skeletal movement (not fingers or toes) at submillimeter precision at 100 Hz. This skeletal data has higher fidelity than the equivalent Kinect NITE tracking at the destination, but with the trade-off that the visiting actress must wear a motion capture suit as shown in Figure 1, and the total setup and calibration time is greater (∼20 m as opposed to

2.2.2 Display. The HMD presents the visitor with one of the three display modes of the destination shown in Figure 2: spherical video captured by the Ladybug 3, a hybrid mode embedding the 2.5D Kinect video of the destination actor within a VE, and a pure VE featuring Kinect-tracked and animated avatars. Note that in all three modes, the visitor can look down and see her own virtual body: an important visual cue for spatial reference and presence in a VE (Mohler, Creem-Regehr, Thompson, & Bulthoff, 2010). The three modes are spatially aligned, and provide a surrounding visual environment based on the physical destination room. VRMedia’s XVR (Tecchia et al., 2010) software framework is used to render the VE, and the avatars are rendered using the Hardware Accelerated Library for Character Animation (HALCA; Gillies & Spanlang, 2010). Stereo speakers and a microphone are used, as at the destination, to support verbal communication.

    2.3 Director Site

Since the director is not represented visually to the actors, the director site does not require any specific acquisition technology. Hence, we mainly focus on the director site’s display system. In order to conduct and critique the rehearsal between the two remote actors, the director must have audiovisual and spatial reference to the unfolding interaction. The audiovisual component could be achieved through the use of a standard PC displaying the shared VE, and navigated using standard input devices. However, this would diminish the director’s ability to naturally instruct the actors, and observe and refer to locations in the destination. Hence, we position the director in a four-walled CAVE-like system, where a surrounding and perspective-correct stereoscopic view of the destination VE and the two actors is displayed. This enables the director to navigate freely and naturally in the rehearsal space, shifting his viewpoint as desired through head-tracked parallax. Thus, the director views a VE of the destination, populated with the avatar embodiments of the two remote actors.

Due to the differing acquisition and display technologies at the destination and visitor sites, an asymmetry is introduced between the two actors in terms of both fidelity and movement ability. The relatively low fidelity of the destination’s Kinect NITE-based tracking compared with that of the visitor’s Optitrack motion capture has been specified. The implication of this asymmetry is that the destination actor may be perceived to have fewer degrees of freedom (e.g., NITE does not track head, wrist, and ankle orientation) than the visiting actress, and as a result may appear more rigid and, due to the lower capture rate, less dynamic. The restriction upon the visitor’s range of movement, due to the situated nature of the displays at the destination, has also been noted. This restriction must simply be acknowledged when planning the artistic direction of the scene. These issues are further expounded in Section 3. Headphones and a microphone complete the verbal communication.

    2.4 Transmission

Communication between participants distributed over the three international sites relies on low-latency data transmission. The multimodal nature of the various media streams originating from the range of acquisition devices at the destination and visitor sites means that a monolithic transmission solution is inappropriate, and would likely result in network congestion. Hence, we divide the media into two types by bandwidth requirement.

Low-bandwidth data comprise session management and skeletal motion capture data based on discrete shared objects described by numeric or string-based properties. A client-server replicated shared object database using the cross-platform C++ engine RakNet by Jenkins Software is adopted.8 Clients create objects and publish updates, which are then replicated across all connected clients. Objects owned by a client include one to describe the client’s properties, and then as many as there are joints in the chosen avatar rig. As a tracked actor moves and gestures, the owning client will update the properties (generally positions and rotations) of the corresponding objects. These changes are then serialized and updated on all connected client databases. Different clients may be interested in different data. For instance, the client running the sphere avatar at the destination will update information received from the visiting actress’ head node only, while the client running the director’s CAVE system will update the full-body motion capture data from both actors and animate the avatars accordingly. Objects that are owned by other clients may be queried, retrieving any updates since the last query. Most of the data exchanged via the server arises from skeletal motion capture at the destination and visitor sites. The primary purpose of this data transfer is the real-time visualization of the actors’ avatar embodiments: the visiting actress being displayed at both the destination and director sites, and the destination actor being displayed at the director site and, depending on visualization mode, at the visitor site. An orthogonal use of the data is session logging for post analysis and replay. This logging is performed on the server, and writes all node updates to a human-readable file that is time-stamped from a central time server. In addition, the server also records an audio file of all participants’ talk in OGG-Vorbis format.9 Log files may be replayed on both nonimmersive and immersive displays, including the CAVE and HMDs.

8. http://www.jenkinssoft.com
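The shared-object pattern described above can be sketched roughly as follows. This is not RakNet’s actual replication API; the class, method names, and revision counter are illustrative assumptions that mirror the described create/update/query flow.

```cpp
// Illustrative sketch of the shared-object pattern (not RakNet API code):
// a client updates the properties of objects it owns, and polls replicas of
// objects owned by other clients for changes since its last query.
#include <cstdint>
#include <map>
#include <string>

struct ObjectState {
    std::map<std::string, float> numericProps;  // e.g., "pos.x", "rot.w"
    uint64_t revision = 0;                       // bumped on every serialized update
};

class SharedObjectClient {
public:
    // Owner side: update one property of an owned object; in the real system
    // the change would be serialized and replicated by the server.
    void updateOwned(const std::string& objectId,
                     const std::string& prop, float value) {
        ObjectState& o = local_[objectId];
        o.numericProps[prop] = value;
        ++o.revision;
        // ... serialize and send to the replication server here ...
    }

    // Consumer side: fetch a remote object only if it changed since the
    // caller's last known revision; returns nullptr otherwise.
    const ObjectState* queryRemote(const std::string& objectId,
                                   uint64_t sinceRevision) const {
        auto it = remote_.find(objectId);
        if (it == remote_.end() || it->second.revision <= sinceRevision) return nullptr;
        return &it->second;
    }

private:
    std::map<std::string, ObjectState> local_;   // objects this client owns
    std::map<std::string, ObjectState> remote_;  // replicas owned by other clients
};
```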

High-bandwidth data are composed of video acquired from the Ladybug 3 and Kinect cameras at the destination site and transmitted to the visitor site for display in the HMD. Several solutions to encode, transmit, and decode video streams, including the transmission of color-plus-depth data, were investigated. Regarding color-plus-depth, it may seem that one should be able to stream these depth videos using standard video codecs, such as Google’s VP810 or H.264.11 However, the quality degrades considerably, because the compression algorithms are geared toward standard three-channel (8-bit) color video, whereas depth videos are single-channel but have a higher bit depth (e.g., Kinect uses 11-bit depth). To this end, we have developed a novel encoding scheme that efficiently converts the single-channel depth images to standard 8-bit three-channel images, which can then be streamed using standard codecs. Our encoding scheme ensures that the sent depth values are received and decoded with a high degree of accuracy, and is detailed in Pece, Kautz, and Weyrich (2011).

9. http://www.vorbis.com
10. http://www.webmproject.org/tools/vp8-sdk
11. http://www.itu.int/rec/T-REC-H.264
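The channel-packing idea can be illustrated with a deliberately naive sketch that splits an 11-bit depth value across two 8-bit channels. Note that this is not the published scheme: Pece, Kautz, and Weyrich (2011) use smoother mappings precisely because a raw bit split like the one below is badly damaged by lossy codecs.

```cpp
// Naive illustration only: pack an 11-bit Kinect depth value into an 8-bit
// three-channel pixel so it can pass through an RGB video codec. The actual
// published encoding is designed to survive lossy compression; this is not it.
#include <cstdint>

struct Rgb8 { uint8_t r, g, b; };

// Encode: store the high 3 bits and low 8 bits of the 11-bit depth value.
inline Rgb8 encodeDepthNaive(uint16_t depth11) {
    Rgb8 px;
    px.r = static_cast<uint8_t>((depth11 >> 8) & 0x07); // high 3 bits
    px.g = static_cast<uint8_t>(depth11 & 0xFF);        // low 8 bits
    px.b = 0;                                           // unused channel
    return px;
}

// Decode: reassemble the 11-bit depth value on the receiving side.
inline uint16_t decodeDepthNaive(const Rgb8& px) {
    return static_cast<uint16_t>((static_cast<uint16_t>(px.r & 0x07) << 8) | px.g);
}
```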

Our current end-to-end video transmission solution implements Google’s VP8 encoding with RakNet streaming. The process is as follows: Microsoft’s DirectShow12 is used to initially acquire images from a camera source. The raw RGB images are converted to YUV space, after which the YUV image is compressed using the libvpx VP8 codec library.13 The compressed frames are then sent as a RakNet bitstream to a server process (in our case located in Pisa, Italy), which simply relays the stream to other connected peers such as the visitor site running the HMD. (RakNet’s bitstream class is a mechanism to compress and transmit generic raw data.) Upon being received, the compressed VP8 frames are decompressed to YUV and then converted back to RGB space. The OpenGL-based XVR renderer then transfers the RGB buffer to textures for final display. With the Ladybug 3, our implementation achieves frame rates of ∼13 Hz (from the original 15 Hz) in the visitor’s HMD, while the end-display frame rate of the Kinect 2.5D video runs at ∼20 Hz (from the original 30 Hz). End-to-end latency of transmitted frames is
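A hedged sketch of the send side of this pipeline, written against the standard libvpx encoder interface, is shown below. The RGB-to-YUV conversion and the RakNet bitstream transport are reduced to stub functions, and the configuration values are illustrative rather than the authors’ settings.

```cpp
// Sketch of the VP8 encode-and-relay step (illustrative; not the authors' code).
#include <vpx/vpx_encoder.h>
#include <vpx/vp8cx.h>
#include <cstddef>
#include <cstdint>

// Stubs standing in for the DirectShow/RGB-to-YUV front end and RakNet transport.
void convertRgbToI420(const uint8_t* rgb, vpx_image_t* yuv) { /* conversion elided */ }
void sendAsRakNetBitstream(const void* data, size_t size)   { /* transport elided  */ }

// Configure a VP8 encoder and allocate an I420 frame buffer.
bool initEncoder(vpx_codec_ctx_t* codec, vpx_image_t* img, int width, int height, int fps) {
    vpx_codec_enc_cfg_t cfg;
    if (vpx_codec_enc_config_default(vpx_codec_vp8_cx(), &cfg, 0)) return false;
    cfg.g_w = width;
    cfg.g_h = height;
    cfg.g_timebase.num = 1;
    cfg.g_timebase.den = fps;
    cfg.rc_target_bitrate = 2000;  // kbit/s; illustrative value only
    if (vpx_codec_enc_init(codec, vpx_codec_vp8_cx(), &cfg, 0)) return false;
    return vpx_img_alloc(img, VPX_IMG_FMT_I420, width, height, 1) != nullptr;
}

// Encode one already-converted I420 frame and relay every compressed packet.
void encodeAndSend(vpx_codec_ctx_t* codec, const vpx_image_t* img, vpx_codec_pts_t pts) {
    vpx_codec_encode(codec, img, pts, 1, 0, VPX_DL_REALTIME);
    vpx_codec_iter_t iter = nullptr;
    const vpx_codec_cx_pkt_t* pkt;
    while ((pkt = vpx_codec_get_cx_data(codec, &iter)) != nullptr) {
        if (pkt->kind == VPX_CODEC_CX_FRAME_PKT)
            sendAsRakNetBitstream(pkt->data.frame.buf, pkt->data.frame.sz);
    }
}
```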


Figure 3. Simplified system architecture over the three sites. The inner boxes within each site each represent a machine that is responsible for at least one element of the overall system. Hardware is signified by a preceding “–” symbol, while software processes are signified by a “+”. Lines between processes indicate wide-area network transmission: high-bandwidth video streams are shown as solid lines, while low-bandwidth session management and skeletal tracking data are indicated by dotted lines.

In Figure 3, high-bandwidth video streams are shown as solid lines, while low-bandwidth data, exchanged between clients connected to the shared object server, are indicated by dotted lines.

According to Figure 3, the destination site features five machines, responsible for: rendering the sphere avatar, rendering the projection avatar, tracking the local actor’s body movement, capturing and streaming spherical video, and capturing and streaming 2.5D video. The machines performing the avatar rendering and body tracking are connected via the Internet to the RakNet-based shared object server at the director site in order to send and receive updates to and from other connected clients. The machines responsible for the video streaming do not require access to the shared object database, so instead compress and stream their processed video via VP8 and RakNet to a server in Pisa, which then relays it to the HMD machine at the visitor site for decompression and display. Note that the 2.5D capture using the Kinect also features a local-area network link to the skeletal tracking process (performed by another Kinect) in order to segment the depth image based on the position of the local actor.

Figure 3 shows the visitor site, consisting of two machines. The machine responsible for acquisition runs an Optitrack motion capture system, which streams data to the display machine’s RakNet client process, connected on a local-area network. The display machine updates the network with the visiting actress’ tracking data, and also displays the destination environment in one of three modes, as illustrated in Figure 2. The Intersense IS-900 head tracker provides XVR with robust positional data in order for perspective-correct rendering in the HMD.14 This acquisition-display arrangement is tightly coupled due to the tracking data informing the HMD rendering, and thus there is logically only one shared-object client required at the visitor site. In contrast, the local actor tracking that occurs at the destination is independent from the sphere and projection avatar displays, and so three shared-object clients are required at the destination: one to update the destination actor’s skeletal tracking, and one each to read and display updates of the visiting actress’ avatar on the sphere and projection displays.

Finally, the director site consists of two machines. The CAVE (actually consisting of one master and four slave machines) has a client connected to the shared object server, and displays the destination VE and both actors as avatars. Intersense IS-900 head tracking enables perspective-correct rendering. The shared object server is located (arbitrarily) at the destination site, and provides a range of ports to which incoming clients may connect. The logging process also occurs on the server machine, recording both updates to the shared object database and talk between all participants.

    3 Acting Rehearsal

Three experienced theater actors/directors were paid £60 each to take part in the rehearsal, which took place over a period of 4 hr in a single afternoon. Prior to the rehearsal, the actors had, separately and apart, learned the “spider in the bathroom” scene from Woody Allen’s 1977 movie Annie Hall.15 The scene begins when Alvy, played by Woody Allen, receives an emergency phone call (actually a false, manufactured crisis) to come to Annie’s (played by Diane Keaton) apartment in the middle of the night. He arrives and an hysterical Annie wants to be rescued from a big spider in her bathroom. Initially disgusted (“Don’t you have a can of Raid in the house? I told you a thousand times. You should always keep a lot of insect spray. You never know who’s gonna crawl over.”), Alvy skirts around the issue for 2–3 min; firstly by discussing a rock concert program on Annie’s bureau, and then a National Review magazine that he finds on her coffee table. An arachnophobe himself, Alvy eventually goes on to thrash around in the bathroom, using Annie’s tennis racket as a swatter, in an attempt to kill the spider: “Don’t worry!” he calls from the bathroom, amidst the clatter of articles being knocked off a shelf.

14. http://www.intersense.com/pages/20/14
15. http://www.youtube.com/watch?v=XQMjrGnGHDY

This scene was chosen as it consists of varied spatial and interpersonal interplay between the two characters. Thus, the actors engage in intense talk on varied subjects, spatial action (particularly Alvy’s character), and directing attention toward and manipulation of objects in the environment. The scene’s duration is 3 min in the original movie. This short length allows for multiple run-throughs over the 4-hr rehearsal period, and encourages the director and actors to experiment with new ideas and methods toward the final performance. Alvy was portrayed by a male actor at the destination site at University College London (UCL), and Annie was played by an actress at the visitor site at University of Barcelona (UB). The male director was located in UCL’s CAVE facility, separate from the destination room. Drawing from post-rehearsal discussion with the actors and director, together with a group of three experienced theater artists and academics who were spectators at the two UCL sites, this section discusses the successes and failures of the rehearsals in terms of the central elements of spatiality and embodiment. Figure 4 illustrates the visual experience of the rehearsals over the three sites.

    3.1 Spatiality

The common spatial frame of reference experienced by all parties was highly conducive to the nature of theatrical acting and directing. The artists were able to perform blocking, referring to the precise movement and positioning of actors in the space, with relative ease. This was demonstrated through the director issuing both absolute and relative instructions interchangeably. For instance, instructions asking Alvy to pick up the magazine either “on the table” or “to Annie’s right” were both unambiguous to all parties due to the aligned visual environment. The director was able to issue such blocking instructions on both macro and micro scales, ranging from general positioning and Alvy’s point of entrance into the scene, down to the technical aspects of movement on a per-line basis.

Figure 4. The acting rehearsal in progress at each of the three sites. Left: The destination site at UCL featuring the copresent destination actor and the visiting actress’ representations. Center: The visual stimuli (running in VE/Avatar display mode) of the destination site and actor displayed in the HMD worn by the visiting actress (not pictured) at UB. Right: The director located in UCL’s CAVE displaying the destination site and avatars representing the two actors.

Figure 5. Screen captures from the rehearsal taken with our virtual replay tool, from left to right: picking up a leaflet from the bookcase; “That’s what you got me here for at 3 am in the morning, ’cause there’s a spider in the bathroom?”; passing a glass of chocolate milk; pleading with him to kill the spider “without squishing it”; and attempting to swat the spider: “don’t worry!”

The artists considered the asymmetry in allowed range of physical movement between the two actors, based on their status as a local or a visitor, as a limitation. The destination actor was free to move around the entire rehearsal space, and would be observed by the visiting actress and director as doing so. However, the visiting actress’ allowed movement was limited, particularly forward and backward, due to her situated representation at the destination. The two displays at the destination differ in terms of how accurately they represent the position of the visiting actress. The projection avatar display, which covers the whole rear wall of the destination site, is able to represent horizontal and vertical movement well. Depth cues, however, are less easily perceived, and a forward movement performed by the visiting actress results in the projection avatar getting larger, due to being closer to the virtual camera, rather than having physical presence at a location further into the destination room. The situated sphere display only shows an avatar head representation, and so cannot express bodily movement at all. The implications of this differing movement allowance between the two actors resulted in some frustrations for the director, who reacted by issuing more gross blocking instructions to Alvy, while focusing more on instructing Annie’s expressive gestures. This situation, however, matched the scene’s dynamics, in which Alvy is the more physically active of the two characters. Figure 5 illustrates some key moments during the scene, captured with our virtual replay tool.

The solution to this issue of the visitor being spatially restricted at the destination is the use of mobile displays. The use of personal telepresence robots is likely to solve one set of issues, to the detriment of others. On one hand, they would provide a physical entity with which the destination actor could “play distance,” and focus actions and emotions toward. However, the predominant design of such devices is a face-only LCD display, recorded from webcam video. Thus, the fuller body language and gestural ability provided by the avatar representation of the visitor would be missing, which is a critical cue used throughout the processes of acting and direction. In addition, if we are to grant an immersive experience to the visitor, then the ability to capture unobscured facial video is problematic in both HMD- and CAVE-based (Gross et al., 2003) VR systems due to worn devices. This is a rich area for future work, and we are developing more sophisticated articulated teleoperated robots such as those detailed in Peer, Pongrac, and Buss (2010). Finally, it should be noted that, particularly in film and television work, actors are highly skilled at working with props: either when there is nothing to play against, as in green-screen work, or with inanimate objects. Hence, the notion of an actor imbuing their imagination onto an empty space or object is a natural part of their process. This is a different situation from the practice of general interpersonal interaction, in which nonverbal cues provide information regarding the beliefs, desires, and intentions of an individual, and also provide indicators regarding various psychological states, including cognitive, emotional, physical, intentional, attentional, perceptual, interactional, and social (Duck, 1986).

Some general observations on the benefit of the common frame of reference were also made. For instance, our senior guest academic, the Artistic Director of the Royal Academy of Dramatic Art, discussed the way that many actors are able to learn their lines more quickly by physically being in the rehearsal room or theatrical set as opposed to being in a neutral location such as their own home. In particular, some older actors can only learn lines once they have established the blocking of a scene. Hence, the interactive and visual nature of the system was considered highly beneficial to the process of learning lines and planning movements, even in a solo rehearsal setting.

    3.2 Embodiment

The interactions were significantly influenced by modes of embodiment and display at each site. Firstly, it should be noted that throughout the rehearsal period, no critical failures in communication occurred. While we have not formally measured the end-to-end latency of all modes of capture, transmission, and display, this suggests that it is acceptable to support both the verbal and nonverbal triadic interaction. The initial period of acclimatization to the interaction paradigm resulted in some confusion between the three participants due to the evident asymmetry between them. Each party was unclear about the nature of the visual stimuli the others were perceiving. Once some initial descriptions were provided by each party (the destination actor only needed to provide minimal information as he was physically present in the place where the others were virtually present), the group became confident about the unified space they were all perceptually sharing, together with their displayed embodiments.

The initial period of the rehearsal was used to determine each participant’s local display preferences. At the destination site, the projection avatar was preferred over the sphere avatar or a dual representation of the visiting actress. The destination actor considered the projection avatar to provide more useful information through the display of full-body language, as opposed to the attentional cues that the head-only sphere display provided. Simultaneous use of the projection and sphere avatars was disliked, as it resulted in confusion due to division of attention between two locations. The projection avatar bestowed Annie with a higher degree of physical presence for Alvy to play against and observe (which enhanced the physicality of the performance from Alvy’s perspective). During the post-rehearsal discussion, Alvy recalled his excitement when Annie took a step toward him, and an impression of their close proximity was provided by the depth cue of Annie’s avatar increasing in size on the projection display: “When you do go close to the screen; when there are situations where you’re flirting, when she’s supposed to touch my chest and so on, that is really interesting because she’s in Barcelona and I’m here, but there’s still some part of you that tries to reach out and touch her hand on the screen. And when she reacts; for instance when I start smacking the floor looking for that spider, she automatically did that [gestures to cover his head] sort of thing. There was interplay between us; a natural reaction to what I was doing. That was exciting and when the project shined the most, in my eyes.”

The visiting actress wearing the HMD decided to observe the destination using the spherical video mode as captured by the Ladybug 3. This mode preserved the actual appearance of both the destination and Alvy, with the trade-off being a decreased perception of depth due to the monoscopic video. This mode was preferred over both the VE/Avatar and VE/2.5D video display modes, due to the improved dynamism of the video compared to the “stiff” avatar embodiment that did not feature emotional facial expression, and the clearer image of Alvy due to the higher-resolution camera.

Central to the system’s asymmetry are the physical abilities of the two actors depending on the site at which they are located. There are several moments during the scene when the actors are required to interact with each other and their environment. This includes knocking on a door; looking at, picking up, and passing objects; and hitting an imaginary spider. When performing such actions during the rehearsal, Alvy has a tangible sense of doing so due to the physicality of his local environment. So, when he knocks on the door or picks up a magazine, he is doing just that, and these actions (and sounds) are observed by both Annie and the director, albeit in varying visual forms. However, this ability does not extend to Annie, as, regardless of display mode, the visitor is only able to mime interaction with perceived objects that are, in reality, located at the destination. Fortunately, most of these moments in the scene belong to Alvy rather than Annie, so this issue did not result in critical failures. However, we identify support for shared objects as an area for future investigation.

The director in the CAVE viewed the rehearsals as a VE populated with the two actors’ virtual avatars. Due to Annie’s actual appearance not being captured by video cameras, an avatar is her only available mode of representation at the other sites. While both video and avatar representations of Alvy are available in the CAVE, the director preferred visual consistency, and chose the avatar representations in the VE. The VE was also considered to provide an excellent spatial representation of the destination that was free of the details and clutter that existed in the real-world location, thus providing a cleaner space in which to conduct the rehearsal. The director considered his ability to freely move and observe the actors within the rehearsal space as a powerful feature of the system. He was able to observe the scene from any viewpoint, which allowed him to move up close to the actors to instruct the expressive dynamics of their relationship, or stand back and observe their positions in the scene as a whole. The fact that the actors were represented as life-sized avatars aided direction by enhancing the interpersonal realism of the rehearsal. Both actors noted our decision not to visually represent the director. Although the benefit of the director’s unobstructive movements was universally acknowledged, the inability of the director to use nonverbal gesture, particularly pointing, was considered a hindrance to the rehearsal process. Allowing the director to make his representation visible or invisible to the actors is a potentially interesting avenue of investigation that may have implications for general telecommunication in such systems.

The overall impression of the abilities afforded by the actors’ embodiments over the three sites was that movement and general intent were communicated well, but details of expressive behavior were lacking. Facial expression, gaze, and finger movement were highlighted as the key missing features. (Our system is able to track, transmit, and represent gaze and finger movement with high fidelity, and some facial expression is supported; however, these cues require participants to wear encumbering devices, and so were decided against for this rehearsal application.) As a result, moments in the scene that have intended emotional prescience, such as those featuring flirting, fear, and touch, appeared flat. In an attempt to counter these limitations of expressive ability, the actors noted that their natural (and at times subconscious) reaction was to over-act in order to elicit a response from their partner. Correspondingly, the director found himself requesting the actors to perform exaggerated gestures and movements that he would not have done if the finer facets of facial expression were available.


    3.3 Discussion

Depending on the characteristics of the play or production, the artists speculated that rehearsing using the multimodal MR system could reduce the subsequent required collocated rehearsal time by up to 25%. The primary benefit to the rehearsal would be blocking the scene, planning actors’ major bodily gestures, and, in the case of television and film work, planning camera shots and movement. In television and film work, the artists noted that rehearsal is often minimal or nonexistent due to time and travel constraints. The system provides a potentially cheaper and less time-consuming mode of being able to rehearse with remote colleagues. This benefit would likely extend to technical operators and set designers, who would be able to familiarize themselves with the space in order to identify locations for technical equipment, and optimize lighting and prop placement. The heterogeneity and multimodal nature of the system was also suggested as a novel paradigm for live performance in its own right, including the potential for art and science exhibitions, and even reality television.

Blocking and spatial dimensions are paramount to a theatrical scene, and determining these aspects is frequently divisive between international performers. Such disputes may be reduced or settled early by allowing all parties to virtually observe and experience the rehearsal or set layout prior to a collocated performance. Both actors and the director advocated the system as a means of overcoming the initial apprehension and nervousness of working with one another, and suggested that they would be more immediately comfortable when the time came for a subsequent collocated meeting. Solo performance and reviewing prior run-throughs were discussed as potentially useful modes of system operation. To this end, the virtual replay abilities of the system allow for random access and time-dilation of previous sessions. We are currently implementing an autonomous proxy agent representing an absent human actor that is able to respond in real time to the tracked nonverbal and verbal cues. This may prove useful for solo rehearsal, presenting a responsive humanoid entity to play against.

    The relative inexpressivity of the actors' embodiments (due either to the stiff avatar representations or to the suboptimal resolution of the video streams) implies that scenes relying on performing and reacting to consequential facial expression and subtle gesture would not benefit significantly from rehearsing via the system in its current form. With this in mind, opera rehearsal was suggested by the observers as a potentially compatible domain. Opera has a number of characteristics that suggest this compatibility. Operatic performance relies heavily on gross gesticulation, and while facial expression and minute movements are certainly important, they are perhaps less salient to the overall performance than they are in dramatic theatrical and filmed work; this is reinforced by the reduced interplay between opera performers, who are often projecting to an audience rather than to one another. A key strength of the system is its ability to let remote participants move within and observe a perceptually unified space. Therefore, operatic performers, who, due to tight international schedules, can often dedicate only minimal collocated rehearsal time, may find the system useful for familiarizing themselves with the stage space and planning action. Additionally, the progression and timing of an operatic performance are dictated by the musical score, leaving minimal room for improvisation. This strictly structured form of performance could offset some of the expressive failings of the current system, as behavioral options at a given point in time are limited, thereby potentially reducing instructional complexity.

    4 Conclusions

    This paper has presented our experience to date of setting up a distributed multimodal MR system to support remote acting rehearsals between two actors and a director. Using a range of capture and display devices, our heterogeneous architecture connects people located in three technologically distinct locations via a spatial audiovisual telecommunications medium based on the destination-visitor paradigm. The implications and characteristics of the paradigm have been explored; the paradigm is emerging due to the geographical distribution of teams at technologically asymmetric sites. Aiming to support distributed rehearsal between two actors and a director, the technological setup at each site has been detailed, focusing on acquisition, display, and data transmission. The rehearsals were then described, exploring the successes and failures of the system in terms of the central aspects of spatiality and embodiment. Overall, the common spatial frame of reference presented by the system to all parties was highly conducive to theatrical acting and directing, allowing blocking, gross gesture, and unambiguous instruction to be issued. The relative inexpressivity of the actors' embodiments was identified as the central limitation of the telecommunication, meaning that moments relying on performing and reacting to consequential facial expression and subtle gesture were less successful. We have highlighted spatialized audio, haptics, mobile embodiments, surround acquisition of depth images, facial expression capture, and support for shared objects as key areas for future work toward advanced iterations of this heterogeneous multimodal form of MR telecommunication.

    Acknowledgments

    We would like to thank our actors, Jannik Kuczynski and Jasmina Zuazaga; our director, Morgan Rhys; and the Artistic Director of the Royal Academy of Dramatic Art, Edward Kemp, and his colleagues. This work is part of the BEAMING project supported by the European Commission under the EU FP7 ICT Work Programme.
