
Conversational Pointing Gestures for Virtual Reality Interaction
Implications from an Empirical Study

Thies Pfeiffer (tpfeiffe@techfak.uni-bielefeld.de)
Marc E. Latoschik (marcl@techfak.uni-bielefeld.de)
Ipke Wachsmuth (ipke@techfak.uni-bielefeld.de)
AI Group, Faculty of Technology, Bielefeld University, Germany

Conversational Interface Agents facilitate natural interactions in Virtual Reality Environments.

Motivation

Deictic expressions (such as "put that there") are fundamental in human communication to refer to entities in the environment. In situated contexts, deictic expressions often comprise pointing gestures directed at regions or objects. One of the primary tasks in Virtual Reality applications is the visually perceivable manipulation of objects. Thus VR research has focused on developing metaphors that optimize the tradeoff between a swift and precise selection of objects. Prominent examples are ray casting, occlusion, or arm extension. These techniques are well suited for interacting directly with the system. When the interaction with the system is mediated, e.g., by an Embodied Conversational Agent (ECA), the primary focus lies on a smooth understanding of natural communication and natural gestures. It is thus recommended to improve the robustness and accuracy of the interpretation of natural pointing gestures, i.e., gestures without facilitation per visual aids or other auxiliaries. To attain these ends, we contribute results from a study on pointing and draw conclusions for the implementation of pointing-based conversational interactions in immersive Virtual Reality.
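Ray casting is one of the direct-selection metaphors named above. As a point of reference for readers unfamiliar with it, the following is a minimal sketch (our own illustration, not part of the poster) that selects the nearest object whose spherical proxy is hit by the pointing ray; all names and the sphere simplification are assumptions.

```python
import numpy as np

def ray_cast_select(origin, direction, objects):
    """Return the nearest object hit by the pointing ray, or None.

    objects: list of (name, center, radius) with spherical proxies.
    Illustrative sketch only, not the poster's implementation.
    """
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)                   # normalize ray direction

    best = None
    for name, center, radius in objects:
        c = np.asarray(center, dtype=float)
        t = np.dot(c - o, d)                    # distance along the ray to the closest point
        if t < 0:                               # object lies behind the hand
            continue
        miss = np.linalg.norm((o + t * d) - c)  # perpendicular distance from the ray
        if miss <= radius and (best is None or t < best[0]):
            best = (t, name)
    return None if best is None else best[1]

# Example: two objects on a table, ray cast from the hand along +x
objects = [("cube", (1.0, 0.0, 0.0), 0.05), ("ball", (2.0, 0.3, 0.0), 0.05)]
print(ray_cast_select((0, 0, 0), (1, 0, 0), objects))   # -> "cube"
```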


Aim

How accurate are pointing gestures?

- Improved models for human pointing interpretation
- Advances for the production of pointing gestures
- Contributing to more robust multimodal conversational interfaces

Applications

- Human-Computer Interaction: multimodal interfaces
- Human-Agent Interaction: multimodal conversational interfaces
- Assistive Technology and Human-Robot Interaction
- Empirical research and Usability Studies: automatic multimodal annotation, automatic grounding of gestures based on a world model

Research Background

- Kranstedt, Lücking, Pfeiffer, Rieser & Staudacher. Measuring and Reconstructing Pointing in Visual Contexts. In Proceedings of the Brandial 2006.
- Kranstedt, Lücking, Pfeiffer, Rieser & Wachsmuth. Deictic Object Reference in Task-oriented Dialogue. In Situated Communication. Mouton de Gruyter, Berlin, 2006.
- Weiß, Pfeiffer, Eikmeyer & Rickheit. Processing Instructions. In Situated Communication. Mouton de Gruyter, Berlin, 2006.
- Kranstedt, Lücking, Pfeiffer, Rieser & Wachsmuth. Deixis: How to Determine Demonstrated Objects Using a Pointing Cone. In 6th International Gesture Workshop. Springer-Verlag GmbH, Berlin Heidelberg, 2006.
- Pfeiffer & Latoschik. Resolving Object References in Multimodal Dialogues for Immersive Virtual Environments. In Proceedings of the IEEE Virtual Reality 2004.
- Pfeiffer, Voss & Latoschik. Resolution of Multimodal Object References Using Conceptual Short Term Memory. In Proceedings of the EuroCogSci03.

Method

Study on object pointing
- Interaction of two participants
- Two conditions: speech + gesture and gesture only
- Real objects
- Study with 62 participants
- Cooperative effort with linguists

Technology
- Audio + video recordings
- Motion capturing using ART GmbH optical tracking system
- Automatic adaptation of a model of the user's posture
- Special hand-made soft gloves

Interaction Game
- Description Giver is presented with the object to demonstrate
- Description Giver utters deictic expression (s+g or g only)
- Object Identifier tries to identify the object
- Description Giver gives feedback (yes/no)
- Proceed with next object
- No corrections or repairs!

[Figure: experimental setup with spotlight, camera, tracking system, and displays (M1: task; M2 + M3: system time).]

How to determine the parameters of the pointing extension model? What is the opening angle? What defines the anchor of the model? Do we aim with the index finger (IFP) or by gazing over the index finger (GFP)?
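The last question contrasts two ray definitions: a ray along the index finger (IFP) versus a ray from the eye over the fingertip (GFP). The sketch below illustrates both definitions; it is our own illustration, not the study's code, and the marker names, anchor choices and coordinates are hypothetical.

```python
import numpy as np

def _unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def ifp_ray(finger_base, finger_tip):
    """Index-Finger-Pointing: ray directed along the index finger (base -> tip),
    anchored here at the fingertip (one possible choice of anchor)."""
    tip = np.asarray(finger_tip, dtype=float)
    return tip, _unit(tip - np.asarray(finger_base, dtype=float))

def gfp_ray(eye, finger_tip):
    """Gaze-Finger-Pointing: ray anchored at the eye,
    directed from the eye over the fingertip."""
    e = np.asarray(eye, dtype=float)
    return e, _unit(np.asarray(finger_tip, dtype=float) - e)

# Hypothetical tracked positions (metres) from the motion-capture system
eye         = (0.00, 1.60, 0.00)
finger_base = (0.25, 1.20, 0.40)
finger_tip  = (0.30, 1.18, 0.48)

print("IFP:", ifp_ray(finger_base, finger_tip))
print("GFP:", gfp_ray(eye, finger_tip))
```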

Exploring the data

The study combines multimodal data comprising audio, video, motion capture and annotation data. A coherent, synchronized view of all data sources is provided using the Interactive Augmented Data Explorer (IADE), developed at Bielefeld University. The picture to the right shows a session with the Interactive Augmented Data Explorer. The scientist to the right interactively explores the recorded data for qualitative analysis. The model to the left shows the table with the objects and a stick-figure driven by the motion capture data. The video taken from one camera perspective is displayed on the floating panel, together with the audio recordings. Information from specific annotation tiers is presented as floating text. The scientist can, e.g., take the perspective of the description giver or the object identifier. All elements are interactive, e.g. the video panels and annotations can be resized or positioned to allow for a comfortable investigation.

Observations
- Pointing is fuzzy (as expected), but even in short range distance
- Fuzziness increases with distance
- Overshooting at the edge of the domain (intentional)
- Still, the human object identifier shows a good performance of 83.9% correct identifications

Modelling approach
- Ellipse shapes of the bagplots suggest a cone-based model of the extension of pointing

[Figure: A visualization of the intersections of the pointing-ray (dots) for four different objects over all participants. The data is grouped via bagplots, the asterisk marks the mean, the darker area clusters 50 percent, the brighter area 75 percent of the demonstrations. In the depicted setting the person pointing was standing to the left, the person identifying the objects to the right.]
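The bagplot figure groups the points where each pointing ray meets the table surface. A minimal sketch of that ray-plane intersection (our own illustration; the function and variable names are hypothetical, not taken from the study):

```python
import numpy as np

def ray_plane_intersection(origin, direction, plane_point, plane_normal):
    """Point where a pointing ray crosses the table plane, or None if it
    runs parallel or points away. The dots grouped by the bagplots are
    such intersection points, one per pointing gesture."""
    o, d = np.asarray(origin, float), np.asarray(direction, float)
    p, n = np.asarray(plane_point, float), np.asarray(plane_normal, float)
    denom = np.dot(d, n)
    if abs(denom) < 1e-9:            # ray parallel to the table
        return None
    t = np.dot(p - o, n) / denom
    return None if t < 0 else o + t * d

# Hypothetical IFP ray intersected with a table top at height 0.75 m
hit = ray_plane_intersection((0.3, 1.18, 0.48), (0.5, -0.4, 0.77),
                             plane_point=(0, 0.75, 0), plane_normal=(0, 1, 0))
print(hit)
```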

Results

Simulations

Several simulation runs have been conducted to test different approaches to model the pointing extension on the collected data.

Strict Semantic Model

For a strict semantic model the pointing extension has to single out one and only one object. In the simulation runs we determined the optimal angle for such a pointing cone per row and for the overall area. The results are depicted in the table below. For the strict semantic model the GFP offers better performance while having a narrower opening angle: GFP is more accurate than IFP.

row   IFP α   IFP perf.   GFP α   GFP perf.
1     84      70.27       86      68.92
2     80      61.84       68      75
3     71      71.43       69      81.82
4     60      53.95       38      65.79
5     36      43.84       24      57.53
6     24      31.15       25      42.62
7     14      23.26       17      23.26
8     10      7.14        10      14.29
all   71      38.54       61      48.12

Table: The optimal opening angles (α) per row for a strict semantic pointing cone model of the pointing extension. In addition, the performance in terms of correctly identified objects in percent of all objects within the specified area is depicted, both for IFP and GFP. The row titled "all" shows the performance for rows 1-7; row 8 has been excluded because of the overshooting behavior.
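A hedged sketch of how such a strict semantic simulation could be realised (our own reconstruction, not the study's code): a trial counts as successful only if the cone singles out exactly one object, the intended referent, and the opening angle is swept to maximise the success rate. Treating α as the full opening angle is an assumption, as are all names and data structures below.

```python
import numpy as np

def in_cone(apex, axis, half_angle_deg, point):
    """True if point lies inside the cone defined by apex, axis and half-angle."""
    v = np.asarray(point, float) - np.asarray(apex, float)
    v = v / np.linalg.norm(v)
    a = np.asarray(axis, float) / np.linalg.norm(np.asarray(axis, float))
    return np.degrees(np.arccos(np.clip(np.dot(v, a), -1.0, 1.0))) <= half_angle_deg

def strict_semantic_success(trial, opening_angle_deg):
    """A trial succeeds iff the cone contains exactly one object: the target."""
    apex, axis, target, objects = trial          # objects: {name: position}
    inside = [name for name, pos in objects.items()
              if in_cone(apex, axis, opening_angle_deg / 2.0, pos)]
    return inside == [target]

def best_opening_angle(trials, candidate_angles):
    """Sweep candidate opening angles and return (best angle, success rate in %)."""
    scored = [(sum(strict_semantic_success(t, a) for t in trials) / len(trials), a)
              for a in candidate_angles]
    rate, angle = max(scored)
    return angle, 100.0 * rate
```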

Pragmatic Model

When the pointing extension is handled on the level of pragmatics, we can allow for inference mechanisms to disambiguate between several objects and, hence, use heuristics. For the simulation, we used a basic heuristic based on the angular distance between objects and the pointing-ray. The table below depicts the results of our simulation runs. This time IFP performs better than GFP: IFP is more precise than GFP. The opening angles in the proximal rows are rather large, while the angles in the more distal rows are much smaller. This motivates us to distinguish between proximal and distal pointing.

row   IFP α   IFP perf.   GFP α   GFP perf.
1     120     98.65       143     98.65
2     109     100         124     100
3     99      94.81       94      93.51
4     109     98.68       89      93.42
5     72      97.26       75      94.52
6     44      91.8        50      90.16
7     38      86.05       41      67.44
8     31      52.38       26      69.05
all   120     96.04       143     92.71

Table: The optimal opening angles (α) per row for a pragmatic pointing cone model of the pointing extension. In addition, the performance in terms of correctly identified objects in percent of all objects within the specified area is depicted, both for IFP and GFP. The row titled "all" shows the performance for rows 1-7; row 8 has been excluded because of the overshooting behavior.
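The poster describes the pragmatic heuristic only as being based on the angular distance between the objects and the pointing-ray. One possible reading, sketched below with hypothetical names (not the study's code): among the objects inside the cone, pick the angularly closest one and count the trial as successful if it is the intended referent.

```python
import numpy as np

def angular_distance_deg(apex, axis, point):
    """Angle in degrees between the pointing ray and the direction to an object."""
    v = np.asarray(point, float) - np.asarray(apex, float)
    v = v / np.linalg.norm(v)
    a = np.asarray(axis, float) / np.linalg.norm(np.asarray(axis, float))
    return np.degrees(np.arccos(np.clip(np.dot(v, a), -1.0, 1.0)))

def pragmatic_success(trial, opening_angle_deg):
    """Among objects inside the cone, pick the angularly closest one;
    the trial succeeds iff that object is the intended referent."""
    apex, axis, target, objects = trial          # objects: {name: position}
    distances = {name: angular_distance_deg(apex, axis, pos)
                 for name, pos in objects.items()}
    candidates = {n: d for n, d in distances.items()
                  if d <= opening_angle_deg / 2.0}
    if not candidates:
        return False
    return min(candidates, key=candidates.get) == target
```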

Conclusion

Primary
- Pointing is best interpreted at the level of pragmatics and not semantics
- Index-Finger-Pointing is more precise
- Gaze-Finger-Pointing is more accurate
- The results stated in the tables above and our qualitative observations using IADE suggest a dichotomy of proximal vs. distal pointing. This fits nicely with the dichotomy common in many languages (here vs. there).
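The precise/accurate distinction can be read as spread versus systematic offset of the angular pointing error. The sketch below shows one way to compute both; this is our own formulation with made-up numbers, not a definition or data from the poster.

```python
import numpy as np

def accuracy_and_precision(angular_errors_deg):
    """Summarize pointing errors: accuracy as the mean angular error
    (systematic offset) and precision as its standard deviation (spread).
    Illustrative definitions only; the poster does not fix these formulas."""
    e = np.asarray(angular_errors_deg, dtype=float)
    return {"accuracy_mean_error": float(e.mean()),
            "precision_spread": float(e.std(ddof=1))}

# Hypothetical per-trial angular errors (degrees) for IFP and GFP
ifp = [4.1, 4.3, 3.9, 4.2, 4.0]   # small spread        -> more precise
gfp = [2.0, 5.5, 1.2, 4.8, 2.5]   # smaller mean error  -> more accurate
print("IFP:", accuracy_and_precision(ifp))
print("GFP:", accuracy_and_precision(gfp))
```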

Secondary
- Taking the direction of gaze into account does not always improve performance (contrary to the mainstream opinion). At least in the setting used in our study, with widely spaced objects (20 cm), it can be ignored when going for high overall success.
- Humans display a non-linear behavior at the borders of the domain.

[Figure: A combined model for the extension of proximal and distal pointing, showing a proximal cone and a distal cone. The boundary between proximal and distal pointing is defined by the personal distance d.]
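A minimal sketch of the combined model's selection logic; the personal distance d and the two opening angles are free parameters, and the values used below are placeholders rather than results from the study.

```python
def combined_cone_angle(distance_to_region, personal_distance_d,
                        proximal_opening_deg, distal_opening_deg):
    """Combined model: use the wide proximal cone inside the personal
    distance d and the narrow distal cone beyond it.
    Parameter values are placeholders, not results from the study."""
    if distance_to_region <= personal_distance_d:
        return proximal_opening_deg
    return distal_opening_deg

# Placeholder parameters: d = 1.0 m, 110 deg proximal cone, 40 deg distal cone
print(combined_cone_angle(0.6, 1.0, 110.0, 40.0))   # -> 110.0 (proximal)
print(combined_cone_angle(1.8, 1.0, 110.0, 40.0))   # -> 40.0  (distal)
```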

Future Work
- Confirm the results in a mixed setting with one human and one embodied conversational agent over virtual objects

Acknowledgements

This work has been funded by the EC in the project PASION, FP6 - IST program, reference number 27654, and by the Deutsche Forschungsgemeinschaft (DFG) in the Collaborative Research Center 360 (SFB 360), "Situated Artificial Communicators".

Bibliography

- D. A. Bowman, E. Kruijff, J. J. LaViola Jr., and I. Poupyrev. 3D User Interfaces – Theory and Practice. Addison-Wesley, 2005.
- E. Kaiser, A. Olwal, D. McGee, H. Benko, A. Corradini, X. Li, P. Cohen, and S. Feiner. Mutual Disambiguation of 3D Multimodal Interaction in Augmented and Virtual Reality. In Proceedings of the 5th International Conference on Multimodal Interfaces, pages 12–19. ACM Press, 2003.
- M. E. Latoschik. A Gesture Processing Framework for Multimodal Interaction in Virtual Reality. In Proceedings of the 1st International Conference on Computer Graphics, Virtual Reality and Visualisation in Africa, AFRIGRAPH 2001, pages 95–100. ACM SIGGRAPH, 2001.
- A. Olwal, H. Benko, and S. Feiner. SenseShapes: Using Statistical Geometry for Object Selection in a Multimodal Augmented Reality System. In Proceedings of The Second IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2003), pages 300–301, Tokyo, Japan, October 7–10, 2003.
- I. Wachsmuth, B. Lenzmann, T. Jörding, B. Jung, M. Latoschik, and M. Fröhlich. A Virtual Interface Agent and its Agency. In Proceedings of the First International Conference on Autonomous Agents, pages 516–517, 1997.
- C. A. Wingrave, D. A. Bowman, and N. Ramakrishnan. Towards Preferences in Virtual Environment Interfaces. In EGVE '02: Proceedings of the Workshop on Virtual Environments 2002, pages 63–72, Aire-la-Ville, Switzerland, 2002. Eurographics Association.