EyesWeb XMI: a platform for recording and real-time analysis of multimodal data streams

Gualtiero Volpe, Paolo Coletta, Simone Ghisio, Antonio Camurri
Casa Paganini - InfoMus, DIBRIS - University of Genova, 16145 Genova, Italy, Email: [email protected]

Introduction

Auditory scene analysis and, in general, multimodal scene analysis can benefit greatly from the availability of flexible tools enabling fast development of prototypes for collecting and recording multimodal data, previewing it, and analysing it. Fast prototype deployment allows algorithms to be implemented and tested without going through time-consuming and expensive development cycles. Whereas general-purpose tools such as, for example, Mathworks' Simulink have existed for a long time, platforms specifically devoted to (possibly real-time) analysis of multimodal signals are far less common.

Max [1] is a platform and a visual programming language for music and multimedia, originally developed by Miller Puckette at IRCAM, Paris, and nowadays developed and maintained by Cycling '74. Conceived for sound and music processing in interactive computer music, it was also endowed with packages for real-time video, 3D, and matrix processing. Pd (Pure Data) [2] is similar in scope and design to Max. It also includes a visual programming language and is intended to support the development of interactive computer music applications. The addition of GEM (Graphics Environment for Multimedia) enables real-time generation and processing of video, OpenGL graphics, images, and so on. Moreover, Pd is natively designed to enable live collaboration across networks or the Internet. vvvv [3] is a hybrid graphical/textual programming environment for easy prototyping and development. It has a special focus on real-time video synthesis and is designed to facilitate the handling of large media environments with physical interfaces, real-time motion graphics, audio, and video that can interact with many users simultaneously. Isadora [4] is an interactive media presentation tool created by composer and media artist Mark Coniglio. It mainly includes video generation, processing, and effects and is intended to support artists in developing interactive performances. In the same field of performing arts, Eyecon [5] aims at facilitating interactive performances and installations in which the motion of human bodies is used to trigger or control various other media, e.g., music, sounds, photos, films, lighting changes, and so on. The Social Signal Interpretation framework (SSI) [6] offers tools to record, analyse, and recognise human behaviour in real time, such as gestures, facial expressions, head nods, and emotional speech. Following a patch-based design, pipelines are set up from autonomic components and allow the parallel and synchronised processing of sensor data from multiple input devices.

This paper introduces recent advances in the EyesWeb XMI platform (http://www.infomus.org/) [7][8]. EyesWeb was originally conceived as a hardware and software research platform for fast prototyping of interactive systems and for supporting real-time analysis of multimodal data streams, based on a visual programming language. The platform includes libraries for image and video processing (e.g., background segmentation and motion tracking), for audio processing, for analysis of full-body movement and gesture of single and multiple users, and for analysis of social interaction within groups. It supports a broad range of input and output devices. Recently, EyesWeb has been extended with support for novel commercially available devices, such as Microsoft Kinect, and with a collection of new modules for capturing and analysing 3D data. EyesWeb is freely available on the Internet. It is being used in several application areas, such as active experience of cultural heritage and networked media, performing arts, education and technology-enhanced learning, and therapy and rehabilitation, and it was adopted as a research and development platform in several EU projects. In particular, the work presented in this paper was carried out in the framework of the EU-ICT-FET project SIEMPRE (Social Interaction and Entrainment using Music PeRformance Experimentation, http://siempre.infomus.org/) and was exploited in the framework of the EU-ICT-FET project ILHAIRE (Incorporating Laughter into Human Avatar Interactions: Research and Experiments, http://www.ilhaire.eu/). EyesWeb XMI, as a platform for synchronised recording and analysis of multimodal data for scientific measurements, provides a significant research and development tool for multimodal scene analysis.

The remainder of this paper is organised as follows: the next section introduces the major extensions to the EyesWeb platform for enabling synchronised recordings of multimodal data. Then, a couple of examples of use in concrete experimental scenarios are provided.

EyesWeb XMI as a platform for synchronised recordings

In order to enable synchronised recordings of multimodal data for scientific measurements, EyesWeb XMI was extended in several respects:

- New software modules supporting input from multiple audio sources, multiple video sources, motion capture (MoCap) systems, low-cost RGB-D sensors, and biometric sensors. In particular, synchronisation is achieved by using SMPTE time-codes (named after the Society of Motion Picture and Television Engineers, which defined the standard), a protocol used to synchronise film, video, and audio material. To this aim, EyesWeb XMI was endowed with an SMPTE decoder module, which receives an audio signal as input and decodes the SMPTE time-code contained in it, and an SMPTE encoder module, which generates an audio signal with an encoded time-code.

- New datatypes for floating-point operations explicitly handling significant digits, rounding, and arithmetic operations with significant digits. Such datatypes are used with new software modules enabling the generation of floating-point values with a given number of significant digits and with other modules (or extensions of existing ones) performing operations that take significant digits into account; a minimal sketch of this kind of significant-digit arithmetic is given after the list.

- New specific user interfaces for recording and for previewing recorded data. These are used to control EyesWeb XMI in real time and can also run on mobile devices.
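As a rough illustration of the significant-digit handling mentioned in the second item above, the following Python sketch pairs a floating-point value with a declared number of significant digits and propagates that number through rounding and multiplication. The class and method names are illustrative assumptions; the actual EyesWeb datatypes live in the platform's C++ module library and are not reproduced here.

import math
from dataclasses import dataclass


@dataclass
class SigFloat:
    # A floating-point value paired with its number of significant digits.
    value: float
    sig_digits: int

    def rounded(self) -> float:
        # Round the stored value to its declared number of significant digits.
        if self.value == 0.0:
            return 0.0
        exponent = math.floor(math.log10(abs(self.value)))  # position of the leading digit
        decimals = self.sig_digits - 1 - exponent
        return round(self.value, decimals)

    def __mul__(self, other: "SigFloat") -> "SigFloat":
        # A product keeps the smaller number of significant digits of its operands.
        digits = min(self.sig_digits, other.sig_digits)
        product = SigFloat(self.value * other.value, digits)
        return SigFloat(product.rounded(), digits)


if __name__ == "__main__":
    a = SigFloat(3.14159, 3)  # three significant digits
    b = SigFloat(2.0, 2)      # two significant digits
    print((a * b).value)      # 6.3: the result keeps two significant digits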

Figure 1 shows a typical set-up for the platform. In order to ground the discussion in a concrete example, the following description refers to the recording of a music performance, including video, audio, motion capture, and biometric signals. In the set-up of Figure 1, EyesWeb XMI runs on three workstations. The one on the bottom-right side of the figure is responsible for audio recordings. It is endowed with a professional audio card, receiving the audio signals captured both by microphones placed on each single music instrument and by ambient microphones (light grey arrows). The workstation on the bottom-left side carries out video recordings. It is endowed with a high-quality frame grabber, connected to a professional video camera providing images at high temporal and spatial resolution (green arrow). The workstation at the top of the figure is devoted to recording MoCap data provided by an optical MoCap system (a seven-camera system in the specific case of the figure; connections are displayed as grey lines). The MoCap system receives a clock signal from an audio card (brown arrow). This audio card is connected to a clock generator, which also provides the video camera with a clock signal (purple arrows). Driven by this clock, the audio card generates an SMPTE time-code, which is sent to all the recording devices (blue arrows) and which is used to keep everything synchronised. Each multimodal data stream is stored with SMPTE time stamps as the data is saved to file. In this way, EyesWeb makes it possible to synchronise the different signals acquired from systems working at different sample rates, with different internal clocks, and with possible drifts and delays in data acquisition. Such synchronisation can be achieved in real time, but a more robust synchronisation is achieved after the recording, based on the SMPTE synchronisation streams.
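To make this time-code-based alignment more concrete, the following Python sketch shows one way recorded streams could be brought onto a shared timeline offline: each stream's first SMPTE time-code is converted to seconds, and the resulting offsets are expressed in samples at the stream's own rate. This is only an illustrative sketch under stated assumptions (non-drop-frame time-code, nominal sample rates, hypothetical function names); it is not the EyesWeb implementation.

from typing import List, Tuple


def smpte_to_seconds(timecode: str, fps: float) -> float:
    # Convert an SMPTE time-code string 'HH:MM:SS:FF' to seconds
    # (non-drop-frame time-code assumed for simplicity).
    hh, mm, ss, ff = (int(part) for part in timecode.split(":"))
    return hh * 3600 + mm * 60 + ss + ff / fps


def align_to_common_timeline(
    streams: List[Tuple[str, float, float]]
) -> List[Tuple[str, float]]:
    # Given (first SMPTE time-code, time-code fps, stream sample rate) per stream,
    # return how many samples each stream must be skipped so that all streams
    # start at the latest common time-code.
    starts = [smpte_to_seconds(tc, fps) for tc, fps, _ in streams]
    common_start = max(starts)  # the latest stream start defines the shared origin
    return [
        (tc, (common_start - start) * rate)
        for (tc, fps, rate), start in zip(streams, starts)
    ]


if __name__ == "__main__":
    # Hypothetical first time-codes of a video, an audio, and a MoCap stream.
    offsets = align_to_common_timeline([
        ("10:02:13:05", 25.0, 25.0),     # video at 25 fps
        ("10:02:12:20", 25.0, 48000.0),  # audio at 48 kHz
        ("10:02:13:00", 25.0, 100.0),    # MoCap at 100 Hz
    ])
    for tc, samples in offsets:
        print(f"stream starting at {tc}: skip {samples:.0f} samples")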

Figure 1: A typical set-up for recordings of multimodal data in the framework of a music performance. EyesWeb XMI runs on three workstations enabling synchronised recordings of audio, video, MoCap, and biometric signals.

Figure 2 shows a Graphical User Interface (GUI) designed for previewing recorded data. It concerns multimodal recordings of a string quartet (Quartetto di Cremona). On the left side of the GUI, video is displayed. Radio buttons allow for selecting the video source to be displayed. In this specific case, two video cameras were used (video 1 and video 2), and a 3D visualisation of the skeletons obtained from MoCap data is also available. In case such 3D visualisation is selected, the visualisation plane (higher or lower) can also be set by using two more radio buttons. Moreover, the controls at the centre of the interface enable changing the point of view in the 3D world, by rotating and/or translating it around/along the three axes. On the right side of the GUI, an example of biometric sensor data is displayed. This is the electrocardiography (ECG) of one of the players, and it is visualised in sync with video and audio. At the bottom of the right side of the GUI, radio buttons enable the user to select the audio channel to be reproduced. Five options are available for this recording: one for each of the four music instruments in the quartet (first and second violin, viola, and cello), and the audio captured by an ambient microphone. It is also possible to mute the audio and to control the volume of the reproduced audio. Finally, the SMPTE time-code used for synchronising data streams is displayed in the middle of the right side of the GUI. The preview interface was developed with EyesWeb Mobile, a tool enabling fast prototyping of GUIs to be connected with the EyesWeb XMI platform. EyesWeb Mobile GUIs can run on desktop and laptop computers as well as on mobile devices.


Figure 2: A Graphical User Interface (GUI) for synchronised previewing of multimodal data captured during a recording session. This is implemented using EyesWeb Mobile.

Examples of recordings

This section provides two concrete examples of recordings of multimodal data performed with the EyesWeb XMI platform. The first recording session was carried out in the framework of the EU-ICT-FET project SIEMPRE and concerns multimodal recordings of the music performance of a string quartet (Quartetto di Cremona). Music was performed under different conditions, aiming at investigating the role of the first violin as leader of the music ensemble. In one condition, the musicians played a music piece as in a concert. In another condition, the first violin modified his usual interpretation by adding rhythmic and dynamic changes unexpected by the other musicians (e.g., playing forte where the score indicates piano, or speeding up where a rallentando is requested). The other members of the quartet were not aware of these new versions before playing. Recordings consisted of: (i) the sound produced by the musicians, captured via piezoelectric pickups and conventional microphones; (ii) the sound-producing gestures of each musician, captured by means of a professional video camera and an optical motion capture system (Qualisys, seven cameras). Audio, video, and MoCap data were acquired simultaneously and synchronised in real time using the set-up displayed in Figure 1. Individual audio for each musician is captured through piezoelectric pickups attached to the bridge of the instrument. The overall sound of the ensemble is captured using a cardioid medium-diaphragm condenser microphone, as well as a binaural stereo recording dummy head; the binaural stereo recording is then used to set the gain of each individual pickup signal so that the audio level of each instrument corresponds to the overall acoustic result. Figure 3 shows the ongoing recording session.
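The paper does not detail how the binaural reference is compared with the individual pickup signals; as a purely illustrative sketch under that caveat, the following Python snippet shows one simple way a gain factor could be derived by matching RMS levels. The function names and toy signals are assumptions, not the authors' actual procedure.

import numpy as np


def rms(signal: np.ndarray) -> float:
    # Root-mean-square level of a mono signal.
    return float(np.sqrt(np.mean(np.square(signal))))


def pickup_gain(pickup: np.ndarray, reference: np.ndarray) -> float:
    # Gain that makes the pickup's RMS level match the reference level,
    # e.g. the level of the instrument as heard in the binaural recording.
    return rms(reference) / rms(pickup)


if __name__ == "__main__":
    # Toy signals standing in for a pickup track and the acoustic reference.
    t = np.linspace(0.0, 1.0, 48000, endpoint=False)
    pickup = 0.05 * np.sin(2 * np.pi * 440.0 * t)
    reference = 0.2 * np.sin(2 * np.pi * 440.0 * t)
    print(f"suggested gain: {pickup_gain(pickup, reference):.2f}")  # ~4.00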

Figure 3: A recording session of a string quartet (Quartetto di Cremona). Recorded multimodal data included audio, video, and MoCap data (Genova, Italy, September 2011).

The second example refers to a recording session carried out in the framework of the EU-ICT-FET ILHAIRE project. The project aims at analysing and synthesising laughter with the final aim of creating more effective and natural virtual agents. In order to collect data for laughter analysis, several recordings have been performed during the project. In one of them, carried out in Paris in November 2012, six groups of three participants were recorded while performing different tasks intended to stimulate laughter in the group (e.g., watching funny videos). Recordings included:

- Three inertial MoCap systems: two out of three participants were captured using Xsens MVN Biomech, providing output at 120 fps; the third participant was recorded using an Animazoo IGS-190, providing output at 60 fps.

- Two Kinect motion sensors to track face, head, and body movements (640x480, 30 fps).

- Six web cameras (four Logitech Webcam Pro 9000 at 640x480, 30 fps; two Philips PC webcams SPZ5000 at 640x480, 60 fps) to record the whole interaction area and close-up videos of participants' faces where possible. Video data was also collected from the two Kinects.

- Sound from each participant through personal wireless microphones (mono, 16 kHz).

- Respiration activity by means of a respiration sensor (ProComp Infiniti, Thought Technology) capturing thoracic and abdominal circumference of one participant at 256 samples/second.

Recordings were performed by connecting the EyesWeb XMI and SSI [6] platforms. Tools were developed in EyesWeb to play back, annotate, segment, and perform a preliminary processing of the data. For example, multimodal data associated with manually identified segments containing relevant laughter episodes were automatically extracted and synchronised by means of an EyesWeb application. Figure 4 shows a snapshot from the recording session.
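Extracting such segments has to cope with the different nominal rates listed above (120 and 60 fps MoCap, 30 and 60 fps video, 16 kHz audio, 256 Hz respiration). As a minimal sketch, assuming segments are annotated in seconds on the shared timeline and that each stream starts at the common origin, the following Python snippet maps one segment to per-stream sample index ranges; it is an illustration, not the EyesWeb application used in the project.

from typing import Dict, Tuple

# Nominal rates of the streams listed above (samples or frames per second).
STREAM_RATES: Dict[str, float] = {
    "xsens_mocap": 120.0,
    "animazoo_mocap": 60.0,
    "kinect": 30.0,
    "webcam_logitech": 30.0,
    "webcam_philips": 60.0,
    "audio": 16000.0,
    "respiration": 256.0,
}


def segment_indices(start_s: float, end_s: float,
                    rates: Dict[str, float]) -> Dict[str, Tuple[int, int]]:
    # Map a segment annotated in seconds on the shared timeline to
    # (first, last) sample indices for each stream.
    return {
        name: (int(round(start_s * rate)), int(round(end_s * rate)))
        for name, rate in rates.items()
    }


if __name__ == "__main__":
    # Hypothetical laughter episode annotated from 12.5 s to 17.0 s.
    for name, (first, last) in segment_indices(12.5, 17.0, STREAM_RATES).items():
        print(f"{name}: samples {first}..{last}")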

Figure 4: A snapshot from the ILHAIRE recording session of laughter episodes in groups of three participants (Paris, France, November 2012).

Conclusion

This paper presented EyesWeb XMI as a platform for performing and managing synchronised recordings of multimodal data. EyesWeb is currently adopted for this purpose in several research projects, such as those mentioned in this paper. Nevertheless, this is one specific extension of EyesWeb, whose main goal remains to support real-time analysis of synchronised multimodal data streams and fast development of prototypes and applications exploiting such analysis capabilities.

Acknowledgements

The development of the software modules for multimodal recordings in EyesWeb was partially supported by the EU-ICT-FET project SIEMPRE (Social Interaction and Entrainment using Music PeRformance Experimentation, http://siempre.infomus.org/). SIEMPRE acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 250026-2. The multimodal recordings of laughter episodes have received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 270780 (ILHAIRE, Incorporating Laughter into Human Avatar Interactions: Research and Experiments, http://www.ilhaire.eu/). The authors thank Donald Glowinski, who participated in the recordings of the string quartet, Maurizio Mancini and Giovanna Varni, who participated in the recordings of the laughter episodes, the whole staff of Casa Paganini - InfoMus for support and discussion, and the partners in the above-mentioned EU projects.

References

[1] Max homepage. URL: http://cycling74.com/products/max/

[2] Puckette, M. S.: Pure Data. In Proceedings of the International Computer Music Conference, 1996, 269-272.

[3] vvvv homepage. URL: http://vvvv.org/

[4] Isadora homepage. URL: http://troikatronix.com/

[5] Eyecon homepage. URL: http://eyecon.palindrome.de/

[6] Wagner, J., Lingenfelser, F., André, E.: The Social Signal Interpretation Framework (SSI) for Real Time Signal Processing and Recognition. In Proceedings of Interspeech 2011, 2011.

[7] Camurri, A., Hashimoto, S., Ricchetti, M., Trocca, R., Suzuki, K., Volpe, G.: EyesWeb - Toward Gesture and Affect Recognition in Interactive Dance and Music Systems. Computer Music Journal 24(1) (2000), 57-69.

[8] Camurri, A., Coletta, P., Demurtas, M., Peri, M., Ricci, A., Sagoleo, R., Simonetti, M., Varni, G., Volpe, G.: A Platform for Real-Time Multimodal Processing. In Proceedings of the International Conference on Sound and Music Computing 2007, 2007, 354-358.
