
Real-Time Feedback System for Monitoring and Facilitating Discussions

Supervisor: Prof Justin DAUWELS, School of Electrical and Electronic Engineering, NTU

Co-Supervisor: Prof Daniel THALMANN, Institute for Media Innovation (IMI), NTU

PhD Student: Yasir Tahir, Institute for Media Innovation (IMI), NTU

Introduction


• Social Signal Processing is the research and technological domain that aims to give computers the ability to sense and understand human social signals.

• Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life.

• Although everyone understands the importance of social signals in everyday situations, and despite recent advances in the machine analysis of relevant behavioural cues, developing automated systems for Social Signal Processing (SSP) remains challenging.

Vinciarelli, Alessandro, Maja Pantic, and Hervé Bourlard. "Social signal processing: Survey of an emerging domain." Image and Vision Computing 27.12 (2009): 1743-1759.

Outline

1. General Objective
2. Speech Analysis
3. Video Analysis
4. Data Collection
5. Feedback

Enhancing Social Cognition


Posture, gesture, appearance, speech, distance, facial expression

Behavioral cues (e.g. pitch, volume, turn-taking, posture)

Useful feedback for more effective conversations

Applications

• Mentoring for leadership development
• Team meetings
• Interviewing systems

Functional Blocks of the System


Speech Analysis


• Four types of vocal social signalling:

– Activity level
– Engagement
– Stress
– Mirroring

Pentland, Alex. "Social dynamics: Signals and behaviour." International Conference on Developmental Learning. Vol. 5. 2004.

Speech Features


Speech Analysis: Setup


Recording setup for two-person conversations, using the Zoom H4n voice recorder with Sennheiser e845s microphones.


Speech Analysis: GUI

GUI for real-time analysis in MATLAB



Video Analysis: Setup


• For video analysis we use a Kinect sensor, which provides a range of sensors that can be used to extract many useful features.

• Physical capabilities: angles of Kinect vision (depth and RGB)

– Horizontal: 57.5 degrees
– Vertical: 43.5 degrees, with a tilt range of -27 to +27 degrees up and down
– Distance range for depth: 1.2 to 3.5 m
– Microphone array
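To give a sense of scale, the viewing angles and depth range quoted above can be turned into an approximate visible area with basic trigonometry. The sketch below assumes an ideal pinhole camera looking straight ahead; the function name `visible_extent` is illustrative and not part of any Kinect SDK.

```python
import math

# Viewing angles quoted on the slide.
HORIZONTAL_FOV_DEG = 57.5
VERTICAL_FOV_DEG = 43.5


def visible_extent(distance_m: float, fov_deg: float) -> float:
    """Width (or height) of the scene covered at a given distance.

    Assumes an ideal pinhole camera: extent = 2 * d * tan(fov / 2).
    """
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)


# Evaluate at the two ends of the quoted depth range.
for d in (1.2, 3.5):
    width = visible_extent(d, HORIZONTAL_FOV_DEG)
    height = visible_extent(d, VERTICAL_FOV_DEG)
    print(f"at {d} m: about {width:.2f} m wide, {height:.2f} m high")
```

Under these assumptions, a seated participant near the short end of the depth range fills much more of the frame than one near the far end, which is relevant when choosing the seating distance.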

Video Analysis: Features


• We extract the following features from the video data:

– Posture
– Gesture usage
– Nodding
– Audio-visual speech detection

Video Analysis: Detection


• Nodding detection

– We detect consecutive vertical head movements. Such movement is classified as YES (vertical nodding).
– Similarly, consecutive horizontal head movements are detected and classified as NO (horizontal nodding).
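One way to read "consecutive" head movement is as repeated direction reversals along one axis. The sketch below follows that reading; it is an assumption, not the detector used in the system. The input format (a list of per-frame head `(x, y)` positions), the function names, and the `min_step` and `min_reversals` thresholds are all hypothetical.

```python
def count_reversals(values, min_step=0.005):
    """Count direction changes in a 1-D track, ignoring steps below min_step."""
    reversals = 0
    last_direction = 0
    for prev, curr in zip(values, values[1:]):
        step = curr - prev
        if abs(step) < min_step:
            continue  # treat tiny movements as jitter
        direction = 1 if step > 0 else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


def classify_head_motion(head_xy, min_reversals=2):
    """Return "YES" for vertical nodding, "NO" for horizontal, else "NONE"."""
    xs = [p[0] for p in head_xy]
    ys = [p[1] for p in head_xy]
    vertical = count_reversals(ys)
    horizontal = count_reversals(xs)
    if vertical >= min_reversals and vertical > horizontal:
        return "YES"
    if horizontal >= min_reversals and horizontal > vertical:
        return "NO"
    return "NONE"


# Head moving up and down while x stays fixed (made-up positions).
nod = [(0.0, 0.00), (0.0, 0.02), (0.0, 0.00), (0.0, 0.02), (0.0, 0.00)]
print(classify_head_motion(nod))
```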



• Posture detection

– Posture is detected from the angle between the head and the shoulders of a person.
– So far we have implemented three basic sitting postures: upright, leaned back and hunched forward.
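A minimal sketch of how a head-to-shoulder angle could be mapped to the three sitting postures. The coordinate convention (`y` up, `z` as distance from the sensor, person facing the sensor), the function names, and the 10-degree threshold are assumptions made for illustration; the slides do not give the actual thresholds.

```python
import math


def lean_angle_deg(head, shoulder_center):
    """Angle of the shoulder-to-head line from vertical, in the y-z plane.

    Points are (x, y, z) in metres. A negative angle means the head is
    closer to the sensor than the shoulders.
    """
    dy = head[1] - shoulder_center[1]
    dz = head[2] - shoulder_center[2]
    return math.degrees(math.atan2(dz, dy))


def classify_posture(head, shoulder_center, threshold_deg=10.0):
    """Map the lean angle to one of the three postures named on the slide."""
    angle = lean_angle_deg(head, shoulder_center)
    if angle < -threshold_deg:
        return "hunched forward"
    if angle > threshold_deg:
        return "leaned back"
    return "upright"


# Head 10 cm closer to the sensor than the shoulders (made-up joints).
print(classify_posture((0.0, 0.6, 1.9), (0.0, 0.4, 2.0)))
```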



• Speech detection

– Speech is detected from lip motion in the detected face.
– Fusing the audio data with the lip motion improves speech detection and reduces false detections.
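The simplest form of audio-visual fusion is to require both cues to agree in each frame. The sketch below assumes per-frame lip-motion and audio-energy scores already scaled to the range 0 to 1; the function name, the thresholds, and the AND rule itself are illustrative assumptions, since the slides do not describe the fusion rule.

```python
def detect_speech(lip_motion, audio_energy,
                  lip_threshold=0.5, audio_threshold=0.5):
    """Per-frame speech decision: both lip motion and audio must be active."""
    if len(lip_motion) != len(audio_energy):
        raise ValueError("lip_motion and audio_energy must have equal length")
    return [
        lip > lip_threshold and energy > audio_threshold
        for lip, energy in zip(lip_motion, audio_energy)
    ]


# Made-up scores: frame 2 has lip motion without sound, frame 3 has sound
# without lip motion (for example, the other person talking).
lips = [0.9, 0.8, 0.1, 0.7]
audio = [0.9, 0.1, 0.9, 0.8]
print(detect_speech(lips, audio))
```

With this rule, lip motion alone or sound alone does not count as speech for that person, which is one way fusion can cut false detections.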



• Gesture Usage

– Instead of looking for specific gestures, we calculate the overall hand movement. A higher value of this measure represents greater use of gestures, whereas a low value represents less gesture use.
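One plausible measure of overall hand movement is the distance both hands travel per second. This is a sketch under stated assumptions: hand positions are per-frame `(x, y, z)` joints in metres, the 30 Hz frame rate is an assumed default, and the names `path_length` and `gesture_usage` are hypothetical.

```python
import math


def path_length(points):
    """Total distance travelled along a sequence of 3-D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def gesture_usage(left_hand, right_hand, frame_rate_hz=30.0):
    """Average travel of both hands, in metres per second, over the window."""
    frames = min(len(left_hand), len(right_hand))
    if frames < 2:
        return 0.0
    duration_s = (frames - 1) / frame_rate_hz
    total = path_length(left_hand[:frames]) + path_length(right_hand[:frames])
    return total / duration_s


# Made-up tracks: the left hand moves 0.5 m in one frame, the right is still.
left = [(0.0, 0.0, 0.0), (0.3, 0.4, 0.0)]
right = [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]
print(gesture_usage(left, right, frame_rate_hz=1.0))
```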



Data Collection


• We recorded 50 sessions of two-person discussions. Each session lasted five minutes. The topics were everyday issues, ranging from social issues to movies.

• Four people (two males and two females) participated in the data collection.

• Audio data was recorded using lapel microphones.

• Video and depth data were recorded using Kinect sensors, one for each person.

• We focused on meeting scenarios:

– Participants were sitting.
– We did not ask the participants to restrict their movements, to keep the social aspect intact.

Data Collection: Example


Data Collection: Results

| Distance of Speakers | Seating Arrangement | Ambient Noise | No. of Samples | Undetected Speech | False Detection | Interference | Overall Accuracy |
|---|---|---|---|---|---|---|---|
| 0.4-0.5m | Side | Low | 19200k | 6.5% | 4% | 9% | 80.5% |
| 0.7-0.8m | Side | Low | 19200k | 2.0% | 2.5% | 3.5% | 92% |
| 1.0m | Front | Low | 19200k | 2.5% | 2% | 12% | 84.5% |
| 1.2m | Front | Low | 19200k | 2.0% | 0.0% | 1% | 97% |
| 1.5m | Front | Low | 19200k | 0.5% | 0.5% | 0.5% | 98.5% |
| 1.8m | Front | Low | 19200k | 2.0% | 0.0% | 0.0% | 98.0% |
| 1.5-1.8m | Front | High | 19200k | 0.5% | 1.5% | 1.0% | 97% |
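The overall accuracy in these results appears to be 100% minus the three error columns (undetected speech, false detection, interference). That relationship is inferred from the numbers rather than stated on the slide. The short check below, with illustrative function names, recomputes it for each row and lists any row where the reported figure differs.

```python
# (distance, undetected %, false detection %, interference %, reported accuracy %)
ROWS = [
    ("0.4-0.5m", 6.5, 4.0, 9.0, 80.5),
    ("0.7-0.8m", 2.0, 2.5, 3.5, 92.0),
    ("1.0m", 2.5, 2.0, 12.0, 84.5),
    ("1.2m", 2.0, 0.0, 1.0, 97.0),
    ("1.5m", 0.5, 0.5, 0.5, 98.5),
    ("1.8m", 2.0, 0.0, 0.0, 98.0),
    ("1.5-1.8m", 0.5, 1.5, 1.0, 97.0),
]


def implied_accuracy(undetected, false_detection, interference):
    """Accuracy implied by the three error rates (assumed relationship)."""
    return 100.0 - (undetected + false_detection + interference)


def rows_that_differ(rows, tolerance=0.01):
    """Distances whose reported accuracy does not match the implied value."""
    return [
        row[0]
        for row in rows
        if abs(implied_accuracy(*row[1:4]) - row[4]) > tolerance
    ]


print(rows_that_differ(ROWS))
```

Six of the seven rows match exactly. The 1.0m row implies 83.5% from its error columns against a reported 84.5%, so one of its figures may contain a transcription slip.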

| Visual Cue | Video Duration | False Detection | Accuracy |
|---|---|---|---|
| Posture | 500 min | 5% | 95% |
| Gesture | 500 min | 13% | 87% |

• For the visual cues, we analyzed the video recorded in the 50 sessions.



Methodology

1. Audio and Video Data Acquisition
2. Pre-processing
– Pre-processing step for audio: speech detection
– Pre-processing step for video: face and skeleton detection
3. Feature Extraction: speech cues and visual cues
4. Behavior Detection via Support Vector Machines (SVM): dominance, interest, discord, consistency, mirroring
5. Feedback

Social Roles

• In real-life situations, a combination of the honest signals can be observed in individuals. Four core groups of social roles have been identified: listening, teaming, exploring and leading.

• Listening is said to signal a combination of attentive interest and openness to ideas: variable emphasis and suppressed activity.

• Teaming involves a combination of attention, empathetic understanding, and focused thought and purpose: high influence, mimicry and consistent emphasis.

• Exploring represents the possibility of establishing a meaningful relationship with someone, for which high levels of interest and openness to influence are necessary: high activity, variable emphasis and rhythm.

• Leading displays attention, interest and great focus in thought and purpose: high influence and activity levels and consistent emphasis.

Pentland, Alex Sandy. Honest signals: how they shape our world. MIT Press, 2008.


SVM output Dominance

High Dominance

Presentation notes (Dominance): Difference of Natural Turns, Difference of Speaking
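To illustrate how an SVM can turn two such cues into a dominance output, here is a minimal linear SVM trained by sub-gradient descent on the regularized hinge loss. Everything specific is an assumption: the training points are synthetic, the features are taken to be speaker A minus speaker B scaled to the range -1 to 1, and the kernel, hyperparameters, and function names are hypothetical. The slides do not say how the actual SVM was configured or trained.

```python
def train_linear_svm(samples, labels, learning_rate=0.1,
                     regularization=0.01, epochs=100):
    """Fit weights and bias by sub-gradient descent on the hinge loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # L2 regularization shrinks the weights on every step.
            weights = [w * (1.0 - learning_rate * regularization)
                       for w in weights]
            if margin < 1.0:
                # Hinge loss is active: push the boundary away from x.
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def svm_output(weights, bias, x):
    """Signed score; positive means speaker A is classed as dominant."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Synthetic, made-up data: (difference of natural turns, difference of speaking).
SAMPLES = [(0.4, 0.3), (-0.4, -0.3), (0.2, 0.5), (-0.2, -0.5)]
LABELS = [1, -1, 1, -1]  # +1: speaker A dominant, -1: speaker B dominant

weights, bias = train_linear_svm(SAMPLES, LABELS)
print(svm_output(weights, bias, (0.5, 0.4)) > 0)
```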

SVM output Discord

High Discord

Presentation notes (Discord): Interruption, Total Overlap, Mutual Silence %

SVM output Interest

Low Interest

High Interest

Presentation notes (Interest): Speaking %, Turn Duration, Mutual Silence %
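The speech cues named in these notes (speaking %, turn duration, mutual silence %, overlap, interruption) can all be derived from per-frame voice activity of the two speakers. The sketch below assumes the input is two equal-length lists of booleans, one per frame. The exact cue definitions, in particular what counts as an interruption, are assumptions made for illustration, as are the function names.

```python
def turn_lengths(activity):
    """Lengths, in frames, of each uninterrupted run of speech."""
    lengths, run = [], 0
    for speaking in activity:
        if speaking:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def count_interruptions(holder, other):
    """Times `other` starts speaking while `holder` is already speaking."""
    return sum(
        1
        for i in range(1, len(holder))
        if other[i] and not other[i - 1] and holder[i] and holder[i - 1]
    )


def conversation_features(speaker_a, speaker_b):
    """Speech cues for one window of two-person voice activity."""
    if len(speaker_a) != len(speaker_b) or not speaker_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(speaker_a)
    pairs = list(zip(speaker_a, speaker_b))
    turns_a = turn_lengths(speaker_a)
    return {
        "speaking_pct_a": 100.0 * sum(speaker_a) / n,
        "speaking_pct_b": 100.0 * sum(speaker_b) / n,
        "overlap_pct": 100.0 * sum(1 for a, b in pairs if a and b) / n,
        "mutual_silence_pct":
            100.0 * sum(1 for a, b in pairs if not a and not b) / n,
        "mean_turn_frames_a": sum(turns_a) / len(turns_a) if turns_a else 0.0,
        "interruptions_by_b": count_interruptions(speaker_a, speaker_b),
    }


# Made-up activity: B starts talking while A is still speaking.
a = [True, True, True, True, False, False, False, False]
b = [False, False, True, True, True, False, False, True]
print(conversation_features(a, b))
```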

Real-Time Feedback

• The concept is to analyse the acquired audio and video data and provide feedback during an ongoing conversation.

• We use a certain time window; after each window the data is analysed and feedback is generated if required.

• We also plan to observe the effect of this feedback on the speakers.

• The medium used to provide feedback is also very important. It should not be so abstract or distracting that it disturbs the user.
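The window-by-window loop described above can be sketched as follows. The decision rule used here (share of speaking time compared against a threshold) is a simple stand-in chosen for illustration, not the behaviour detection described elsewhere in the slides. The function names, the messages, and the 0.7 threshold are hypothetical.

```python
def feedback_for_window(speaker_a, speaker_b, dominance_threshold=0.7):
    """Return a feedback message for one window, or None if none is needed."""
    total = sum(speaker_a) + sum(speaker_b)
    if total == 0:
        return None  # nobody spoke in this window
    share_a = sum(speaker_a) / total
    if share_a > dominance_threshold:
        return "Speaker A is doing most of the talking"
    if share_a < 1.0 - dominance_threshold:
        return "Speaker B is doing most of the talking"
    return None


def run_feedback(speaker_a, speaker_b, window_frames):
    """Yield (window_index, message) for each full window needing feedback."""
    full_windows = min(len(speaker_a), len(speaker_b)) // window_frames
    for index in range(full_windows):
        start = index * window_frames
        end = start + window_frames
        message = feedback_for_window(speaker_a[start:end],
                                      speaker_b[start:end])
        if message is not None:
            yield index, message


# Made-up voice activity, analysed in windows of four frames.
a = [True] * 4 + [True, False, True, False] + [False] * 4
b = [False] * 4 + [False, True, False, True] + [True] * 4
for index, message in run_feedback(a, b, window_frames=4):
    print(index, message)
```

Returning `None` for balanced windows keeps the feedback sparse, in line with the point that it should not distract the speakers.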


Feedback Platforms

• Social Mediator via Nao
• Retrospective Feedback via Avatar Animation
• Socio-Feedback via Skype Application
• Socio-Feedback via Android Application
• Socio-Feedback via Vuzix


Applications

• Speech Coach
• Interview Analysis + Guidance
• Team Facilitation
• Monitoring of mental state (e.g., stress, alertness, concentration)


Q & A
