Real-Time Feedback System for Monitoring and Facilitating Discussions
Supervisor: Prof Justin DAUWELS, School of Electrical and Electronic Engineering, NTU
Co-Supervisor: Prof Daniel THALMANN, Institute for Media Innovation (IMI), NTU
PhD Student: Yasir Tahir, Institute for Media Innovation (IMI), NTU
Introduction
• Social Signal Processing (SSP) is the research and technological domain that aims to give computers the ability to sense and understand human social signals.
• Social intelligence is a facet of human intelligence that has been argued to be indispensable, and perhaps the most important, for success in life.
• Although each of us understands the importance of social signals in everyday situations, and despite recent advances in machine analysis of the relevant behavioural cues, developing automated systems for SSP is rather challenging.
Vinciarelli, Alessandro, Maja Pantic, and Hervé Bourlard. "Social signal processing: Survey of an emerging domain." Image and Vision Computing 27.12 (2009): 1743-1759.
Outline
1. General Objective
2. Speech Analysis
3. Video Analysis
4. Data Collection
5. Feedback
Enhancing Social Cognition
Social signals (figure labels): posture, gesture, appearance, speech, distance, facial expression
Behavioral cues (e.g. pitch, volume, turn-taking, posture)
Useful feedback for more effective conversations
Applications
• Mentoring for leadership development
• Team meetings
• Interviewing systems
Functional Blocks of the System
Speech Analysis
• Four types of vocal social signalling:
– Activity level
– Engagement
– Stress
– Mirroring
Pentland, Alex. "Social dynamics: Signals and behaviour." International Conference on Developmental Learning. Vol. 5. 2004.
Speech Features
Speech Analysis: Setup
Recording setup for two-person conversations: a Zoom H4n voice recorder with Sennheiser e845s microphones.
Speech Analysis: GUI
GUI for real-time analysis in MATLAB
Video Analysis: Setup
• For video analysis we use the Kinect sensor. It combines several sensors from which many useful features can be extracted.
• Physical capabilities
– Field of view (depth and RGB): 57.5 degrees horizontal, 43.5 degrees vertical
– Tilt range: -27 to +27 degrees (up and down)
– Depth range: 1.2 to 3.5 m
– Microphone array
Video Analysis: Features
• We extract the following features from the video data:
– Posture
– Gesture usage
– Nodding
– Audio-visual speech detection
Video Analysis: Detection
• Nodding detection
– We detect consecutive vertical head movements and classify them as YES (vertical nodding).
– Similarly, consecutive horizontal head movements are detected and classified as NO (horizontal nodding).
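The nodding rule above can be sketched as follows. This is a minimal illustration, not the system's actual detector: it assumes the head position is available per frame as an (x, y) pair, and the function names, the jitter threshold `min_step` and the `min_reversals` count are all hypothetical choices.

```python
from typing import List, Tuple


def count_reversals(values: List[float], min_step: float) -> int:
    """Count direction changes among frame-to-frame moves of at least min_step."""
    reversals = 0
    last_sign = 0
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if abs(delta) < min_step:
            continue  # treat small moves as jitter and ignore them
        sign = 1 if delta > 0 else -1
        if last_sign != 0 and sign != last_sign:
            reversals += 1
        last_sign = sign
    return reversals


def classify_head_motion(
    track: List[Tuple[float, float]],
    min_step: float = 0.01,   # assumed jitter threshold (same unit as the track)
    min_reversals: int = 2,   # assumed number of back-and-forth changes required
) -> str:
    """Return "YES" for vertical nodding, "NO" for horizontal nodding, else "NONE"."""
    xs = [p[0] for p in track]
    ys = [p[1] for p in track]
    vertical = count_reversals(ys, min_step)
    horizontal = count_reversals(xs, min_step)
    if vertical >= min_reversals and vertical > horizontal:
        return "YES"
    if horizontal >= min_reversals and horizontal > vertical:
        return "NO"
    return "NONE"
```

Requiring several direction reversals, rather than a single displacement, is one way to express "consecutive" movement and to avoid labelling a single glance up or sideways as a nod.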
Video Analysis: Detection
• Posture detection
– Posture is detected from the angle between a person's head and shoulders.
– So far we have implemented three basic sitting postures: upright, leaned back and hunched forward.
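A minimal sketch of the angle-based posture rule, under stated assumptions: joints are (x, y, z) positions with y pointing up and z the distance from the sensor, the person faces the sensor, and the 10-degree threshold is a hypothetical value rather than one taken from the system.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def lean_angle_deg(head: Point3D, shoulder_center: Point3D) -> float:
    """Angle of the shoulder-to-head vector from vertical, in degrees.

    Positive values mean the head is closer to the sensor than the shoulders
    (leaning forward); negative values mean it is farther away (leaning back).
    """
    dy = head[1] - shoulder_center[1]   # vertical offset
    dz = shoulder_center[2] - head[2]   # forward offset towards the sensor
    return math.degrees(math.atan2(dz, dy))


def classify_posture(
    head: Point3D,
    shoulder_center: Point3D,
    threshold_deg: float = 10.0,  # assumed threshold, to be tuned on real data
) -> str:
    """Map the lean angle to one of the three sitting postures."""
    angle = lean_angle_deg(head, shoulder_center)
    if angle > threshold_deg:
        return "hunched forward"
    if angle < -threshold_deg:
        return "leaned back"
    return "upright"
```

For example, a head 0.2 m above and 0.1 m in front of the shoulder centre gives an angle of roughly 27 degrees, which this sketch labels as hunched forward.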
Video Analysis: Detection
• Speech detection
– Speech is detected from lip motion in the detected face.
– Fusing the audio data with the lip-motion cue improves speech detection and reduces false detections.
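One simple way to fuse the two cues is a per-frame AND rule, sketched below. The slides do not state which fusion rule is used, so this rule, the function name and both thresholds are assumptions made for illustration.

```python
from typing import List


def fuse_speech_detection(
    lip_motion: List[float],
    audio_energy: List[float],
    lip_threshold: float = 0.5,     # assumed threshold on normalised lip motion
    energy_threshold: float = 0.5,  # assumed threshold on normalised audio energy
) -> List[bool]:
    """Mark a frame as speech only when both the visual and the audio cue agree."""
    if len(lip_motion) != len(audio_energy):
        raise ValueError("both cues must cover the same frames")
    return [
        lip > lip_threshold and energy > energy_threshold
        for lip, energy in zip(lip_motion, audio_energy)
    ]
```

With this rule, lip motion without sound (for example chewing) and sound without lip motion (for example the other speaker's voice) are both rejected, which is one way false detections can be reduced.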
Video Analysis: Detection
• Gesture Usage
– Instead of looking for specific gestures, we calculate the overall hand movement. A higher value of this measure represents greater use of gestures, whereas a lower value represents less gesture use.
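The overall hand movement could be computed, for instance, as the total path length travelled by both hands. The sketch below assumes per-frame 3D hand positions from the skeleton tracker; the function names are hypothetical, and the actual measure used in the system may differ.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def path_length(points: List[Point3D]) -> float:
    """Sum of Euclidean distances between consecutive positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def hand_movement(left_hand: List[Point3D], right_hand: List[Point3D]) -> float:
    """Total distance travelled by both hands over the given frames."""
    return path_length(left_hand) + path_length(right_hand)
```

Dividing the result by the duration of the segment would give a rate that can be compared across segments of different length.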
Data Collection
• We recorded 50 sessions of two-person discussions, each five minutes long. The topics were everyday matters ranging from social issues to movies.
• Four people (two males and two females) participated in the data collection.
• Audio data was recorded using lapel microphones.
• Video and depth data were recorded using Kinect sensors, one for each person.
• We focused on meeting scenarios:
– Participants were seated.
– We did not ask the participants to restrict their movements, in order to keep the social aspect intact.
Data Collection: Example
Data Collection: Results
Distance of Speakers  Seating Arrangement  Ambient Noise  No. of Samples  Undetected Speech  False Detection  Interference  Overall Accuracy
0.4-0.5 m             Side                 Low            19200k          6.5%               4%               9%            80.5%
0.7-0.8 m             Side                 Low            19200k          2.0%               2.5%             3.5%          92%
1.0 m                 Front                Low            19200k          2.5%               2%               12%           84.5%
1.2 m                 Front                Low            19200k          2.0%               0.0%             1%            97%
1.5 m                 Front                Low            19200k          0.5%               0.5%             0.5%          98.5%
1.8 m                 Front                Low            19200k          2.0%               0.0%             0.0%          98.0%
1.5-1.8 m             Front                High           19200k          0.5%               1.5%             1.0%          97%

Visual Cue  Video Duration  False Detection  Accuracy
Posture     500 min         5%               95%
Gesture     500 min         13%              87%
• For the visual cues, we analyzed the video recorded in the 50 sessions.
Methodology
1. Audio and Video Data Acquisition
2. Pre-processing
– Pre-processing Step for Audio: Speech Detection
– Pre-processing Step for Video: Face and Skeleton Detection
3. Feature Extraction: Speech Cues and Visual Cues
4. Behavior Detection via Support Vector Machines (SVM): Dominance, Interest, Discord, Consistency, Mirroring
5. Feedback
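To illustrate the behavior-detection stage, the sketch below trains a small linear SVM by sub-gradient descent on the hinge loss and uses it to label a feature vector as high or low dominance. It is a toy stand-in, not the classifier used in the system: the training procedure, the hyperparameters and the two-feature representation (for example speaking-time share and normalised volume) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],          # +1 (e.g. high dominance) or -1 (low dominance)
    lr: float = 0.1,                # assumed learning rate
    lam: float = 0.01,              # assumed L2 regularisation strength
    epochs: int = 500,              # assumed number of passes over the data
) -> Tuple[List[float], float]:
    """Fit weights w and bias b by sub-gradient descent on the regularised hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Regularisation step: shrink the weights slightly.
            w = [wi * (1.0 - lr * lam) for wi in w]
            # Hinge-loss step: only samples inside the margin move the boundary.
            if y * score < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on which side of the boundary x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0.0 else -1
```

In practice one such classifier would be trained per behavior (dominance, interest, discord, consistency, mirroring) on features extracted from labelled sessions, typically with an established SVM library rather than hand-written training code.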
Social Roles
• In real-life situations, individuals display combinations of the honest signals. Four core groups of social roles have been identified: listening, teaming, exploring and leading.
• Listening is said to signal a combination of attentive interest and openness to ideas: variable emphasis and suppressed activity.
• Teaming involves a combination of attention, empathetic understanding and focused thought and purpose: high influence, mimicry and consistent emphasis.
• Exploring represents the possibility of establishing a meaningful relationship with someone, which requires high levels of interest and openness to influence: high activity, variable emphasis and rhythm.
• Leading displays attention, interest and great focus in thought and purpose: high influence and activity levels, and consistent emphasis.
Pentland, Alex Sandy. Honest signals: how they shape our world. MIT Press, 2008.
SVM output: Dominance (figure; marked region: High Dominance)
SVM output: Discord (figure; marked region: High Discord)
SVM output: Interest (figure; marked regions: Low Interest, High Interest)
Real-Time Feedback
• The concept is to analyse the acquired audio and video data and provide feedback during an ongoing conversation.
• We use a set time window; after each window the data is analysed and feedback is generated if required.
• We also plan to observe the effect of this feedback on the speakers.
• The medium used to provide feedback is also very important: it should be neither so abstract nor so distracting that it disturbs the user.
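The window-by-window loop can be sketched as follows. The example assumes that each frame has already been labelled with the active speaker (or `None` for silence) and uses speaking share as the only trigger; the function name, the 70% threshold and the message text are hypothetical.

```python
from typing import List, Optional


def window_feedback(
    speaker_per_frame: List[Optional[str]],
    window_size: int,
    dominance_threshold: float = 0.7,  # assumed share of speech that triggers feedback
) -> List[Optional[str]]:
    """Return one feedback message (or None) for each complete window."""
    messages: List[Optional[str]] = []
    for start in range(0, len(speaker_per_frame) - window_size + 1, window_size):
        window = speaker_per_frame[start:start + window_size]
        spoken = [s for s in window if s is not None]
        message = None
        if spoken:
            for speaker in sorted(set(spoken)):
                share = spoken.count(speaker) / len(spoken)
                if share > dominance_threshold:
                    message = f"{speaker} is dominating the conversation"
                    break
        messages.append(message)
    return messages
```

Returning `None` for most windows reflects the idea that feedback is generated only if required, which keeps the channel from becoming a distraction.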
Feedback Platforms
• Social Mediator via Nao
• Retrospective Feedback via Avatar Animation
• Socio-Feedback via Skype Application
• Socio-Feedback via Android Application
• Socio-Feedback via Vuzix
Applications
Speech Coach
Interview Analysis + Guidance
Team Facilitation
Monitoring of mental state (e.g., stress, alertness, concentration)
Q & A