
Page 1: Part 2: Audio-Visual HRI: Methodology and Applications in …cvsp.cs.ntua.gr/interspeech2018/slides/IS2018-Tutorial... · 2018. 9. 18. · Interspeech2018 Tutorial: Multimodal Speech

Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory

National Technical University of Athens, Greece (NTUA)
Robot Perception and Interaction Unit,

Athena Research and Innovation Center (Athena RIC)

Petros Maragos and Athanasia Zlatintsi


Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018

slides: http://cvsp.cs.ntua.gr/interspeech2018

Part 2: Audio-Visual HRI: Methodology and

Applications in Assistive Robotics

Page 2:

Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction

2A. Audio-Visual HRI:

General Methodology

Page 3:

Multimodal HRI: Applications and Challenges

education, entertainment

assistive robotics

Challenges:
• Speech: distance from microphones, noisy acoustic scenes, variabilities
• Visual recognition: noisy backgrounds, motion, variabilities
• Multimodal fusion: incorporation of multiple sensors, integration issues
• Elderly users, children

Page 4:

Database of Multimodal Gesture Challenge (in conjunction with ACM ICMI 2013)

• 20 cultural/anthropological signs of the Italian language

• ‘vattene’ (get out)
• ‘vieni qui’ (come here)
• ‘perfetto’ (perfect)
• ‘furbo’ (clever)
• ‘che due palle’ (what a nuisance!)
• ‘che vuoi’ (what do you want?)
• ‘d’accordo’ (together)
• ‘sei pazzo’ (you are crazy)
• ‘combinato’ (combined)
• ‘freganiente’ (damn)

• ‘ok’ (ok)
• ‘cosa ti farei’ (what would I make to you!)
• ‘basta’ (that’s enough)
• ‘prendere’ (to take)
• ‘non ce ne piu’ (there is none more)
• ‘fame’ (hunger)
• ‘tanto tempo’ (a long time ago)
• ‘buonissimo’ (very good)
• ‘messi d’accordo’ (agreed)
• ‘sono stufo’ (I am sick)

• 22 different users
• approximately 20 repeats per user (~1 minute for each gesture video)


Page 5:

Depth (vieni qui - come here)

User Mask (vieni qui - come here)

Skeleton (vieni qui - come here)

RGB Video & Audio

Multimodal Gesture Signals from the Kinect Sensor

[S. Escalera, J. Gonzalez, X. Baro, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante, “Multi-modal gesture recognition challenge 2013: Dataset and results”, Proc. 15th ACM Int’l Conf. Multimodal Interaction, 2013.]

ChaLearn corpus

Page 6:

Multimodal Hypothesis Rescoring + Segmental Parallel Fusion

Pipeline (figure): single-stream models for audio, skeleton, and handshape each generate an N-best hypothesis list; the multiple hypotheses are rescored and resorted into the best single-stream hypotheses, which parallel segmental fusion combines into the best multistream hypothesis, i.e. the recognized gesture sequence.

[V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, “Multimodal Gesture Recognition via Multiple Hypotheses Rescoring”, JMLR 2015]

Page 7:

Audio-Visual Fusion & Recognition

Audio and visual modalities for an A-V gesture word sequence: ground-truth transcriptions (“REF”) and decoding results for audio and 3 different A-V fusion schemes. Results in top rank of the ChaLearn ACM 2013 Gesture Challenge (50 teams; 22 users x 20 gesture phrases x 20 repeats). [V. Pitsikalis, A. Katsamanis, S. Theodorakis & P. Maragos, JMLR 2015]

Page 8:

Visual Activity Recognition

Action: sit to stand


Gestures: come here, come near

Sign: (GSL) Europe

Page 9:

Visual action recognition pipeline

Pipeline (figure): Video → Temporal Sliding Window → Visual Feature Extraction → Classifier → Post-processing → Recognized Sequence (e.g. Sit to Stand, Walk)
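The temporal sliding-window stage of the pipeline can be sketched in a few lines of Python; the window length and stride (in frames) below are illustrative assumptions, not the tutorial's actual settings:

```python
def sliding_windows(num_frames, win_len=30, stride=10):
    """Return (start, end) frame-index pairs of the overlapping temporal
    windows that are fed to feature extraction and classification."""
    windows = []
    start = 0
    while start + win_len <= num_frames:
        windows.append((start, start + win_len))
        start += stride
    return windows
```

Each window is then encoded and classified independently, and the per-window decisions are smoothed in the post-processing step.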

Page 10:

Visual Front-End (figure, with example videos): Video → Optical Flow → Dense Trajectories → Feature Descriptors

Page 11:

Features: Dense Trajectories

1. Feature points are sampled on a regular grid at multiple scales

2. Feature points are tracked through consecutive video frames

3. Descriptors are computed in space-time volumes along trajectories

[Wang et al., IJCV 2013]
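Steps 1–2 above can be sketched as follows; the grid step, scale set, and the flow callback are didactic assumptions, not the exact parameters of [Wang et al., IJCV 2013]:

```python
def dense_grid(width, height, step=5, scales=(1.0, 0.5)):
    """Sample feature points on a regular grid at multiple spatial scales,
    as in the dense-trajectories front end (step/scales are illustrative)."""
    points = []
    for s in scales:
        w, h = int(width * s), int(height * s)
        for y in range(0, h, step):
            for x in range(0, w, step):
                # keep the scale with each point so tracking stays per-scale
                points.append((x, y, s))
    return points

def track(point, flow):
    """Move one point by the (dx, dy) optical-flow estimate at its location;
    repeating this over consecutive frames yields a trajectory."""
    x, y, s = point
    dx, dy = flow(x, y)
    return (x + dx, y + dy, s)
```

Step 3 then computes descriptors (HOG, HOF, MBH) in space-time volumes around each resulting trajectory.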

Page 12:

K-means Clustering and Dictionary

Feature Samples → K-means → Dictionary
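The dictionary of K visual words is built by clustering descriptor samples. A minimal Lloyd's k-means on scalar features shows the idea (real descriptors are high-dimensional; this 1-D version is a didactic sketch):

```python
def kmeans_1d(samples, centers, iters=10):
    """Minimal Lloyd's k-means on scalar features: the final centers
    are the dictionary of K visual words used by the encoding stage."""
    for _ in range(iters):
        # assignment step: each sample goes to its nearest center
        clusters = [[] for _ in centers]
        for x in samples:
            i = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[i].append(x)
        # update step: each center becomes the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# two well-separated clumps of feature values -> two visual words
words = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.2, 8.8], centers=[0.0, 5.0])
```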

Page 13:

Feature Encoding: VLAD
• BoF – size: K
• VLAD – size: K*D
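The two encodings differ in what they aggregate per visual word: BoF counts assignments (a K-dimensional histogram), while VLAD sums the residuals to the assigned word (K*D dimensions). A sketch with scalar features (so D = 1; real descriptors are D-dimensional):

```python
def bof(features, words):
    """Bag-of-features: histogram of nearest-word counts (size K)."""
    hist = [0] * len(words)
    for f in features:
        i = min(range(len(words)), key=lambda i: abs(f - words[i]))
        hist[i] += 1
    return hist

def vlad(features, words):
    """VLAD: per-word sum of residuals f - word (size K*D; here D = 1)."""
    res = [0.0] * len(words)
    for f in features:
        i = min(range(len(words)), key=lambda i: abs(f - words[i]))
        res[i] += f - words[i]
    return res
```

In practice both vectors are normalized before classification; VLAD's residuals retain first-order information that the plain histogram discards.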

Page 14:

Visual Action Classification

Train: labeled videos → classifier
Test: unlabeled videos → classifier → labels
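The train/test protocol can be sketched with a dependency-free stand-in classifier (the tutorial's pipelines use an SVM; a nearest-centroid classifier is used here only to illustrate the same split):

```python
def train(labeled):
    """'Train': average the encoded vectors of each action class.
    (Stand-in for the SVM of the actual pipeline.)"""
    groups = {}
    for label, vec in labeled:
        groups.setdefault(label, []).append(vec)
    return {lab: [sum(dim) / len(vs) for dim in zip(*vs)]
            for lab, vs in groups.items()}

def predict(model, vec):
    """'Test': label an unlabeled video by its nearest class centroid."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(model, key=lambda lab: dist(model[lab]))
```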

Page 15:

Temporal Segmentation Results


[Figure: temporal segmentation vs. ground truth, for the raw SVM output and for SVM+Filter+HMM (Viterbi); classes: Sit, Walk, Stand, B.M.]
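The HMM/Viterbi smoothing step removes single-frame glitches from the per-frame SVM decisions by penalizing label switches. A minimal sketch (the scores and the switch penalty are illustrative log-domain quantities, not the trained HMM's parameters):

```python
def viterbi_smooth(frame_scores, switch_penalty=1.0):
    """Viterbi pass over per-frame classifier scores: each frame keeps
    its label score, and changing label between frames costs a penalty."""
    n_labels = len(frame_scores[0])
    best = list(frame_scores[0])
    back = []
    for scores in frame_scores[1:]:
        ptr, new = [], []
        for j in range(n_labels):
            # stay on label j for free, or pay the penalty to switch into it
            i = max(range(n_labels),
                    key=lambda i: best[i] - (switch_penalty if i != j else 0.0))
            ptr.append(i)
            new.append(best[i] - (switch_penalty if i != j else 0.0) + scores[j])
        best, back = new, back + [ptr]
    # backtrack the best path
    j = max(range(n_labels), key=lambda j: best[j])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]
```

With a non-zero penalty, an isolated frame that weakly favors the wrong label is overridden by its temporal context.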

Page 16:

Action Recognition Results (4 actions, 6 patients): Descriptors + Post-processing Smoothing

Dense Trajectories + BOF Encoding

Results improve by adding Depth and/or advanced Encoding

Page 17:

Gesture Recognition

Page 18:

Gesture Recognition Challenges
Challenging task of recognizing human gestural movements:

• Large variability in gesture performance.
• Some gestures can be performed with the left or right hand.

I want to Perform a Task

I want to Sit Down

Park

Come Closer

Page 19:

Visual Gesture Classification Pipeline

Class Probabilities

(SVM scores)

Page 20:

Applying Dense Trajectories on Gesture Data

Page 21:

Extended Results on Gesture Recognition

[Bar chart: accuracy (%) of multiple descriptors (trajectories, HOG, HOF, MBH, combined) under multiple encodings (BoVW, VLAD, Fisher); mean over patients. MOBOT-I, Task 6a (8 gestures, 8 patients)]

Page 22:

Visual Synergy: Semantic Segmentation + Gesture Recognition

foreground/background segmentation + gesture recognition

A. Guler, N. Kardaris, S. Chandra, V. Pitsikalis, C. Werner, K. Hauer, C. Tzafestas, P. Maragos and I. Kokkinos, “Human Joint Angle Estimation and Gesture Recognition for Assistive Robotic Vision”, ECCV Workshop on Assistive Computer Vision and Robotics, 2016.

Median relative improvement: 9%

Page 23:

Spoken Command Recognition

Page 24:

Distant Speech Recognition in Voice-enabled Interfaces

Noise

Other Speech

Distant Microphones, Reverberation

https://dirha.fbk.eu/

Page 25:

Smart Home Voice Interface

Sweet home listen! Turn on the lights in the living room!

Main technologies:
• Voice Activity Detection
• Acoustic Event Detection
• Speaker Localization
• Speech Enhancement
• Keyword Spotting
• Far-field command recognition

Page 26:

DIRHA demo (“spitaki mou” – “my little home”)

• I. Rodomagoulakis, A. Katsamanis, G. Potamianos, P. Giannoulis, A. Tsiami, P. Maragos, “Room-localized spoken command recognition in multi-room, multi-microphone environments”, Computer Speech & Language, 2017.

• A. Tsiami, I. Rodomagoulakis, P. Giannoulis, A. Katsamanis, G. Potamianos and P. Maragos,“ATHENA: A Greek Multi-Sensory Database for Home Automation Control”, INTERSPEECH 2014.

https://www.youtube.com/watch?v=zf5wSKv9wKs

Page 27:

Spoken-Command Recognition Module for HRI

Acoustic front-end → VAD → KWS → ASR (HTK tools, Python interface)

Keyword “robot” vs. garbage models; MEMS microphone array with delay-and-sum (DS) beamforming

Example commands (German): “Wo bin ich” (Where am I?), “Hilfe” (Help), “Gehe rechts” (Go right), …

Output: robot command

Integrated in ROS, always-listening mode, real-time performance
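The delay-and-sum beamforming step aligns the microphone channels toward the speaker before recognition. A minimal sketch with integer sample delays (real front-ends estimate fractional delays from the steering direction; this is an illustrative simplification):

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum (DS) beamforming: advance each microphone channel
    by its steering delay (in samples) and average, so the signal from
    the steered direction adds up coherently while noise averages out."""
    # only the region where all delayed channels overlap is usable
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]
```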

Page 28:

Online Spoken Command Recognition (Greek, German, Italian, English)

Segmentation per channel (ch-1 … ch-M): sil – command – sil within generic speech; speaker at 1.5–3 m

Targeted acoustic scenes; microphone arrays: pentagon ceiling array (Shure), MEMS mic array, Kinect mic array

Page 29:

Audio-Visual Fusion for

Multimodal Gesture Recognition

Page 30:

Multimodal Fusion: Complementarity of Visual and Audio Modalities

Similar audio,distinguishable gesture

Distinguishable audio,similar gesture

Page 31:

Audio-Visual Fusion: Hypotheses Rescoring

speech & gesture recognition

spoken command hypotheses (N-best)
hypothesis | normalized score
A1 Help | 0.2
A2 Stop | 0.19
A3 Park | 0.12
…
A19 Go straight | 0.01

visual gesture hypotheses (N-best)
hypothesis | normalized score
V1 Stop | 0.5
V2 Go away | 0.15
V3 Help | 0.12
…
V19 Go straight | 0.01

combined score(h) = w_A · score_A(h) + w_V · score_V(h), maximized over hypotheses h (w_A, w_V: modality weights)

fusion hypotheses
hypothesis | combined score
F1 Stop | 0.205
F2 Help | 0.196
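The rescoring can be sketched directly from the slide's N-best lists; the modality weights below are illustrative values chosen for the sketch, not the weights trained in the tutorial's system:

```python
def rescore(audio_nbest, visual_nbest, w_audio=0.95, w_visual=0.05):
    """Late fusion by N-best rescoring: combine normalized per-hypothesis
    scores with modality weights and rank the fused hypotheses."""
    fused = {}
    for hyp, s in audio_nbest.items():
        fused[hyp] = fused.get(hyp, 0.0) + w_audio * s
    for hyp, s in visual_nbest.items():
        fused[hyp] = fused.get(hyp, 0.0) + w_visual * s
    # best multimodal hypothesis first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

audio = {"help": 0.20, "stop": 0.19, "park": 0.12}
visual = {"stop": 0.50, "go away": 0.15, "help": 0.12}
ranked = rescore(audio, visual)
```

Note how "stop", ranked second by audio alone, wins after fusion because the visual stream supports it strongly.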

Page 32:

Offline Multimodal Command Classification
• Leave-one-out experiments (MOBOT-I.6a data: 8 patients, 8 gestures)
• Unimodal: audio (A) and visual (V)
• Multimodal (AV): N-best list rescoring

[Bar chart: per-patient (p1, p4, p7, p8, p9, p11, p12, p13) and average classification accuracies (%) for the A, V, and AV systems]

Multimodal confusability graph

Page 33:

HRI Online Multimodal System Architecture
• ROS-based integration
• Spoken command recognition node
• Activity detection node
• Gesture classifier node
• Multimodal fusion node

Communication using ROS messages

Page 34:

Audio-Gestural Command Recognition: Online Processing System – Open-Source Software

http://robotics.ntua.gr/projects/building-multimodal-interfaces

Pipeline (figure):
• Visual stream: front-end + activity detector → gesture classification (gesture vs. background models) → recognized visual gesture + confidence
• Audio stream: audio front-end → keyword spotting and recognition (speech vs. background models) → recognized audio command + confidence
• Post-process fusion → final recognized result

N. Kardaris, I. Rodomagoulakis, V. Pitsikalis, A. Arvanitakis and P. Maragos, “A platform for building new human-computer interface systems that support online automatic recognition of audio-gestural commands”, Proc. ACM Multimedia 2016.

Page 35:

2B. Audio-Visual HRI:

Applications in Assistive Robotics

Page 36:

EU Project MOBOT: Motivation

Experiments conducted at Bethanien Geriatric Center, Heidelberg

• Mobility & cognitive impairments, prevalent in the elderly population, are limiting factors for Activities of Daily Living (ADLs)
• Intelligent assistive devices (robotic rollator) aim to provide context-aware and user-adaptive mobility (walking) assistance

MOBOT rollator

Page 37:

Multi-Sensor Data for HRI

Kinect1 RGB data, Kinect1 depth data, MEMS audio data, GoPro RGB data, HD1 camera data, HD2 camera data

Page 38:

Action Sample Data and Challenges

• Visual noise by intruders
• Multiple subjects in the scene, even at the same depth level
• Frequent and extreme occlusions, missing body parts (e.g. face)
• Significant variation in subjects’ pose, actions, visibility, background

Stand-to-Sit – P1 Stand-to-Sit – P3 Stand-to-Sit – P4

Page 39:

Audio-Gestural Command Recognition: Overview of our Multimodal Interface

Pipeline (figure): MEMS linear array → spoken command recognition; Kinect RGB-D camera → visual action-gesture recognition; N-best hypotheses & scores → multimodal late fusion → best AV hypothesis (on the MOBOT robotic platform)

[ I. Rodomagoulakis, N. Kardaris, V. Pitsikalis, E. Mavroudi, A. Katsamanis, A. Tsiami and P. Maragos, ICASSP 2016 ]

Page 40:

Clinical Studies (MOBOT)
• Kalamata – Diaplasis (30 patients)
• Heidelberg – Bethanien (19 patients)
• Speech, gestures, combination: 3 repetitions of 5 commands

Page 41:

Validation experiments (Bethanien, Heidelberg): Audio-Gestural recognition in action (1/2)


Page 42:

EU Project I-SUPPORT: Overview (Gesture & Spoken Command Recognition)

[Diagram: multichannel audio (ch-1 … ch-M) and dense trajectories of visual motion]

Page 43:

Audio-Gestural Recognition: Validation Experiments (FSL, Rome)

Page 44:

Validation Setup: FSL, Rome; Bethanien, Heidelberg

Page 45:

Gesture Recognition Challenges

Data collection: KIT, ICCS-NTUA (pre-validation), FSL, Bethanien
• Different viewpoints
• Poor gesture performance
• Random movements

Page 46:

Gesture Recognition – Depth Modality

• Experiments with Depth and Log-Depth streams
• Extraction of Dense Trajectories performs better on the Log-Depth stream

[Figure: Dense Trajectories on the RGB stream vs. the Log-Depth stream]
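A log-depth stream remaps raw depth so that more gray levels are spent on near distances, where the gesturing hands are. The mapping and range below are an illustrative assumption, not the exact I-SUPPORT formula:

```python
import math

def log_depth(depth_mm, d_min=500.0, d_max=5000.0):
    """Map a raw depth value (mm) to an 8-bit log-depth intensity.
    Logarithmic compression increases contrast at near depths, which
    helps dense-trajectory extraction on the depth stream
    (d_min/d_max are illustrative sensor-range assumptions)."""
    d = min(max(depth_mm, d_min), d_max)
    return round(255 * math.log(d / d_min) / math.log(d_max / d_min))
```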

Page 47:

Gesture Offline Classification – Results

• ICCS Dataset (24 users, 28 gestures): two different setups, two different streams, different encoding methods, different features
• KIT Dataset (8 users, 8/10 gestures): two different setups; average gesture recognition accuracy: legs (8 gestures): 83%, back (10 gestures): 75%
• FSL Pre-Validation Dataset (5 users, 10 gestures): training/fine-tuning of the models for audio-visual gesture recognition; average accuracy for the 5 gestures used in validation: legs: 85%, back: 75%

Page 48:

Multimodal Fusion and On-line Integration
● ROS (Robot Operating System) based integration
● Multimodal “late” fusion (validation @ Bethanien, Heidelberg) → ROS message to FSM

Page 49:

Validation Results: Command Recognition Rate (CRR)
(= accuracy only on well-performed commands)

Bethanien, Heidelberg / FSL, Rome

Round 2 (“back” position):
                 | Gesture-only scenario | Audio-gestural scenario
Without training |        59.6%          |        86.2%
With training    |        68.7%          |        79.1%

Round 2 (no training, audio-gestural scenario, “legs” position): 83.5%

Round 1 (no training, audio-gestural scenario): back 73.8% (A)*, legs 84.7%
Round 1 (no training, audio-gestural scenario): back 87.2%, legs 79.5%

Page 50:

I-SUPPORT system video

Page 51:

Part 2: Conclusions
Synopsis: Multimodal Action Recognition and Human-Robot Interaction
• Visual action recognition
• Gesture recognition
• Spoken command recognition
• Online multimodal system and applications in assistive robotics

Ongoing work:
• Fuse human localization & pose with activity recognition
• Activities: actions – gestures – spoken commands – gait
• Applications in perception and robotics

Tutorial slides: http://cvsp.cs.ntua.gr/interspeech2018

For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr

Page 52:

APPENDIX

Page 53:

More Action Recognition results: DT + Gabor3D + Depth

MOBOT-I.3b (6 patients, 4 actions)

Page 54:

MOBOT Year-2 Review, Bristol, 26 March 2015

Action Recognition results: Comparison of HOG (RGB) vs HOD (Depth)

Encoding: BOF-L1

Page 55:

VLAD vs BOF comparison

MOBOT-I Scenario 3b (3 actions + BM, 6 patients).

Page 56:

Overview: Visual Gesture Recognition

Pipeline (figure): RGB+D → pose estimation & annotations → handshape, hand position & movement → spatiotemporal features + training → statistical modeling → classification / recognition

Page 57:

GoPro Camera Data

“Help”

“I want to stand up”

Page 58:

Gesture Classification results on GoPro data

MOBOT-I 6a (8 patients, 8 gestures)

Page 59:

Multimodal Gesture Classification on the MOBOT 6a Dataset

Task 6a: user is sitting & gesturing; 13 patients

• GoPro videos & MEMS audio aligned with annotations
• Noisy audio-visual scenes
• Different speech & gesture pronunciations

IS: instructor speaking; PS: patient speaking; BN: background noise