CINEMO – A French Spoken Language Resource for Complex Emotions: Facts and Baselines

Björn Schuller, Riccardo Zaccarelli, Nicolas Rollet, Laurence Devillers. CNRS-LIMSI Spoken Language Processing Group, Orsay, France. Thursday 20th May 2010, 12.25-12.45 PM, O21 - Emotion, Sentiment.

Outline

• Introduction

• CINEMO Corpus Statistics

• Recognition of Complex Emotions

• Conclusions

Models of Emotion

• Dimensional Model
  - Orthogonal system: arousal, valence, dominance/potency, ...
  - Ideally non-correlated

• Categorical Model
  - Discrete affective states, e.g. “Big 6” (Ekman/MPEG-4)
  - Assignable in emotion sphere
  - “Intensity” turns category into dimension

• Complex Emotions
  - “Soft” hit for several categories
  - “Major / minor” emotion

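[Figure: emotion sphere with axes valence v and arousal a, each ranging from -1.0 to 1.0; an emotion is the vector e = [v, a]^T. Categories placed in the plane: surprise, joy, anticipation, acceptance, neutrality, sadness, disgust, anger, fear.]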

Databases – Nine Popular Examples

Corpus      Language  Content   # Emotions   # Instances  h:mm  # Subjects  Type
ABC         German    fixed     6            431          1:15  8 (4 f)     acted
AVIC        English   variable  I (5)        3002         1:47  21 (10 f)   natural
DES         Danish    fixed     5            419          0:28  4 (2 f)     acted
EMO-DB      German    fixed     7            494          0:22  10 (5 f)    acted
eNTERFACE   English   fixed     6            1277         1:00  42 (8 f)    acted
SAL         English   variable  A/V          1692         1:41  4 (2 f)     natural
SmartKom    German    variable  (10)         3823         7:08  79 (47 f)   natural
SUSAS       English   fixed     (3)          3593         1:01  7 (3 f)     natural
VAM         German    variable  A/V/D (3x5)  946          0:47  47 (32 f)   natural

Corpus Stats and Protocol

• Size
  - 3 992 instances after segmentation
  - 2:13:59 h net playtime

• Subjects
  - 51 speakers: 21 female (1 656 instances), 30 male (2 336 instances)
  - 4 age groups
  - No professional actors

• Protocol
  - Dubbing selected scenes from 12 French movies
  - Broad coverage of emotions
  - Situations close to everyday emotions (Rottenberg et al., 2007)
  - Well suited to induce mood (Gerrards-Hesse et al., 1994)
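As a quick consistency check on the corpus size figures, the per-gender instance counts can be summed and the net playtime divided by the number of instances. This is a minimal sketch; the variable names are illustrative, and the mean instance length is derived here rather than reported on the slide.

```python
# Figures taken from the slide; variable names are illustrative.
instances_female = 1656
instances_male = 2336
total_instances = instances_female + instances_male

# Net playtime 2:13:59 (h:mm:ss) converted to seconds.
hours, minutes, seconds = 2, 13, 59
playtime_s = hours * 3600 + minutes * 60 + seconds

# Derived quantity (not stated on the slide): average instance length.
mean_instance_s = playtime_s / total_instances

print(total_instances)            # 3992
print(round(mean_instance_s, 2))  # 2.01
```

Under this reading, an instance lasts about two seconds on average.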

A Dozen Movies

• Good Blend to Cover Emotions
  - Extrapolation of interpersonal behavior patterns
  - Affective Computing

• Areas of Application
  - Interpretation of the user intention
  - Accommodation in the communication
  - Objective measurement
  - Transmission of emotion
  - Emotional adaptation
  - Multimedia Retrieval
  - Video gaming and entertainment
  - Surveillance
  - Encoding

Movies

• “Karaoke”
  - Participants superimpose their voice on the actor’s
  - Actor’s voice audible or muted
  - Dialog/pauses shown as a Karaoke
  - Current word highlighted
  - Spoken interactions, natural contexts

• Example Scene: “Chaos”
  - Affective state: sadness, disappointment
  - Description: speaker reports humiliating behavior of boyfriend
  - Degree of involvement: highly involved
  - Type of action: storytelling
  - Implied temporalities: recent past

Scenes and Roles

• Numbers
  - 29 scenes, 1 or 2 players at a time: 14 male, 7 female, 6 mixed gender, 2 female–female scenes
  - 31 roles: 14 female and 17 male

• Scene Repetition
  - Each scene could be repeated
  - Number of occurrences per attempt: 1 945 (first), 1 518 (second), 433 (third), 84 (fourth), 12 (fifth)
  - Mean number of scene repetitions: 1.67
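The reported mean of 1.67 can be reproduced as the average attempt number weighted by the number of instances per attempt. That interpretation is an assumption, but it matches the slide's figure; the variable names below are illustrative.

```python
# Instances per attempt number, as listed on the slide.
instances_per_attempt = {1: 1945, 2: 1518, 3: 433, 4: 84, 5: 12}

total = sum(instances_per_attempt.values())
weighted_sum = sum(attempt * count
                   for attempt, count in instances_per_attempt.items())
mean_attempt = weighted_sum / total

# Scene and role counts from the same slide.
scenes = {"male": 14, "female": 7, "mixed": 6, "female-female": 2}
roles = {"female": 14, "male": 17}

print(total)                   # 3992
print(round(mean_attempt, 2))  # 1.67
print(sum(scenes.values()), sum(roles.values()))  # 29 31
```

The per-attempt counts also add up to the 3 992 instances of the full corpus.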

A Linguistic Perspective

• N-Gram Frequencies
  - 119 turns with 1 609 words
  - Vocabulary size of 562
  - 4.4 graphemes on average
  - Uni-grams “c’” (this), “est” (is), and “j’” (I) > 50 times
  - Bi-gram “c’est” > 10 times
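Statistics of this kind can be gathered with a few lines of code. The sketch below assumes that elided forms such as “c’” count as separate tokens (which the uni-gram list suggests) and that graphemes are counted without the apostrophe; the tokenizer, function names, and example turn are hypothetical and not taken from the corpus tooling.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case French text and split it, keeping elided forms like "c'"."""
    text = text.lower().replace("\u2019", "'")  # normalise typographic apostrophes
    return re.findall(r"\w+'|\w+", text)

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical example turn, not taken from the corpus.
turn = "C'est bien, c'est ça, j'ai dit"
tokens = tokenize(turn)

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
vocabulary_size = len(set(tokens))
# Average word length in graphemes, ignoring the elision apostrophe.
mean_graphemes = sum(len(t.rstrip("'")) for t in tokens) / len(tokens)

print(unigrams.most_common(2))
print(bigrams[("c'", "est")])
```

Because punctuation is simply skipped, bi-grams in this sketch may span clause boundaries; a stricter count would split at punctuation first.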

Segmentation and Annotation

• Sequential Processing
  - At present complete annotation by 2 experienced labelers: L1: male, 31 years; L2: female, 26 years
  - 2 strategies intentionally followed:
    L1 provided with sequential order, manually segmented audio
    L2 provided with single instances in random order for verification

• Balanced Segmentation Interests
  - Syntax, pragmatics, stationarity of major emotion
  - Shorter segments preferred
  - Predominant non-linguistic vocalizations as boundaries
  - After segmentation: min. 24, max. 189, median 74, std. dev. 41 instances per speaker

Segmentation and Annotation

• Labelling per Instance
  - Speaker ID/gender, movie ID, attempt, running ID, begin/end time
  - Major and minor emotion attribute (16 options)
  - Mood (7 options: amusement, irritation, neutrality, embarrassment, positivity, stress, timidity; κ = 0.41)
  - 6 Dimensions: 3 states
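The per-instance label set maps naturally onto a small record type. The following is only a sketch of such a record: the field names, types, and the example values are hypothetical, and only the list of fields and the seven mood options come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict

# The seven mood options listed on the slide.
MOODS = {"amusement", "irritation", "neutrality", "embarrassment",
         "positivity", "stress", "timidity"}

@dataclass
class InstanceLabel:
    """One annotated instance; field names are illustrative assumptions."""
    speaker_id: int
    gender: str
    movie_id: int
    attempt: int
    running_id: int
    begin_s: float
    end_s: float
    major: str   # one of the 16 emotion attributes
    minor: str   # one of the 16 emotion attributes
    mood: str
    dimensions: Dict[str, int] = field(default_factory=dict)  # dimension -> state

    def __post_init__(self):
        if self.mood not in MOODS:
            raise ValueError(f"unknown mood: {self.mood}")
        if self.end_s <= self.begin_s:
            raise ValueError("end time must follow begin time")

    @property
    def duration_s(self):
        return self.end_s - self.begin_s

# Hypothetical example values.
example = InstanceLabel(speaker_id=37, gender="f", movie_id=3, attempt=2,
                        running_id=12, begin_s=4.20, end_s=6.25,
                        major="ENE", minor="NEU", mood="irritation",
                        dimensions={"valence": 0})
print(example.duration_s)
```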

Annotation

• Major and Minor
  - Frequencies per labeller

Annotation

• Major and Minor
  - Heat map of pairs
  - Potentially 256 combinations
  - 118 found in the set
  - Strong presence of blended emotions
  - Full agreement on major/minor: 105 combinations, 2 091 instances, i.e. half of the corpus
  - Blended emotions well identifiable
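The pair statistics follow from treating each label as an ordered (major, minor) tuple: 16 attributes give 16 × 16 = 256 possible pairs. The sketch below counts the pairs on which two labelers fully agree; the toy annotations and function name are hypothetical, while the label codes and the 2 091 / 3 992 figures come from the slides.

```python
from collections import Counter

N_ATTRIBUTES = 16
max_pairs = N_ATTRIBUTES ** 2  # ordered pairs, since (major, minor) != (minor, major)

def agreed_pairs(labels_l1, labels_l2):
    """Count (major, minor) pairs on which both labelers fully agree."""
    return Counter(p1 for p1, p2 in zip(labels_l1, labels_l2) if p1 == p2)

# Toy annotations by two labelers (hypothetical data).
l1 = [("ENE", "NEU"), ("ENE", "COL"), ("INQ", "NEU"), ("NEU", "ENE"), ("ENE", "NEU")]
l2 = [("ENE", "NEU"), ("ENE", "DEC"), ("INQ", "NEU"), ("ENE", "NEU"), ("ENE", "NEU")]

agreed = agreed_pairs(l1, l2)
print(len(agreed), sum(agreed.values()))  # distinct combinations, agreed instances

# Share of the corpus with full agreement, from the slide's figures.
agreement_share = 2091 / 3992
print(round(agreement_share, 3))  # 0.524
```

Note that the fourth toy instance is not counted as agreement: the two labelers chose the same attributes but in swapped major/minor order.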

Annotation

• Distribution of Dimensions
  - Typical imbalance in favor of negative valence

Annotation

• Agreement Dimensions
  - Monotonic increase from unweighted to quadratic kappa: label confusions preferably in neighboring classes
  - Apart from suddenness, good concurrence at κ ≥ 0.4
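The slide does not spell out the kappa variants. The sketch below assumes Cohen's kappa for two labelers with the usual unweighted, linear, and quadratic disagreement weights, and uses hypothetical toy labels with three ordinal states to show why kappa rises when confusions fall into neighboring classes.

```python
def weighted_kappa(a, b, n_states, weighting="none"):
    """Cohen's kappa for two label sequences with states 0..n_states-1.

    weighting: "none" (unweighted), "linear" or "quadratic".
    """
    if weighting not in ("none", "linear", "quadratic"):
        raise ValueError("unknown weighting")
    n = len(a)
    # Observed joint proportions of (labeler 1 state, labeler 2 state).
    observed = [[0.0] * n_states for _ in range(n_states)]
    for i, j in zip(a, b):
        observed[i][j] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_states)]
    col = [sum(observed[i][j] for i in range(n_states)) for j in range(n_states)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance.
        if weighting == "none":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (n_states - 1)
        return d if weighting == "linear" else d * d

    pairs = [(i, j) for i in range(n_states) for j in range(n_states)]
    obs = sum(weight(i, j) * observed[i][j] for i, j in pairs)
    exp = sum(weight(i, j) * row[i] * col[j] for i, j in pairs)
    return 1.0 - obs / exp

# Toy labels: all disagreements are between neighboring states.
l1 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2]
l2 = [0, 0, 1, 1, 1, 2, 2, 2, 1, 0, 1, 2]
for w in ("none", "linear", "quadratic"):
    print(w, round(weighted_kappa(l1, l2, 3, w), 3))
```

On this toy data the three values increase monotonically, because the linear and quadratic weights penalise a confusion between adjacent states less than a full disagreement.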

Recognition of Complex Emotions

Data Partitioning

• Train, Development, Test
  - Foster easy reproducibility of results
  - Proper definition of a development set
  - Straightforward three-fold partitioning by speaker index:
    Train (≈40% / 21 speakers: ID 1–21)
    Development (≈30% / 15 speakers: ID 22–36)
    Test (≈30% / 15 speakers: ID 37–51)
  - Strict speaker independence
  - ‘Genuine’ results w/o previous fine-tuning on the test partition
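Because the partitioning depends only on the speaker ID, it can be reproduced with a simple range lookup. The following is a minimal sketch; the function names and the (speaker_id, item) tuple layout are assumptions, while the ID ranges are those given on the slide.

```python
def partition_of(speaker_id):
    """Map a speaker ID (1..51) to its partition as defined on the slide."""
    if 1 <= speaker_id <= 21:
        return "train"
    if 22 <= speaker_id <= 36:
        return "development"
    if 37 <= speaker_id <= 51:
        return "test"
    raise ValueError(f"speaker ID out of range: {speaker_id}")

def split_by_speaker(instances):
    """Split (speaker_id, item) pairs into speaker-independent partitions."""
    parts = {"train": [], "development": [], "test": []}
    for speaker_id, item in instances:
        parts[partition_of(speaker_id)].append(item)
    return parts

# Number of speakers per partition.
speakers = {"train": 0, "development": 0, "test": 0}
for sid in range(1, 52):
    speakers[partition_of(sid)] += 1
print(speakers)  # {'train': 21, 'development': 15, 'test': 15}

# Hypothetical instances: (speaker_id, instance name).
parts = split_by_speaker([(1, "a"), (22, "b"), (51, "c"), (21, "d")])
print(parts)
```

Since every speaker belongs to exactly one partition, no speaker seen in training can appear in development or test.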

Acoustic Features

• openEAR
  - openSMILE’s “base” set: 988 features
  - Slight extension over the INTERSPEECH 2009 Emotion Challenge set
  - Systematic brute-forcing
  - 19 functionals of 26 low-level descriptors
  - SMA LP filtered
  - Plus regression coefficients
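Brute-forcing means applying every functional to every low-level descriptor contour. The sketch below illustrates the idea with a handful of functionals; it does not call openSMILE, the functional set and descriptor names are illustrative, and a plain first-order difference stands in for the regression coefficients. The closing arithmetic assumes that the regression coefficients double the 26 descriptors, which is one reading consistent with the total of 988.

```python
from statistics import mean, pstdev

def delta(contour):
    """First-order difference, a simple stand-in for regression coefficients."""
    return [b - a for a, b in zip(contour, contour[1:])]

# Illustrative subset of functionals (the slide's set has 19).
FUNCTIONALS = {
    "mean": mean,
    "stddev": pstdev,
    "min": min,
    "max": max,
    "range": lambda c: max(c) - min(c),
}

def brute_force_features(llds):
    """Apply every functional to every contour and to its delta contour."""
    features = {}
    for name, contour in llds.items():
        for suffix, series in (("", contour), ("_delta", delta(contour))):
            for fname, func in FUNCTIONALS.items():
                features[f"{name}{suffix}_{fname}"] = func(series)
    return features

# Hypothetical contours for two descriptors of one instance.
llds = {"energy": [0.1, 0.4, 0.3, 0.2], "f0": [110.0, 120.0, 115.0, 0.0]}
features = brute_force_features(llds)
print(len(features))  # 2 descriptors x 2 (with delta) x 5 functionals = 20

# Assumed reading of the slide's total: 26 descriptors, doubled, x 19 functionals.
print(26 * 2 * 19)    # 988
```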

Problem Complexity

• Upper Bounds
  - First major and minor emotions separately: max. 16 classes
  - Then complex compound: max. 256 classes (quadratic number as order matters)
  - Not all permutations occur
  - Dependencies among labels have to be assumed: scripted recording protocol and in general

Classification Strategy

• Alternatives
  - Best fuzzy architecture for multiple labels: e.g. multi-task neural networks?
  - Different weighting of major/minor emotion
  - Comparison with the N-best result list?

• Chosen Way
  - ‘Traditional’ Support Vector Machines
  - Polynomial kernel
  - Pair-wise multi-class discrimination
  - Sequential Minimal Optimization learning
  - Training up-sampled in case of high class imbalance
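Three ingredients of the chosen setup can be sketched without any SVM library: up-sampling the training set, enumerating the pair-wise problems, and evaluating a polynomial kernel. The exact up-sampling scheme and the kernel parameters are not given on the slide, so the sketch assumes random duplication of minority-class instances up to the size of the largest class and leaves degree and offset as explicit parameters; all names are illustrative.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations

def upsample(features, labels, seed=0):
    """Duplicate random minority-class instances until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    target = max(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

def pairwise_problems(classes):
    """All two-class problems of pair-wise discrimination: K(K-1)/2 pairs."""
    return list(combinations(sorted(classes), 2))

def polynomial_kernel(x, y, degree, coef0):
    """Polynomial kernel (x . y + coef0) ** degree; parameters are assumptions."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

# Hypothetical, imbalanced toy training set.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = ["ENE", "ENE", "ENE", "ENE", "AMU", "SAT"]
X_up, y_up = upsample(X, y)
print(Counter(y_up))

print(len(pairwise_problems(["AMU", "DEC", "ENE", "INQ", "SAT"])))  # 10
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], degree=2, coef0=1.0))  # 144.0
```

With five classes, pair-wise discrimination trains ten binary classifiers, and up-sampling is applied to the training partition only so that development and test distributions stay untouched.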

Three Examples

• ‘Fixed Minor’
  - ‘Conventional’ case
  - Minor emotion fixed as neutral
  - Major emotion varied
  - Full labeler agreement
  - 950 instances, 5 classes providing sufficient instances (major–minor, # instances):
    AMU–NEU (79), DEC–NEU (204), ENE–NEU (359), INQ–NEU (202), SAT–NEU (106)

Three Examples

• ‘Fixed Major’
  - Different blends of irritation
  - Major emotion fixed as irritation
  - Minor emotion varied
  - Full labeler agreement
  - 607 instances, again 5 classes providing sufficient instances:
    ENE–COL (186), ENE–DEC (110), ENE–INQ (66), ENE–IRO (51), ENE–NEU (184)

Three Examples

• ‘Fully Mixed’
  - Full labeler agreement
  - 533 instances, again 5 classes providing sufficient instances:
    INQ–NEU (114), STR–INQ (63), ENE–COL (186), ENE–DEC (110), JOI–SUR (60)
  - Examples in no stricter relation to each other
  - But: demonstrates that recognition is feasible even in the full major/minor mix

Three Examples

• Results
  - Weighted Average Recall (WAR, i.e. recognition rate)
  - Unweighted Average Recall (UAR, reflects imbalance among classes)
  - Area under the receiver operating characteristic curve (AUC)
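The difference between WAR and UAR is easiest to see on an imbalanced confusion matrix. The sketch below computes both, plus a two-class AUC through pair-wise score comparison; how AUC is aggregated over five classes is not stated on the slide, so only the binary case is shown. The function names and toy numbers are illustrative.

```python
def war(confusion):
    """Weighted average recall: correct instances over all instances."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def uar(confusion):
    """Unweighted average recall: mean of per-class recalls.

    Rows are true classes; every class needs at least one instance.
    """
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(recalls) / len(recalls)

def binary_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical two-class confusion matrix with a dominant first class.
confusion = [[90, 10],
             [5, 5]]
print(round(war(confusion), 3))  # 0.864
print(round(uar(confusion), 3))  # 0.7
print(round(binary_auc([0.9, 0.8, 0.4], [0.5, 0.3]), 3))  # 0.833
```

In this toy case WAR looks strong because the majority class dominates, while UAR exposes the weak recall on the minority class.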

Regression Baseline

• Results for Selected Dimensions
  - Ground truth by mean of labellers
  - All instances used
  - Cross correlation (CC), mean linear error (MLE)
  - Support Vector Regression
  - Prediction can be used as features for complex emotions
  - Highly imbalanced distribution
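The two regression measures can be computed directly from predictions and the averaged labeller ratings. The sketch below assumes that CC denotes the Pearson correlation coefficient and MLE the mean absolute difference between prediction and ground truth; the function names and toy ratings are hypothetical.

```python
from math import sqrt

def ground_truth(ratings_l1, ratings_l2):
    """Ground truth as the mean of the two labellers' ratings."""
    return [(a + b) / 2 for a, b in zip(ratings_l1, ratings_l2)]

def cross_correlation(pred, truth):
    """Pearson correlation coefficient between prediction and ground truth."""
    mp = sum(pred) / len(pred)
    mt = sum(truth) / len(truth)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in truth)
    return cov / sqrt(var_p * var_t)

def mean_linear_error(pred, truth):
    """Mean absolute difference between prediction and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

# Hypothetical ratings of one dimension on a -1/0/+1 scale.
l1 = [1, 0, -1, 0]
l2 = [1, 1, -1, -1]
truth = ground_truth(l1, l2)           # [1.0, 0.5, -1.0, -0.5]
pred = [0.5, 0.5, -0.5, -0.5]          # hypothetical regressor output

print(round(cross_correlation(pred, truth), 3))
print(mean_linear_error(pred, truth))
```

Reporting both measures is useful with imbalanced distributions: a regressor that always predicts the dominant state can reach a low MLE while its CC stays near zero.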

Conclusions

• Corpus for Complex Emotions
  - Comparatively large CINEMO corpus

• Baselines
  - First impressions on the challenge

• Future Directions…
  - Future large resources with recordings ‘in the wild’
  - Tailored classification architectures:
    Exploit the mutual information among major and minor emotions
    Complex ‘language models’ to reflect transition probabilities

Thank you.

This work was partly funded by the ANR project Affective Avatar.
