




Bilingual Acoustic Feature Selection for Emotion Estimation Using a 3D Continuous Model

Humberto Pérez Espinosa, Carlos A. Reyes García, Luis Villaseñor Pineda

Instituto Nacional de Astrofísica, Óptica y Electrónica

Luis E. Erro 1, Tonantzintla, Puebla, 72840, México

[email protected], [email protected], [email protected]

Abstract— Emotions are complex human phenomena. Numerous researchers have attempted a variety of approaches to model these phenomena and to find the optimal set of emotion descriptors. In this paper, we search for the most appropriate acoustic features to estimate the emotional content in speech based on a continuous multi-dimensional model of emotions. We analyze a set of 6,920 features using the feature selection method known as Linear Forward Selection. We study the importance of the features by dividing them into groups and working with two databases, one in English and one in German, in order to analyze the multi-lingual importance of the features and to determine whether they are important regardless of the language.

I. INTRODUCTION

Some of the first questions that arise when engaging in Speech Emotion Recognition are the following: What evidence is there that people's emotional states are reflected in their voices? Are emotions reflected in a similar way in all people? Is the way we express emotions in speech dependent on social and cultural factors? Is it possible to obtain objective measures of emotions? Philosophers, anthropologists, psychologists, biologists and, more recently, computer scientists have attempted to answer these questions. For example, Charles Darwin established that emotions are patterns related to survival; these patterns have evolved to solve certain problems that species faced during their evolution [1]. In this view, emotions are more or less the same in all humans and, in particular, independent of culture. However, there is a strong debate on this issue between psychologists who claim that emotions are universal and psychologists who claim that emotions are culture-dependent [2]. Both groups of scientists have provided evidence of differences and similarities in the way different cultures express emotions. In fact, some authors have gone as far as defining universal facial emotional expressions: the psychologist Paul Ekman [3] established six facial expressions related to basic emotions, known as The Big Six. Moreover, Izard [2] provides evidence that, for certain cultures, recognizing emotions in the facial expressions of people from other cultures is more difficult than recognizing them in people from their own culture. Picard established that expressive patterns depend on gender, context, and social and cultural expectations: given that a particular emotion is felt, a variety of factors influence how that emotion is displayed [4].

Psychologists have proposed models that explain the generation, composition and classification of emotions. The field of Automatic Emotion Recognition has taken up these models and, based on them, has used two annotation approaches to capture and describe the emotional content in speech: the discrete and the continuous approach. The discrete approach is based on the concept of Basic Emotions, such as anger, fear and happiness, which are the most intense forms of emotion and from which all other emotions are generated by variation or combination. It assumes the existence of universal emotions that can be clearly distinguished from each other by most people.

On the other hand, the continuous approach represents emotional states using a continuous multidimensional space: emotions are represented by regions in an n-dimensional space where each dimension represents an Emotion Primitive, a subjective property shown by all emotions. Several authors have analyzed the most important acoustic features from the point of view of discrete categorization, working on mono-lingual [5], [6], [7] and multi-lingual [8] data. However, the importance of acoustic attributes from the point of view of continuous models has not yet been studied with the same depth. In this work we are interested in whether there are acoustic features that allow us to estimate the emotional state from a person's voice no matter what language he or she speaks. We also discuss the importance of these features, the amount of information they provide, and which ones are most important for each language. To accomplish this, we work with two databases of emotional speech, one in English and one in German. We extract a variety of acoustic features and apply feature selection techniques to find the best feature subsets in mono-lingual and bilingual mode. Finally, we discuss the feature groups separately, using metrics that give us a hint of the importance of each group.

II. THREE-DIMENSIONAL CONTINUOUS MODEL

The three-dimensional continuous model that we adopt in this work represents emotional states using the Emotion Primitives Valence, Activation and Dominance [9]. The continuous approach exhibits great potential to model the occurrence of emotions in the real world. In a realistic scenario, emotions are not generated in a prototypical or pure form; they are often complex emotional states, mixtures of emotions with varying degrees of intensity or expressiveness. This approach allows a more flexible interpretation of the acoustic properties of emotional speech. Given that it is not tied to a fixed set of prototypical emotions, it is capable of representing any emotional state. By adopting this approach we avoid some of the limitations imposed by discrete models, for example the need to find, define and name the categorical emotion labels required to represent a sufficiently broad spectrum of emotional states for a specific application. Some authors have begun to investigate how to take advantage of this theory [10], [11] to estimate emotional expressions more adequately. The three primitives are defined as follows:

• Valence describes how negative or positive a specific emotion is.

• Activation describes the internal excitement of an individual and ranges from very calm to very active.

• Dominance describes the degree of control that the individual intends to take over the situation, or in other words, how strong or weak the individual seems to be.
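To make this representation concrete, the sketch below (our own illustration, not code from the paper) stores an emotional state as a point in the Valence-Activation-Dominance space on the 1-to-5 scale used for the corpora in Section III; the two example annotations are hypothetical.

    from dataclasses import dataclass
    import math

    @dataclass
    class VADPoint:
        """An emotional state as a point in the 3D primitive space (scale 1-5)."""
        valence: float     # negative (1) ... positive (5)
        activation: float  # very calm (1) ... very active (5)
        dominance: float   # weak / submissive (1) ... strong / in control (5)

        def distance(self, other: "VADPoint") -> float:
            """Euclidean distance between two emotional states."""
            return math.sqrt((self.valence - other.valence) ** 2
                             + (self.activation - other.activation) ** 2
                             + (self.dominance - other.dominance) ** 2)

    # Hypothetical annotations: a negative, highly active, dominant state
    # (anger-like) and a negative, passive, submissive state (sadness-like).
    angry_like = VADPoint(valence=1.8, activation=4.5, dominance=4.2)
    sad_like = VADPoint(valence=2.0, activation=1.7, dominance=1.9)
    print(angry_like.distance(sad_like))  # similar valence, yet far apart in the space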

III. EMOTIONAL SPEECH DATA

To compare the multi-lingual performance of acoustic features from the point of view of continuous emotion models, we need at least two databases in different languages labeled with the same Emotion Primitives. The databases we used are IEMOCAP and VAM. The former was collected at the Signal Analysis and Interpretation Laboratory at the University of Southern California and is in English [12]. It was recorded from ten actors in male/female pairs. Its annotation includes both approaches: Basic Emotions and the Emotion Primitives Valence, Activation and Dominance. Attention was given to spontaneous interaction, in spite of using actors, and the database shows a significant diversity of emotional states. The IEMOCAP data set contains 1,820 instances. We selected the instances from the four classes with the most examples (Anger, Happiness, Sadness and Neutral). Furthermore, we removed all instances that have the same annotation for each of the three primitives but a different annotation for the Basic Emotion, since we regard them as contradictory instances that add noise to our learning process. After this filtering, we worked with a set of 401 instances in the feature selection process.

feature selection process. The second corpus we used for

this work is called VAM Corpus and is described in [13]. It

was collected by Michael Grimm and the Emotion Research

Group at the Institut fr Nachrichtentechnik of the Universitt

Karlsruhe (TH), Karlsruhe, Germany and it is in German.

This corpus consists of 12 hours of audio and video record-

ings of the German Talk Show ”Vera am Mittag”. It was

labeled with the emotional primitives: Valence, Activation

and Dominance. To label the corpus, 17 human evaluators

were employed. It contains 947 utterances belonging to 47

speakers (11 m / 36 f) with an average duration of 3.0

seconds per utterance. In the case of VAM corpus, there

are no annotations of basic emotions, so we were not able

to detect contradictory instances. We used the 947 instances

available. Annotated values were normalized to continuous

values between 1 and 5. Originally, primitives range from

-1 to 1 in VAM corpus meanwhile in IEMOCAP they range

from 1 to 5.
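Since the two corpora use different annotation ranges, the VAM values in [-1, 1] have to be mapped onto the [1, 5] scale before the data sets can be combined. A minimal sketch, assuming the mapping is a simple linear rescaling (the paper does not detail the formula):

    import numpy as np

    def rescale(x, old_min=-1.0, old_max=1.0, new_min=1.0, new_max=5.0):
        """Linearly map annotations from [old_min, old_max] to [new_min, new_max]."""
        x = np.asarray(x, dtype=float)
        return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)

    print(rescale([-1.0, 0.0, 0.25, 1.0]))  # maps -1 -> 1, 0 -> 3, 0.25 -> 3.5, 1 -> 5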

IV. FEATURE EXTRACTION

We evaluated two sets of features. One of them was obtained through a selective approach, i.e., by taking into account features we think could be useful, features that have been successful in related work, and features used for other similar tasks. The second feature set was obtained by applying a brute-force approach, i.e., by generating a large number of features in the hope that some will prove useful. The Selective Feature Set is a set of features that we have been building throughout our research [14] and was extracted with the software Praat [15]. We propose three sets of features: Prosodic, Spectral and Voice Quality, and we subdivided the Prosodic features into Elocution Times, Melodic Contour and Energy Contour. We designed this set of features to represent several aspects of the voice. We included the traditional attributes associated with prosody, e.g., Duration, Pitch and Energy, as well as others that have shown good results in tasks such as speech recognition, speaker recognition, infant cry classification [16], language recognition [17], and pathological voice detection [18], [19].
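As an illustration of the kind of utterance-level prosodic descriptors in the selective set, the sketch below computes a few pitch, energy and duration statistics with librosa. The actual Selective Feature Set was extracted with Praat, so this is only an approximate, simplified analogue, and all names are our own.

    import numpy as np
    import librosa

    def prosodic_features(wav_path):
        """Toy utterance-level prosodic descriptors (pitch, energy, duration)."""
        y, sr = librosa.load(wav_path, sr=16000)
        # melodic contour: F0 track (unvoiced frames come back as NaN)
        f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                          fmax=librosa.note_to_hz("C7"), sr=sr)
        f0 = f0[~np.isnan(f0)]
        # energy contour: frame-wise RMS energy
        rms = librosa.feature.rms(y=y)[0]
        return {
            "duration_s": librosa.get_duration(y=y, sr=sr),
            "voiced_ratio": float(np.mean(voiced_flag)),
            "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
            "f0_range": float(np.ptp(f0)) if f0.size else 0.0,
            "energy_mean": float(np.mean(rms)),
            "energy_std": float(np.std(rms)),
        }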

TABLE I
Acoustic Feature Groups

Category       Group          Approach     Feature Type               # Feats
Prosodic       Times          Selective    Elocution Times                  8
Prosodic       F0             Selective    Melodic Contour                  9
Prosodic       Energy         Selective    Energy Contour                  12
Prosodic       Energy         Brute Force  LOG Energy                     117
Prosodic       Times          Brute Force  Zero Crossing Rate             117
Prosodic       PoV            Brute Force  Probability of Voicing         117
Prosodic       F0             Brute Force  F0 Contour                     234
Voice Quality  Voice Quality  Selective    Quality Descriptors             24
Voice Quality  Voice Quality  Selective    Articulation                    12
Spectral       LPC            Selective    Fast Fourier Transform           4
Spectral       LPC            Selective    Long Term Average                5
Spectral       LPC            Selective    Wavelets                         6
Spectral       MFCC           Selective    MFCC                            96
Spectral       Cochleagrams   Selective    Cochleagram                     96
Spectral       LPC            Selective    LPC                             96
Spectral       MFCC           Brute Force  MFCC                         1,521
Spectral       MEL            Brute Force  MEL Spectrum                 3,042
Spectral       SEB            Brute Force  Spectral Energy in Bands       469
Spectral       SROP           Brute Force  Spectral Roll Off Point        468
Spectral       SFlux          Brute Force  Spectral Flux                  117
Spectral       SC             Brute Force  Spectral Centroid              117
Spectral       SMM            Brute Force  Spectral Max and Min           233
Total                                                                   6,920

The brute-force feature set was extracted using the software openEAR [20]. We extracted a total of 6,552 features, including first-order functionals of low-level descriptors (LLDs) such as FFT spectrum, Mel spectrum, MFCC, pitch (fundamental frequency F0 via ACF), energy, spectral descriptors and LSP, together with their deltas and double deltas. Thirty-nine functionals, such as extremes, regression coefficients, moments, percentiles, crossings, peaks and means, were applied. Table I shows the number of acoustic features included in each feature group.
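The brute-force set follows the usual LLD-plus-functionals recipe. The fragment below is a much reduced sketch of that idea, using librosa MFCCs, their deltas and a handful of simple functionals; it is not a reproduction of the openEAR configuration that produced the 6,552 features.

    import numpy as np
    import librosa

    FUNCTIONALS = {
        "mean": np.mean,
        "std": np.std,
        "min": np.min,
        "max": np.max,
        "p25": lambda v: np.percentile(v, 25),
        "p75": lambda v: np.percentile(v, 75),
    }

    def brute_force_features(wav_path, n_mfcc=13):
        """Apply simple statistical functionals to frame-level LLD contours."""
        y, sr = librosa.load(wav_path, sr=16000)
        llds = {"mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)}
        llds["mfcc_delta"] = librosa.feature.delta(llds["mfcc"])            # deltas
        llds["mfcc_delta2"] = librosa.feature.delta(llds["mfcc"], order=2)  # double deltas
        feats = {}
        for lld_name, contour in llds.items():       # contour: (n_coeffs, n_frames)
            for i, row in enumerate(contour):
                for f_name, f in FUNCTIONALS.items():
                    feats[f"{lld_name}[{i}]_{f_name}"] = float(f(row))
        return feats  # 13 coefficients x 3 contours x 6 functionals = 234 features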



V. FEATURE SELECTION

We needed to devise a way to select the best features from a large number of them, bearing in mind that we have relatively few instances. With a limited amount of data, an excessive number of attributes significantly delays the learning process and often results in over-fitting. Figure 1 shows the process we followed for selecting the best subsets of attributes. For mono-lingual selection we started from an Initial Feature Set obtained from a feature selection process applied to the VAM corpus in a previous work [14]; this initial feature set achieved good correlation results in the estimation of Emotion Primitives in VAM, and that selection process was carried out with 252 features obtained by the selective approach and 949 instances. Next, we conducted the instance selection process explained in Section III (only for IEMOCAP). Finally, we applied the feature selection method known as Linear Floating Forward Selection (LFFS) [21], which performs a hill-climbing search: starting from the Initial Feature Set, it evaluates all possible inclusions of a single attribute from the extended feature set into the solution subset. At each step the attribute with the best evaluation is added, and the search ends when no inclusion improves the evaluation. In addition, LFFS dynamically changes the number of features included or eliminated at each step. In our experiments we use the LFFS mode called Fixed Width, in which the search is not performed on the whole feature set: initially the k best features are selected and the rest are removed, and in each iteration the features added to the solution set are replaced by features taken from those that had been eliminated. This feature selection process is repeated for each Emotion Primitive.

Fig. 1. Feature Selection Scheme

In the case of bilingual feature selection, the union of the two feature sets obtained from the mono-lingual selections was taken as the initial feature set. A simplified sketch of the greedy forward search is given below.
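The sketch adds the single best attribute at each step, using 10-fold cross-validated correlation as the evaluation function. It is only a reduced stand-in for the LFFS Fixed Width procedure described above: the floating and width-limiting steps are omitted, scikit-learn's SVR stands in for SMOreg, and all names are our own.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import SVR

    def cv_correlation(X, y, cols):
        """10-fold cross-validated Pearson correlation for a candidate feature subset."""
        pred = cross_val_predict(SVR(kernel="linear"), X[:, cols], y, cv=10)
        return pearsonr(y, pred)[0]

    def forward_selection(X, y, initial=()):
        """Greedy forward search starting from an initial feature subset."""
        selected = list(initial)
        best = cv_correlation(X, y, selected) if selected else -np.inf
        while True:
            candidates = [c for c in range(X.shape[1]) if c not in selected]
            if not candidates:
                return selected, best
            score, col = max((cv_correlation(X, y, selected + [c]), c) for c in candidates)
            if score <= best:      # stop when no inclusion improves the evaluation
                return selected, best
            selected.append(col)
            best = score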

VI. FEATURE SELECTION RESULTS

All the results of the learning experiments were obtained using Support Vector Machines for Regression (SMOreg) and validated by 10-fold cross-validation. The metrics used to measure the importance of the feature groups are Pearson's Correlation Coefficient, Share and Portion. The Correlation Coefficient is the most common parameter for measuring the performance of machine learning algorithms on regression tasks, as in our case. Share and Portion are measures proposed in [7] to assess the impact of different types of features on the performance of automatic recognition of discrete categorical emotions.

Correlation Coefficient: indicates the strength and direction of the linear relationship between the annotated primitives and the primitives estimated by the trained model. It is our main metric for measuring the estimation results.

Share: shows the contribution of a feature type to the selected set. For example, if 28 features are selected from the group Duration and 150 features are selected in total:

Share = (28 x 100) / 150 = 18.7

Portion: shows the contribution of a feature type weighted by the number of features of that type. For example, if 28 Duration features are selected from a Duration set of 391 features:

Portion = (28 x 100) / 391 = 7.2
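For concreteness, both measures can be computed directly from the selection counts; a minimal sketch reproducing the numbers in the example above:

    def share(selected_in_group, selected_total):
        """Percentage of the final feature set contributed by one group."""
        return 100.0 * selected_in_group / selected_total

    def portion(selected_in_group, group_size):
        """Percentage of a group's own features that ended up selected."""
        return 100.0 * selected_in_group / group_size

    print(round(share(28, 150), 1))    # 18.7
    print(round(portion(28, 391), 1))  # 7.2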

Having identified the best acoustic feature sets, we constructed individual regressors to estimate each Emotion Primitive. Tables II, III and IV reflect the effectiveness of each feature group and the differences among languages. The groups PoV and SC are not shown in these tables because none of the features belonging to those groups were selected in the experiments. It is important to note that the Correlation, Share and Portion shown in these tables were obtained by decomposing the solution set found by the selection process into feature groups and evaluating each group separately with these metrics; for some groups the feature selection process did not include any element in the solution set. The last row shows the total number of features selected for the Emotion Primitive. Results in these tables are presented in the format: Results for English / Results for German / Results for Bilingual. The mono-lingual results for English and German were obtained by training SMOreg regressors with the features selected for each language and each primitive separately, and the evaluation was done by 10-fold cross-validation. The bilingual result was obtained by building regressors for each primitive using the instances of both languages and the feature set obtained from the bilingual feature selection. A sketch of this evaluation procedure is given below.
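The sketch assumes a feature matrix X restricted to the selected attributes and a vector y with one annotated primitive; scikit-learn's SVR stands in for Weka's SMOreg, and the variable names (X_en, X_de, y_en, y_de) are hypothetical.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    def evaluate_primitive(X, y, cv=10):
        """10-fold cross-validated Pearson correlation between annotated and
        estimated primitive values."""
        model = make_pipeline(StandardScaler(), SVR(kernel="linear"))
        predictions = cross_val_predict(model, X, y, cv=cv)
        r, _ = pearsonr(y, predictions)
        return r

    # mono-lingual: X_en restricted to the features selected for English;
    # bilingual: stack the English and German instances with the bilingual feature set.
    # r_en = evaluate_primitive(X_en, y_en)
    # r_bi = evaluate_primitive(np.vstack([X_en, X_de]), np.concatenate([y_en, y_de]))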

A. Mono-Lingual Feature Selection

For Valence, we obtained better results in English, reaching a correlation index of 0.61. For English, MEL had the best performance (0.51); for German, three groups reached 0.34: MEL, F0 and LPC. We can see that while MEL, LPC and MFCC are important for both languages, F0 and SEB were important for German but not for English. A striking aspect is that the MEL group, with a very low Portion (0.24 / 0.06), i.e., with very few selected features (7 / 2) with respect to the total number of features in the group (3,042), achieved the best correlation of all groups for both languages. For Valence, it is difficult to infer intuitively which prosodic features are related to positive and negative emotions. For example, we may think of negative emotions with opposite Energy values, such as anger, whose Energy is high, and sadness, whose Energy is low, or of positive emotions with opposite values of Times, such as excitement, whose Times are quick, and relaxation, whose Times are slow.



TABLE II
English / German / Bilingual Feature Selection Results - Valence

Feature Group   Total  Selected       Correlation           Share                    Portion
Voice Quality      36  6 / 3 / 9      0.29 / 0.08 / 0.26     9.67 /  4.76 /  5.92    16.66 /  8.33 / 25.00
Times             125  1 / 1 / 6      0.33 / 0.05 / 0.21     1.61 /  1.58 /  3.95     0.80 /  0.80 /  4.80
Cochleagrams       96  8 / 0 / 10     0.40 /  -   / 0.33    12.90 /  0.00 /  6.58     8.33 /  0.00 / 10.42
LPC               111  13 / 8 / 14    0.34 / 0.34 / 0.43    20.96 / 12.69 /  9.21    11.71 /  7.20 / 12.61
sFlux             117  0 / 1 / 8       -   / 0.30 / 0.39     0.00 /  1.58 /  5.26     0.00 /  0.85 /  6.84
Energy            129  0 / 0 / 6       -   /  -   / 0.12     0.00 /  0.00 /  3.95     0.00 /  0.00 /  4.65
F0                243  0 / 4 / 6       -   / 0.34 / 0.30     0.00 /  6.35 /  3.95     0.00 /  1.64 /  2.47
SpecMaxMin        233  0 / 0 / 0       -   /  -   /  -       0.00 /  0.00 /  0.00     0.00 /  0.00 /  0.00
SEB               469  0 / 5 / 10      -   / 0.33 / 0.29     0.00 /  7.93 /  6.58     0.00 /  1.07 /  2.13
SROP              468  0 / 0 / 6       -   /  -   / 0.20     0.00 /  0.00 /  3.95     0.00 /  0.00 /  1.28
MFCC            1,617  27 / 39 / 51   0.48 / 0.29 / 0.39    43.54 / 61.90 / 33.55     1.67 /  2.41 /  3.15
MEL             3,042  7 / 2 / 26     0.51 / 0.34 / 0.41    11.29 /  3.17 / 17.10     0.24 /  0.06 /  0.85
All             6,920  62 / 63 / 152  0.61 / 0.49 / 0.53   100    / 100   / 100       0.89 /  0.91 /  2.20

TABLE III
English / German / Bilingual Feature Selection Results - Activation

Feature Group   Total  Selected       Correlation           Share                    Portion
Voice Quality      36  3 / 4 / 6      0.08 / 0.33 / 0.31     5.00 /  8.16 /  5.45     8.33 / 11.11 / 16.67
Times             125  1 / 0 / 1      0.16 /  -   / 0.03     1.66 /  0.00 /  0.91     0.80 /  0.00 /  0.80
Cochleagrams       96  24 / 7 / 32    0.77 / 0.71 / 0.75    40.00 / 14.28 / 29.09    25.00 /  7.29 / 33.33
LPC               111  4 / 3 / 6      0.61 / 0.72 / 0.65     6.66 /  6.12 /  5.45     3.60 /  2.70 /  5.41
sFlux             117  0 / 3 / 4       -   / 0.76 / 0.57     0.00 /  6.12 /  3.64     0.00 /  2.56 /  3.42
Energy            129  4 / 4 / 7      0.76 / 0.59 / 0.61     6.66 /  8.16 /  6.36     3.10 /  3.10 /  5.43
F0                243  1 / 4 / 4      0.29 / 0.54 / 0.44     1.66 /  8.16 /  3.64     0.41 /  1.64 /  1.65
SpecMaxMin        234  0 / 0 / 0       -   /  -   /  -       0.00 /  0.00 /  0.00     0.00 /  0.00 /  0.00
SEB               234  0 / 1 / 1       -   / 0.65 / 0.31     0.00 /  2.04 /  0.91     0.00 /  0.21 /  0.21
SROP              468  0 / 0 / 0       -   /  -   /  -       0.00 /  0.00 /  0.00     0.00 /  0.00 /  0.00
MFCC            1,617  23 / 22 / 42   0.77 / 0.77 / 0.75    38.33 / 44.89 / 38.18     1.42 /  1.36 /  2.59
MEL             3,042  0 / 1 / 7       -   / 0.65 / 0.59     0.00 /  2.04 /  6.36     0.00 /  0.32 /  0.23
All             6,920  60 / 49 / 110  0.79 / 0.82 / 0.81   100    / 100   / 100       0.86 /  0.70 /  1.58

TABLE IV
English / German / Bilingual Feature Selection Results - Dominance

Feature Group   Total  Selected       Correlation           Share                    Portion
Voice Quality      36  0 / 4 / 4       -   / 0.29 / 0.26     0.00 /  6.67 /  4.25     0.00 / 11.11 / 11.11
Times             125  2 / 0 / 2      0.35 /  -   / 0.07     6.45 /  0.00 /  2.13     1.60 /  0.00 /  1.60
Cochleagrams       96  7 / 12 / 16    0.70 / 0.75 / 0.69    22.58 / 20.00 / 17.02     7.29 / 12.50 / 16.67
LPC               111  1 / 2 / 3      0.66 / 0.60 / 0.64     3.22 /  3.33 /  3.19     0.90 /  1.80 /  2.70
sFlux             117  1 / 2 / 1      0.61 / 0.73 / 0.53     3.22 /  3.33 /  1.06     0.85 /  1.70 /  0.85
Energy            129  2 / 5 / 12     0.15 / 0.65 / 0.67     6.45 /  8.33 / 12.76     1.15 /  3.87 /  9.30
F0                243  3 / 4 / 7      0.22 / 0.52 / 0.40     9.67 /  6.66 /  7.45     1.23 /  1.64 /  2.88
SpecMaxMin        234  0 / 0 / 0       -   /  -   /  -       0.00 /  0.00 /  0.00     0.00 /  0.00 /  0.00
SEB               234  0 / 0 / 0       -   /  -   /  -       0.00 /  0.00 /  0.00     0.00 /  0.00 /  0.00
SROP              468  0 / 0 / 0       -   /  -   /  -       0.00 /  0.00 /  0.00     0.00 /  0.00 /  0.00
MFCC            1,617  11 / 30 / 44   0.71 / 0.74 / 0.70    35.48 / 50.00 / 46.81     0.68 /  1.86 /  2.72
MEL             3,042  4 / 1 / 5      0.68 / 0.63 / 0.56    12.90 /  1.67 /  5.32     0.13 /  0.03 /  0.16
All             6,920  31 / 60 / 94   0.74 / 0.81 / 0.74   100    / 100   / 100       0.44 /  0.86 /  1.36

On the other hand, for Activation it is reasonable to think that the faster and louder we speak, the more active we are perceived to be, and that the slower and softer we speak, the more passive we are perceived to be [22]. Therefore, intuitively, groups of features that model prosodic aspects such as Energy and Times should be more important for estimating Activation. In the experiments we could confirm that Energy was indeed important in both languages (0.76 / 0.59), but Times did not provide valuable information for estimating this primitive. This makes us doubt that the Times features used here adequately reflect the speed-related phenomena for the languages we are working with. For Activation the best group was MFCC (0.77 / 0.77), with a high percentage of the total features of the final set (38.33 / 44.89); this group by itself achieves a performance similar to that of the full solution set (0.79 / 0.82). Other important groups for Activation are Cochleagrams, which obtained good results for both languages (0.77 / 0.71), LPC (0.61 / 0.72) and, as mentioned previously, Energy (0.76 / 0.59). This primitive did not show significant differences when estimated in the two languages (0.79 / 0.82). As for Activation, the best groups for Dominance were MFCC and Cochleagrams. This primitive shows many similarities between the two languages; almost all groups agreed on their degree of importance in both languages. The group with the least agreement is Energy, which intuitively might be a good indicator that the person tries to control a situation: for English, Energy proved to be a poor indicator, but it was a good one for German (0.15 / 0.65). Other important groups for this primitive are LPC (0.66 / 0.60), sFlux (0.61 / 0.73) and MEL (0.68 / 0.63). In German, almost twice as many features were selected as in English, while the proportion of selected features from each group, reflected in Share and Portion, was retained.

B. Bilingual Feature Selection

In the case of Valence, the best correlation was obtained with LPC and MEL. When we estimate bilingual Valence, the correlation (0.53) lies at an intermediate point between mono-lingual English (0.61) and mono-lingual German (0.49). Bilingual selection added attributes that had not been considered in the mono-lingual selections, such as Energy and SROP. For Activation, the best group correlations were obtained by MFCC, Cochleagrams, LPC and Energy, similarly to what happened for English. The correlation when estimating Activation on the bilingual set was good (0.81) compared with the results obtained by estimating Activation in the two languages separately (0.79 / 0.82); indeed, the correlation when estimating both languages together is better than the correlation when estimating English alone. For Dominance, the best groups were MFCC, Cochleagrams and Energy, which agrees with the mono-lingual selection. The correlation when estimating bilingual Dominance was the same as for English (0.74) and worse than for German (0.81).

VII. CROSS-LINGUAL PERFORMANCE

A. Experiment 1

We used the selected features in a cross-lingual mode; that is, we used the features found for English to estimate the Emotion Primitives in German and vice versa. To evaluate this experiment we performed a 10-fold cross-validation using only data from one language at a time. The idea of this scenario is to analyze whether the features that are good for a particular language are also good for the other. For example, the correlation for Valence in German using the features found for German is 0.49, while using those found for English it decreases considerably, to 0.33. For Activation it decreases from 0.82 to 0.78, a much smaller drop. In all cases the correlation in the estimation of the primitives decreases, although in some cases the difference is much larger. Table V shows the results.

B. Experiment 2

We used the features found in the bilingual selection. To evaluate them, we trained with the instances of one language and tested with the instances of the other language.

TABLE V
Experiment 1: Correlation index obtained in mono-lingual emotion primitive estimation with cross-lingual feature selection

Emotion Primitive   English   German
Valence              0.5163   0.3372
Activation           0.7678   0.7867
Dominance            0.7080   0.7779

The idea in this scenario is to examine whether the patterns learned to estimate the Emotion Primitives in one language can be used to estimate the Emotion Primitives in another language. In this case, the accuracy of the estimations decreased considerably, showing that this task is difficult, especially when using the patterns learned in English to estimate Emotion Primitives in German; in the opposite direction the deterioration is not as strong. Table VI shows the results, and a sketch of this cross-corpus evaluation is given below.

TABLE VI
Experiment 2: Train/Test correlation index obtained in cross-lingual primitive estimation with bilingual feature selection

Emotion Primitive   German/English   English/German
Valence                     0.2676           0.0146
Activation                  0.6183           0.5470
Dominance                   0.6927           0.1156
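The cross-corpus setting simply swaps the roles of the two corpora for training and testing. The sketch below uses the same assumptions as the earlier evaluation sketch (SVR in place of SMOreg, hypothetical X_en, X_de, y_en, y_de matrices restricted to the bilingual feature set).

    from scipy.stats import pearsonr
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    def cross_lingual_correlation(X_train, y_train, X_test, y_test):
        """Train on one language, test on the other, report Pearson correlation."""
        model = make_pipeline(StandardScaler(), SVR(kernel="linear"))
        model.fit(X_train, y_train)
        return pearsonr(y_test, model.predict(X_test))[0]

    # German -> English and English -> German, one call per Emotion Primitive:
    # r_de_en = cross_lingual_correlation(X_de, y_de, X_en, y_en)
    # r_en_de = cross_lingual_correlation(X_en, y_en, X_de, y_de)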

C. Experiment 3

We again used the features found in the bilingual selection. To evaluate them, we performed a 10-fold cross-validation using data from only one language at a time. The idea here is to check whether the features found in the bilingual selection are in some way complementary when estimating samples of one language at a time. We can see that this was not the case: apparently, there are features that help in one language but somehow hurt the estimation in the other, given that the correlation decreases. For example, for Activation this scenario yields (0.75 / 0.80), while with mono-lingual selection we obtained (0.79 / 0.82). Table VII shows the results.

TABLE VII
Experiment 3: Correlation index obtained in mono-lingual emotion primitive estimation with bilingual feature selection

Emotion Primitive   English   German
Valence              0.5992   0.3168
Activation           0.7557   0.8048
Dominance            0.7255   0.7861

VIII. CONCLUSIONS

We performed a study on the importance of different acoustic features for determining the emotional state of individuals in two languages, English and German. The study is based on a continuous three-dimensional model of emotions, which characterizes the emotional state by estimating the level of the emotion primitives Valence, Activation and Dominance; we analyzed each Emotion Primitive separately. Through the identification of the best features, the automatic estimation of Emotion Primitives becomes more accurate, and thus the recognition and classification of people's emotional states should improve. We have taken some ideas used in the study of the impact of features on the classification of discrete emotional categories, namely the Share and Portion metrics, and applied them to the continuous approach. We divided our set of 6,920 features into 12 groups according to their acoustic properties and calculated metrics for each group in order to estimate its performance in the automatic estimation of Emotion Primitives (Correlation Coefficient) and its contribution to the final set of features (Share and Portion). We worked with two databases of emotional speech. We found that the Spectral feature groups are very important for the three primitives. According to our results, the most important feature groups across languages are:

• Valence: LPC - MEL - MFCC - sFlux

• Activation: MFCC - Cochleagrams - LPC - Energy

• Dominance: MFCC - Cochleagrams - Energy - LPC

Clearly, the MFCC and LPC groups are very important for estimating the three Emotion Primitives, since they appear among the most important groups for all of them. Other spectral groups such as sFlux, Cochleagrams and MEL were also important; from this we conclude that spectral analysis is more important than prosodic and voice quality analysis for Emotion Primitive estimation. The prosodic feature group Energy was also very important for Activation and Dominance. In the experiments performed here, the features selected for a particular language had an acceptable performance in the other language; the correlation was better when we used features selected for only one language, but the difference was not very notable. On the other hand, when we crossed the trained models to estimate Emotion Primitives, the correlation decreased dramatically, mainly for Dominance and Valence. From these two facts we conclude that emotional states can be estimated using a similar set of acoustic features, but that the patterns shown by these features are difficult to transfer from one language to another without any adaptation. When we learned from both languages, we obtained a correlation lower than the highest mono-lingual correlation and higher than or equal to the lowest mono-lingual correlation, so we conclude that it is possible to identify common patterns in both languages using a feature set that works for both. The databases also differ in other aspects, and differences attributed to language may be magnified by other factors. One aspect we could examine, for example, is the performance of acoustic features on acted emotions recorded in a controlled environment (IEMOCAP) versus spontaneous emotions recorded in an uncontrolled environment (VAM). It is not clear that results are better or worse for either of the two languages: one would expect results to be better for the acted database, yet for Activation and Dominance the results are better for VAM. Another factor that may have an effect is that the number of IEMOCAP instances used is less than half the number used for VAM.

REFERENCES

[1] C. Darwin, The Expression of the Emotions in Man and Animals, 3rd ed., Oxford University Press, Oxford, 1998 (original work published 1872).
[2] H. A. Elfenbein, M. K. Mandal, N. Ambady, S. Harizuka, and S. Kumar, "Cross-cultural patterns in emotion recognition: Highlighting design and analytical techniques," Emotion, vol. 2, no. 1, pp. 75–84, 2002.
[3] P. Ekman, "Universals and cultural differences in facial expressions of emotion," in J. R. Cole, Ed., Nebraska Symposium on Motivation, pp. 207–283, 2002.
[4] R. W. Picard, Affective Computing, 1st ed., The MIT Press, 2000.
[5] B. Xie, L. Chen, G.-C. Chen, and C. Chen, Statistical Feature Selection for Mandarin Speech Emotion Recognition, Springer Berlin/Heidelberg, 2005.
[6] M. Lugger and B. Yang, "An incremental analysis of different feature groups in speaker independent emotion recognition."
[7] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir, "Whodunnit - searching for the most important feature types signalling emotion-related user states in speech," Comput. Speech Lang., vol. 25, no. 1, pp. 4–28, 2010.
[8] T. Polzehl, A. Schmitt, and F. Metze, "Approaching multilingual emotion recognition from speech - on language dependency of acoustic/prosodic features for anger detection," in Proc. Fifth International Conference on Speech Prosody (Speech Prosody 2010), May 2010.
[9] H. Schlosberg, "Three dimensions of emotion," Psychological Review, vol. 61, no. 2, pp. 81–88, 1954.
[10] M. Lugger and B. Yang, "Cascaded emotion classification via psychological emotion dimensions using a large set of voice quality parameters," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2008), 2008, pp. 4945–4948.
[11] M. Wöllmer, F. Eyben, B. Schuller, E. Douglas-Cowie, and R. Cowie, "Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks," in Proc. Interspeech 2009, 2009, pp. 1595–1598.
[12] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[13] S. Narayanan, M. Grimm, and K. Kroschel, "The Vera am Mittag German audio-visual emotional speech database," in Proc. IEEE International Conference on Multimedia and Expo (ICME 2008), 2008, pp. 865–868.
[14] H. Pérez, C. García, and L. Villaseñor, "Features selection for primitives estimation on emotional speech," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2010), 2010, pp. 5138–5141.
[15] P. Boersma, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.
[16] K. Santiago, C. A. Reyes G., and M. P. Gómez G., "Conjuntos difusos tipo 2 aplicados a la comparación difusa de patrones para clasificación de llanto de infantes con riesgo neurológico" (Type-2 fuzzy sets applied to fuzzy pattern comparison for the classification of cries of infants at neurological risk), M.S. thesis, INAOE, Tonantzintla, Puebla, Mexico, 2009.
[17] A. L. Reyes, Un Método para la Identificación del Lenguaje Hablado utilizando Información Suprasegmental (A Method for Spoken Language Identification Using Suprasegmental Information), Ph.D. thesis, INAOE, Tonantzintla, Puebla, Mexico, 2007.
[18] T. Dubuisson, T. Dutoit, B. Gosselin, and M. Remacle, "On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination," EURASIP Journal on Advances in Signal Processing, Special Issue on Analysis and Signal Processing of Oesophageal and Pathological Voices, doi:10.1155/2009/173967, 2009.
[19] C. T. Ishi, H. Ishiguro, and N. Hagita, "Proposal of acoustic measures for automatic detection of vocal fry," in Proc. Interspeech 2005, 2005, pp. 481–484.
[20] F. Eyben, M. Wöllmer, and B. Schuller, "openEAR - Introducing the Munich open-source emotion and affect recognition toolkit," in Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction (ACII 2009), 2009, pp. 1–6.
[21] P. Pudil, J. Novovičová, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, pp. 1119–1125, 1994.
[22] R. Kehrein, "The prosody of authentic emotions," in Proc. Speech Prosody 2002 Conference, 2002, pp. 423–426.
