Bilingual Acoustic Feature Selection for Emotion Estimation Using a 3D Continuous Model
Humberto Pérez Espinosa, Carlos A. Reyes García, Luis Villaseñor Pineda
Instituto Nacional de Astrofísica, Óptica y Electrónica
Luis E. Erro 1, Tonantzintla, Puebla, 72840, Mexico
[email protected], [email protected], [email protected]
Abstract— Emotions are complex human phenomena. Numerous researchers have attempted a variety of approaches to model these phenomena and to find the optimal set of emotion descriptors. In this paper, we search for the most appropriate acoustic features to estimate the emotional content of speech based on a continuous multi-dimensional model of emotions. We analyze a set of 6,920 features using the feature selection method known as Linear Forward Selection. We studied the importance of the features by dividing them into groups and working with two databases, one in English and one in German, in order to analyze the multi-lingual importance of the features and to determine whether they are important regardless of the language.
I. INTRODUCTION
Some of the first questions that arise when engaging in Speech Emotion Recognition are the following: What evidence is there that people's emotional states are reflected in their voices? Are emotions reflected in a similar way in all people? Is the way we express emotions in speech dependent on social and cultural factors? Is it possible to obtain objective measures of emotions? Philosophers, anthropologists, psychologists, biologists, and, more recently, computer scientists have attempted to answer these questions.
For example, Charles Darwin established that emotions are
patterns related to survival. These patterns have evolved
to solve certain problems that species faced during their
evolution [1]. In this way, emotions are more or less the
same in all humans and in particular independent of culture.
However, there is a strong debate on this issue among
psychologists who claim that emotions are universal and
psychologists who claim that emotions are culture-dependent
[2]. Both groups of scientists have provided evidence of differences and similarities in the way different cultures express emotions. In fact, some authors have gone as far as defining universal facial emotional expressions. The psychologist Paul Ekman [3] established six facial expressions related to basic emotions, known as The Big Six. Moreover, Izard [2] provides evidence that, for a given culture, recognizing emotions in the facial expressions of people from other cultures is more difficult than doing so for people from one's own culture. Picard established that expressive patterns depend on gender, context, and social and cultural expectations: given that a particular emotion is felt, a variety of factors influence how the emotion is displayed [4]. Psychologists have proposed
models that explain the generation, composition, and classification of emotions. The area of Automatic Emotion Recognition has taken up these models and, based on them, has used two annotation approaches to capture and describe the emotional content of speech: the discrete approach and the continuous approach. The discrete approach is based on the concept of Basic Emotions, such as anger, fear, and happiness, which are the most intense forms of emotion and from which all other emotions are generated by variation or combination. It assumes the existence of universal emotions that can be clearly distinguished from each other by most people. The continuous approach, on the other hand, represents emotional states using a continuous multidimensional space. Emotions are represented by regions in an n-dimensional space where each dimension represents an Emotion Primitive. Emotion Primitives are subjective properties shown by all emotions.
Several authors have analyzed the most important acoustic features from the point of view of discrete categorization, working on mono-lingual [5], [6], [7] and multi-lingual [8] data. However, the importance of acoustic attributes from the point of view of continuous models has not yet been studied in the same depth. In this work we are interested in whether there are acoustic features that allow us to estimate the emotional state from the voice of a person regardless of the language he or she speaks. We also discuss the importance of these features, the amount of information they provide, and which are most important for each language. To accomplish this, we work with two databases of emotional speech, one in English and one in German. We extract a variety of acoustic features and apply feature selection techniques to find the best feature subsets in mono-lingual and bilingual mode. Finally, we discuss the feature groups separately, using metrics that give us a hint of the importance of each group.
II. THREE-DIMENSIONAL CONTINUOUS MODEL
The three-dimensional continuous model that we adopt
in this work represents emotional states using the Emotion
Primitives: Valence, Activation and Dominance [9]. The continuous approach exhibits great potential for modeling the occurrence of emotions in the real world. In a realistic scenario, emotions are not generated in a prototypical or pure modality. They are often complex emotional states, which are a mixture of emotions with varying degrees of intensity or expressiveness. This approach allows a more flexible interpretation of the acoustic properties of emotional speech. Given that it is not tied to a fixed set of prototypical emotions, it is capable of representing any emotional state. By adopting this approach, we avoid some limitations imposed by discrete models, for example, the need to find, define, and name the categorical emotion labels required to represent a sufficiently broad spectrum of emotional states for a specific application. Some authors have begun to research how to take advantage of this theory [10], [11] to estimate emotional expressions more adequately.
• Valence describes how negative or positive a specific emotion is.
• Activation describes the internal excitement of an individual and ranges from very calm to very active.
• Dominance describes the degree of control that the individual intends to take over the situation, or in other words, how strong or weak the individual appears to be.
III. EMOTIONAL SPEECH DATA
To compare the multi-lingual performance of acoustic
features from the point of view of the continuous emotion
models we need at least two databases in different languages
labeled with the same Emotion Primitives. The databases we
used are IEMOCAP and VAM. The former was collected
at the Signal Analysis And Interpretation Laboratory at the
University of Southern California and it is in English [12].
It was recorded from ten actors in male/female pairs. Its annotation includes both approaches: Basic Emotions and the Emotion Primitives Valence, Activation and Dominance. Despite the use of actors, attention was given to eliciting spontaneous interaction, and the database shows a significant diversity of emotional states. The IEMOCAP data set contains 1,820 instances. We selected the instances from the four classes with the most examples (Anger, Happiness, Sadness and Neutral). Furthermore, we removed all instances that have the same annotation for each of the three primitives but a different annotation for the Basic Emotion, since these are regarded as contradictory instances that add noise to the learning process. After this filtering, we worked with a set of 401 instances in the
feature selection process. The second corpus we used for
this work is called VAM Corpus and is described in [13]. It
was collected by Michael Grimm and the Emotion Research Group at the Institut für Nachrichtentechnik of the Universität Karlsruhe (TH), Karlsruhe, Germany, and it is in German. This corpus consists of 12 hours of audio and video recordings of the German talk show "Vera am Mittag". It was labeled with the Emotion Primitives Valence, Activation and Dominance by 17 human evaluators. It contains 947 utterances belonging to 47 speakers (11 m / 36 f) with an average duration of 3.0 seconds per utterance. In the case of the VAM corpus there are no annotations of basic emotions, so we were not able to detect contradictory instances, and we used all 947 instances available. Annotated values were normalized to continuous values between 1 and 5; originally, primitives range from -1 to 1 in the VAM corpus, while in IEMOCAP they range from 1 to 5.
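The normalization formula is not given explicitly in the text; below is a minimal sketch, assuming a simple linear rescaling of the VAM annotations from [-1, 1] onto the [1, 5] scale used by IEMOCAP.

```python
def rescale_vam_to_iemocap(x):
    """Map a VAM primitive value from [-1, 1] to [1, 5].
    The linear form of the mapping is an assumption; the paper only states
    that values were normalized to the 1-5 range."""
    return (x + 1.0) / 2.0 * 4.0 + 1.0  # -1 -> 1, 0 -> 3, +1 -> 5
```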
IV. FEATURE EXTRACTION
We evaluated two sets of features. One of them was obtained through a selective approach, i.e., taking into account features we think could be useful, features that have been successful in related works, and features used for other similar tasks. The second feature set was obtained by applying a brute-force approach, i.e., generating a large number of features in the hope that some will prove useful. The Selective Feature Set is a set of features that we have been building over the course of our research [14] and was extracted with the software Praat [15]. We propose three sets of features: Prosodic, Spectral and Voice Quality. We subdivided the Prosodic features into Elocution Times, Melodic Contour and Energy Contour. We designed this set of features to represent several aspects of the voice. We included the traditional attributes associated with prosody, e.g., Duration, Pitch and Energy, as well as others which have shown good results in tasks like speech recognition, speaker recognition, infant cry classification [16], language recognition [17], and pathological voice detection [18], [19].
TABLE I
Acoustic Features Groups

Group           Approach      Feature Type               # Feats
Prosodic
  Times         Selective     Elocution Times                  8
  F0            Selective     Melodic Contour                  9
  Energy        Selective     Energy Contour                  12
  Energy        Brute Force   LOG Energy                     117
  Times         Brute Force   Zero Crossing Rate             117
  PoV           Brute Force   Probability of Voicing         117
  F0            Brute Force   F0 Contour                     234
Voice Quality
  Voice Quality Selective     Quality Descriptors             24
  Voice Quality Selective     Articulation                    12
Spectral
  LPC           Selective     Fast Fourier Transform           4
  LPC           Selective     Long Term Average                5
  LPC           Selective     Wavelets                         6
  MFCC          Selective     MFCC                            96
  Cochleagrams  Selective     Cochleagram                     96
  LPC           Selective     LPC                             96
  MFCC          Brute Force   MFCC                         1,521
  MEL           Brute Force   MEL Spectrum                 3,042
  SEB           Brute Force   Spectral Energy in Bands       469
  SROP          Brute Force   Spectral Roll Off Point        468
  SFlux         Brute Force   Spectral Flux                  117
  SC            Brute Force   Spectral Centroid              117
  SMM           Brute Force   Spectral Max and Min           233
Total                                                      6,920
The brute-force feature set was extracted using the software openEAR [20]. We extracted a total of 6,552 features, including first-order functionals of low-level descriptors (LLDs) such as FFT spectrum, Mel spectrum, MFCC, Pitch (fundamental frequency F0 via ACF), Energy, spectral descriptors and LSP, together with their deltas and double deltas. Thirty-nine functionals were applied, such as Extremes, Regression, Moments, Percentiles, Crossings, Peaks and Means. Table I shows the number of acoustic features included in each feature group.
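To illustrate the brute-force approach, the sketch below applies a handful of statistical functionals to a frame-level low-level descriptor contour (e.g., an F0 or energy track) and to its delta; the functional list and contour values are assumed examples for illustration, not the exact thirty-nine functionals computed by openEAR.

```python
import numpy as np

def functionals(contour):
    """Collapse a 1-D frame-level LLD contour into utterance-level statistics."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    slope = np.polyfit(t, x, 1)[0] if len(x) > 1 else 0.0  # linear regression slope
    return {
        "mean": x.mean(), "std": x.std(),
        "min": x.min(), "max": x.max(), "range": x.max() - x.min(),
        "p25": np.percentile(x, 25), "p75": np.percentile(x, 75),
        "slope": slope,
    }

# Example: functionals of a made-up F0 contour and of its delta contour.
f0 = np.array([110.0, 115.0, 130.0, 128.0, 120.0])
feats = functionals(f0)
feats.update({"delta_" + k: v for k, v in functionals(np.diff(f0)).items()})
```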
V. FEATURE SELECTION
We needed to devise a way to select the best features
among a large number of them, bearing in mind that we
have relatively few instances. With a limited amount of
data, an excessive amount of attributes significantly delays
the learning process and often results in over-fitting. Figure 1 shows the process we followed for selecting the best subsets of attributes. For mono-lingual selection we started
from an Initial Feature Set obtained from a feature selection
process applied to the VAM corpus in a previous work
[14]. This initial feature set achieved good correlation results
in the estimation of Emotional Primitives in VAM. This
selection process was carried out with 252 features obtained
by selective approach and 949 instances. Later, we conducted
the instance selection process explained in Section III (only
for IEMOCAP). Finally, we applied the feature selection process known as Linear Floating Forward Selection (LFFS) [21], which performs a hill-climbing search: starting with the Initial Feature Set, it evaluates all possible inclusions of a single attribute from the extended feature set into the solution subset. At each step the attribute with the best evaluation is added, and the search ends when no inclusion improves the evaluation. In addition, LFFS dynamically changes the number of features included or eliminated at each step. In our experiments we used the LFFS modality called Fixed Width. In this mode the search is not performed on the whole feature set: initially, the k best features are selected and the rest are removed, and in each iteration the features added to the solution set are replaced by features taken from those that had been eliminated. This feature selection process is repeated for each Emotion Primitive.
Fig. 1. Feature Selection Scheme
In the case of bilingual feature selection, the union of the two feature sets obtained from the mono-lingual selections was taken as the initial feature set.
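A minimal sketch of the fixed-width forward selection described above, assuming a generic evaluate(subset) scoring callback (e.g., cross-validated correlation of a regressor); the ranking step, the width k, and the omission of the replacement of eliminated candidates are simplifications, not the exact LFFS implementation used in this work.

```python
def forward_selection_fixed_width(features, evaluate, initial, k=200):
    """Greedy forward selection restricted to the k individually best candidates.

    features -- list of all candidate feature names
    evaluate -- callable mapping a feature subset (list) to a score (higher is better)
    initial  -- the Initial Feature Set to start from
    """
    # Keep only the k candidates with the best individual merit.
    ranked = sorted(features, key=lambda f: evaluate([f]), reverse=True)
    candidates = [f for f in ranked[:k] if f not in initial]

    selected = list(initial)
    best_score = evaluate(selected)
    improved = True
    while improved and candidates:
        improved = False
        # Try every single-attribute inclusion and keep the best one.
        score, best_f = max((evaluate(selected + [f]), f) for f in candidates)
        if score > best_score:
            selected.append(best_f)
            candidates.remove(best_f)
            best_score = score
            improved = True  # stop when no inclusion improves the evaluation
    return selected, best_score
```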
VI. FEATURE SELECTION RESULTS
All the results of the learning experiments were obtained
using Support Vector Machine for Regression (SMOreg) and
validated by 10-fold cross validation. The metrics used to measure the importance of the feature groups are Pearson's Correlation Coefficient, Share and Portion. The Correlation Coefficient is the most common metric for measuring the performance of machine learning algorithms on regression tasks such as ours. Share and Portion are measures proposed in [7] to assess the impact of different types of features on the performance of automatic recognition of discrete categorical emotions.
Correlation coefficient: Indicates the strength and direction of the linear relationship between the annotated primitives and the primitives estimated by the trained model. It is our main metric for measuring the estimation results.
Share: Shows the contribution of each type of feature. For example, if 28 features are selected from the group Duration and 150 features are selected in total:
Share = (28 x 100) / 150 = 18.7
Portion: Shows the contribution of each type of feature weighted by the number of features of that type. For example, if 28 Duration features are selected from a Duration feature set of 391:
Portion = (28 x 100) / 391 = 7.2
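These metrics are straightforward to compute; the short sketch below (numpy is used only for the correlation) reproduces the example figures above.

```python
import numpy as np

def correlation(y_true, y_pred):
    """Pearson's correlation between annotated and estimated primitive values."""
    return np.corrcoef(y_true, y_pred)[0, 1]

def share(selected_in_group, selected_total):
    """Percentage of the final feature set contributed by one group."""
    return 100.0 * selected_in_group / selected_total

def portion(selected_in_group, group_size):
    """Percentage of a group's own features that entered the final set."""
    return 100.0 * selected_in_group / group_size

print(round(share(28, 150), 1))    # 18.7
print(round(portion(28, 391), 1))  # 7.2
```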
Having identified the best acoustic feature sets we con-
structed individual classifiers to estimate each Emotion Prim-
itive. Tables II, III and IV reflect the effectiveness of each
feature group and the differences among languages. Groups
PoV and SC are not shown in these tables given that none
of the features belonging to those groups were selected by
the LFS algorithm in the experiments. It is important to
note that the Correlation, Share and Portion shown in these tables are obtained by decomposing the solution set found by the selection process into feature groups and evaluating each group separately with these metrics. For certain groups, the feature selection process did not include any of their features in the solution set (shown with a dash). The last row shows the total number of features selected for the Emotion Primitive. Results in these tables are presented in the format: Results for English / Results for German / Results for Bilingual. The mono-lingual results for English and German were obtained by training SMOreg models with the features selected for each language and each primitive separately and evaluating them by 10-fold cross validation. The bilingual result was obtained by building models for each primitive using the instances of both languages and the feature set obtained from the bilingual feature selection.
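The experiments were run with Weka's SMOreg; purely as an analogous illustration, the sketch below evaluates one Emotion Primitive with a support vector regressor and 10-fold cross validation in scikit-learn (the library, kernel choice, and placeholder data are assumptions, not the original setup).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

def evaluate_primitive(X, y, folds=10):
    """10-fold cross-validated Pearson correlation for one Emotion Primitive.
    X: instances x selected features, y: annotated primitive values."""
    model = SVR(kernel="rbf")  # stand-in for Weka's SMOreg
    y_hat = cross_val_predict(model, X, y, cv=folds)
    return np.corrcoef(y, y_hat)[0, 1]

# Placeholder example: 401 instances and 62 selected features (random data).
rng = np.random.default_rng(0)
X = rng.normal(size=(401, 62))
y = rng.uniform(1, 5, size=401)
print(evaluate_primitive(X, y))
```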
A. Mono-Lingual Feature Selection
For Valence, we obtained better results in English, reach-
ing a correlation index of 0.61. For English, MEL had the
best performance (0.51). For German, three groups reached 0.34: MEL, F0, and LPC. We can see that while MEL, LPC and MFCC are important for both languages, F0 and SEB were important for German but not for English. A striking
aspect is that the MEL group, with a very low Portion (0.24
/ 0.06), i.e. with very few selected features (7 / 2) with
respect to total features in the group (3,042), achieved the
best correlation of all groups for both languages. For Valence,
it is difficult to infer intuitively which prosodic features are
related to positive and negative emotions. For example, there are negative emotions with opposite Energy values, such as anger, whose Energy is high, and sadness, whose Energy is low, and positive emotions with opposite Times values, such as excitement, whose Times are quick, and relaxation, whose Times are slow.
TABLE II
English / German / Bilingual Feature Selection Results - Valence

Feature Group   Total   Selected       Correlation         Share                  Portion
Voice Quality      36   6 / 3 / 9      0.29 / 0.08 / 0.26  9.67 / 4.76 / 5.92     16.66 / 8.33 / 25.00
Times             125   1 / 1 / 6      0.33 / 0.05 / 0.21  1.61 / 1.58 / 3.95     0.80 / 0.80 / 4.8
Cochleagrams       96   8 / 0 / 10     0.40 / - / 0.33     12.90 / 0.00 / 6.58    8.33 / 0.00 / 10.42
LPC               111   13 / 8 / 14    0.34 / 0.34 / 0.43  20.96 / 12.69 / 9.21   11.71 / 7.20 / 12.61
sFlux             117   0 / 1 / 8      - / 0.30 / 0.39     0.00 / 1.58 / 5.26     0.00 / 0.85 / 6.84
Energy            129   0 / 0 / 6      - / - / 0.12        0.00 / 0.00 / 3.95     0.00 / 0.00 / 4.65
F0                243   0 / 4 / 6      - / 0.34 / 0.30     0.00 / 6.35 / 3.95     0.00 / 1.64 / 2.47
SpecMaxMin        233   0 / 0 / 0      - / - / -           0.00 / 0.00 / 0.00     0.00 / 0.00 / 0.00
SEB               469   0 / 5 / 10     - / 0.33 / 0.29     0.00 / 7.93 / 6.58     0.00 / 1.07 / 2.13
SROP              468   0 / 0 / 6      - / - / 0.20        0.00 / 0.00 / 3.95     0.00 / 0.00 / 1.28
MFCC            1,617   27 / 39 / 51   0.48 / 0.29 / 0.39  43.54 / 61.90 / 33.55  1.67 / 2.41 / 3.15
MEL             3,042   7 / 2 / 26     0.51 / 0.34 / 0.41  11.29 / 3.17 / 17.10   0.24 / 0.06 / 0.85
All             6,920   62 / 63 / 152  0.61 / 0.49 / 0.53  100 / 100 / 100        0.89 / 0.91 / 2.20
TABLE III
English / German / Bilingual Feature Selection Results - Activation

Feature Group   Total   Selected       Correlation         Share                  Portion
Voice Quality      36   3 / 4 / 6      0.08 / 0.33 / 0.31  5.00 / 8.16 / 5.45     8.33 / 11.11 / 16.67
Times             125   1 / 0 / 1      0.16 / - / 0.03     1.66 / 0.00 / 0.91     0.80 / 0.00 / 0.80
Cochleagrams       96   24 / 7 / 32    0.77 / 0.71 / 0.75  40.00 / 14.28 / 29.09  25.00 / 7.29 / 33.33
LPC               111   4 / 3 / 6      0.61 / 0.72 / 0.65  6.66 / 6.12 / 5.45     3.60 / 2.70 / 5.41
sFlux             117   0 / 3 / 4      - / 0.76 / 0.57     0.00 / 6.12 / 3.64     0.00 / 2.56 / 3.42
Energy            129   4 / 4 / 7      0.76 / 0.59 / 0.61  6.66 / 8.16 / 6.36     3.10 / 3.10 / 5.43
F0                243   1 / 4 / 4      0.29 / 0.54 / 0.44  1.667 / 8.16 / 3.64    0.412 / 1.64 / 1.65
SpecMaxMin        234   0 / 0 / 0      - / - / -           0.00 / 0.00 / 0.00     0.00 / 0.00 / 0.00
SEB               234   0 / 1 / 1      - / 0.65 / 0.31     0.00 / 2.04 / 0.91     0.00 / 0.21 / 0.21
SROP              468   0 / 0 / 0      - / - / -           0.00 / 0.00 / 0.00     0.00 / 0.00 / 0.00
MFCC            1,617   23 / 22 / 42   0.77 / 0.77 / 0.75  38.33 / 44.89 / 38.18  1.42 / 1.36 / 2.59
MEL             3,042   0 / 1 / 7      - / 0.65 / 0.59     0.00 / 2.04 / 6.36     0.00 / 0.32 / 0.23
All             6,920   60 / 49 / 110  0.79 / 0.82 / 0.81  100 / 100 / 100        0.86 / 0.70 / 1.58
TABLE IV
English / German / Bilingual Feature Selection Results - Dominance

Feature Group   Total   Selected       Correlation         Share                  Portion
Voice Quality      36   0 / 4 / 4      - / 0.29 / 0.26     0.00 / 6.67 / 4.25     0.00 / 11.11 / 11.11
Times             125   2 / 0 / 2      0.35 / - / 0.07     6.45 / 0.00 / 2.13     1.60 / 0.00 / 1.60
Cochleagrams       96   7 / 12 / 16    0.70 / 0.75 / 0.69  22.58 / 20.00 / 17.02  7.29 / 12.50 / 16.67
LPC               111   1 / 2 / 3      0.66 / 0.60 / 0.64  3.22 / 3.33 / 3.19     0.90 / 1.80 / 2.70
sFlux             117   1 / 2 / 1      0.61 / 0.73 / 0.53  3.22 / 3.33 / 1.06     0.85 / 1.70 / 0.85
Energy            129   2 / 5 / 12     0.15 / 0.65 / 0.67  6.45 / 8.33 / 12.76    1.15 / 3.87 / 9.30
F0                243   3 / 4 / 7      0.22 / 0.52 / 0.40  9.67 / 6.66 / 7.45     1.23 / 1.64 / 2.88
SpecMaxMin        234   0 / 0 / 0      - / - / -           0.00 / 0.00 / 0.00     0.00 / 0.00 / 0.00
SEB               234   0 / 0 / 0      - / - / -           0.00 / 0.00 / 0.00     0.00 / 0.00 / 0.00
SROP              468   0 / 0 / 0      - / - / -           0.00 / 0.00 / 0.00     0.00 / 0.00 / 0.00
MFCC            1,617   11 / 30 / 44   0.71 / 0.74 / 0.70  35.48 / 50.00 / 46.81  0.68 / 1.86 / 2.72
MEL             3,042   4 / 1 / 5      0.68 / 0.63 / 0.56  12.90 / 1.67 / 5.32    0.13 / 0.03 / 0.16
All             6,920   31 / 60 / 94   0.74 / 0.81 / 0.74  100 / 100 / 100        0.44 / 0.86 / 1.36
On the other hand, for Activation it is reasonable to think that the faster and louder we speak the more active we are perceived to be, and that the slower and lower we speak the more passive we are perceived to be [22]. Therefore,
intuitively, groups of features that model prosodic aspects
such as Energy and Times should be more important to
estimate Activation. In the experiments we could confirm
that indeed Energy was important in both languages (0.76
/ 0.59), but Times did not provide valuable information for estimating this primitive. This makes us doubt that the Times features used here adequately reflect the speech-rate phenomena for the languages we are working with.
For Activation the best group was MFCC (0.77 / 0.77),
with a high percentage of the total features of the final set
(38.33 / 44.89). We can see that this group by itself gets
a performance similar to the performance of the solution
set (0.79 / 0.82). Other important groups for Activation are
Cochleagrams that obtained good results for both languages
(0.77 / 0.71), LPC (0.61 / 0.72) and as mentioned previously
Energy (0.76 / 0.59). This primitive did not show significant
differences when estimating it in both languages (0.79 /
0.82). As with Activation, for Dominance the best groups were MFCC and Cochleagrams. This primitive shows many similarities in both languages; almost all groups agreed on their degree of importance in both languages. The group with the least agreement is Energy, which intuitively might be a good indicator that a person is trying to control a situation. For English, Energy proved to be a poor indicator, but it was good for German (0.15 / 0.65). Other important groups for this primitive are LPC (0.66 / 0.60), sFlux (0.61 / 0.73) and MEL (0.68 / 0.63). For German, almost twice as many features as for English were selected, while the proportion of selected features per group, reflected in Share and Portion, was retained.
B. Bilingual Feature Selection
In the case of Valence, the best correlation was obtained
with LPC and MEL. We can see that when we estimate
bilingual Valence, correlation (0.53) was at an intermediate
point between the mono-lingual English (0.61) and mono-
lingual German (0.49). Bilingual selection added attributes
that had not been considered in a mono-lingual way such
as Energy and SROP. For the case of Activation, the best correlation for a group was obtained by MFCC, Cochleagrams, LPC and Energy, similar to what happened for English. We can see that the correlation when estimating Activation on the bilingual set was good (0.81) compared with the results obtained by estimating Activation in the two languages separately (0.79 / 0.82). Indeed, the correlation when estimating the two languages together is better than the correlation when estimating only English. For Dominance, the best groups were MFCC, Cochleagrams, and Energy, which agrees with the mono-lingual selection. The correlation when estimating bilingual Dominance was the same as for the English estimation (0.74) and worse than for the German estimation (0.81).
VII. CROSS-LINGUAL PERFORMANCE
A. Experiment 1
We used the selected features in a cross-lingual mode. That is, we used the features found for English to estimate the Emotion Primitives in German and vice versa. To evaluate this experiment we performed a 10-fold cross validation using only data from one language at a time. The idea of this scenario is to analyze whether the features that are good for a particular language are also good for the other. For example, the correlation for Valence in German using the features found for German is 0.49, while using those found for English it decreases greatly, to 0.33. For Activation it decreases from 0.82 to 0.78, which is a much smaller drop. In all cases the correlation in the estimation of the Primitives decreases, although in some cases the difference is much larger than in others.
B. Experiment 2
We used the features found in bilingual selection. To
evaluate it, we trained with the instances of a language and
tested with the instances of the other language. The idea in
TABLE V
Experiment 1 Correlation index obtained in mono-lingual emotion
primitives estimation for cross-lingual feature selection
Emotion Primitive    English    German
Valence              0.5163     0.3372
Activation           0.7678     0.7867
Dominance            0.7080     0.7779
this scenario is to examine whether the patterns learned to
estimate the Emotion Primitives in a language can be used to
estimate the Emotion Primitives in another language. In this case, we can see that the accuracy of the estimations decreases greatly and that this task is difficult, especially when using the patterns learned in English to estimate Emotion Primitives in German; in the opposite direction the deterioration is not as strong.
TABLE VI
Experiment 2 Train/Test Correlation index obtained in cross-lingual
primitives estimation for bilingual feature selection
Emotion Primitive    German/English    English/German
Valence              0.2676            0.0146
Activation           0.6183            0.5470
Dominance            0.6927            0.1156
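Experiment 2 is a cross-corpus evaluation: train on one language, test on the other. Below is a minimal sketch, again using scikit-learn's SVR only as a stand-in for SMOreg and assuming feature matrices already restricted to the bilingually selected features; all variable names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

def cross_lingual_correlation(X_train, y_train, X_test, y_test):
    """Train on one language, test on the other, report Pearson correlation."""
    model = SVR(kernel="rbf")  # stand-in for SMOreg
    model.fit(X_train, y_train)
    return np.corrcoef(y_test, model.predict(X_test))[0, 1]

# e.g., German -> English for Valence (hypothetical variable names):
# r = cross_lingual_correlation(X_de, y_de_valence, X_en, y_en_valence)
```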
C. Experiment 3
We used the features found in the bilingual selection.
To evaluate, we performed a 10-fold cross validation using
data of only one language at a time. The idea here is to
check whether the features found in the bilingual selection
are complementary in some way when estimating samples
of one language at a time. We can see that this was not the case: apparently, there are features that help in one language but somehow harm the estimation in the other, given that the correlation decreases. For example, Activation in this scenario is (0.75 / 0.80), while with mono-lingual selection we obtained (0.79 / 0.82).
TABLE VII
Experiment 3 Correlation index obtained in mono-lingual emotion
primitives estimation for bilingual feature selection
Emotion Primitive    English    German
Valence              0.5992     0.3168
Activation           0.7557     0.8048
Dominance            0.7255     0.7861
VIII. CONCLUSIONS
We performed a study on the importance of different
acoustic features to determine the emotional state of indi-
viduals in two languages, English and German. The study is
based on a continuous three-dimensional model of emotions,
which considers the emotional state by estimating the level of
the emotion primitives Valence, Activation and Dominance.
We analyzed each Emotion Primitive separately. Through the identification of the best features, the automatic estimation of Emotion Primitives becomes more accurate, and thus the recognition and classification of people's emotional states
should improve. We have taken some ideas used in the study
of the impact of features in the classification of discrete
emotional categories, like Share and Portion metrics, and we
have applied them to the continuous approach. We divided
our set of 6,920 features into 12 groups according to their
acoustic properties. We calculated some metrics for each
group in order to estimate their performance in the automatic
estimation of Emotion Primitives (Correlation Coefficient)
and their contribution to the final set of features (Share and
Portion). We worked with two databases of emotional speech.
We realized that Spectral feature groups are very important
for the three primitives. According to our results the most
important feature groups across languages are:
• Valence: LPC - MEL - MFCC - sFlux
• Activation: MFCC - Cochleagrams - LPC - Energy
• Dominance: MFCC - Cochleagrams - Energy - LPC
Clearly, the MFCC and LPC groups are very important for estimating the three Emotion Primitives, since they appear among the most important groups for all three. Other spectral information groups like sFlux, Cochleagrams and MEL were also important; from this we can conclude that Spectral analysis is more important than Prosodic and Voice Quality analysis for Emotion Primitive estimation. The prosodic feature group Energy was also very important for Activation
and Dominance. In the experiments performed here we can
see that the features selected for a particular language had
an acceptable performance in the other language. Correlation
was better when we used features selected for only one
language, but the difference was not very notable. On the
other hand, we see that when we crossed the trained models
to estimate Emotion Primitives, the correlation decreased
dramatically, mainly for Dominance and Valence. From these
two facts, we can conclude that emotional states can be
estimated using a similar set of acoustic features. However,
the patterns shown by these features are difficult to take from
one language to another without any adaptation. When we
learned from both languages we obtained a correlation lower than the highest monolingual correlation and higher than or equal to the lowest monolingual correlation, so we can conclude that it is possible to identify common patterns in both languages using a feature set that works for both.
The two databases also differ in aspects other than language, so differences attributed to language may be magnified by other factors. One such aspect is the performance of acoustic features on acted emotions in a controlled environment (IEMOCAP) versus spontaneous emotions in an uncontrolled environment (VAM). It is not clear that results are better or worse for either of the two languages: one would expect results to be better for the acted database, yet for Activation and Dominance the results are better for VAM. Another factor that may have an effect is that the number of IEMOCAP instances used is less than half the number used for VAM.
REFERENCES
[1] C. Darwin, The Expression of the Emotions in Man and Animals, 3rd ed., Oxford University Press, Oxford, 1998 (original work published 1872).
[2] H. A. Elfenbein, M. K. Mandal, N. Ambady, S. Harizuka, and S. Kumar, "Cross-cultural patterns in emotion recognition: Highlighting design and analytical techniques," Emotion, vol. 2, no. 1, pp. 75-84, 2002.
[3] P. Ekman, "Universals and cultural differences in facial expressions of emotion," in J. R. Cole, Ed., Nebraska Symposium on Motivation, pp. 207-283, 2002.
[4] R. W. Picard, Affective Computing, The MIT Press, 1st edition, 2000.
[5] B. Xie, L. Chen, G.-C. Chen, and C. Chen, Statistical Feature Selection for Mandarin Speech Emotion Recognition, Springer Berlin / Heidelberg, 2005.
[6] M. Lugger and B. Yang, "An incremental analysis of different feature groups in speaker independent emotion recognition."
[7] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir, "Whodunnit - searching for the most important feature types signalling emotion-related user states in speech," Comput. Speech Lang., vol. 25, no. 1, pp. 4-28, 2010.
[8] T. Polzehl, A. Schmitt, and F. Metze, "Approaching multilingual emotion recognition from speech - on language dependency of acoustic/prosodic features for anger detection," in Proc. of the Fifth International Conference on Speech Prosody (Speech Prosody 2010), May 2010.
[9] H. Schlosberg, "Three dimensions of emotion," Psychological Review, vol. 61, no. 2, pp. 81-88, 1954.
[10] M. Lugger and B. Yang, "Cascaded emotion classification via psychological emotion dimensions using a large set of voice quality parameters," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2008), pp. 4945-4948, Institute of Electrical and Electronics Engineers, 2008.
[11] M. Wöllmer, F. Eyben, B. Schuller, E. Douglas-Cowie, and R. Cowie, "Data-driven clustering in emotional space for affect recognition using discriminatively trained LSTM networks," in Interspeech 2009, pp. 1595-1598, International Speech Communication Association, 2009.
[12] C. Busso, M. Bulut, C. C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, 2008.
[13] S. Narayanan, M. Grimm, and K. Kroschel, "The Vera am Mittag German audio-visual emotional speech database," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2008), pp. 865-868, 2008.
[14] H. Pérez, C. García, and L. Villaseñor, "Features selection for primitives estimation on emotional speech," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2010), pp. 5138-5141, Institute of Electrical and Electronics Engineers, 2010.
[15] P. Boersma, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, no. 9/10, pp. 341-345, 2001.
[16] K. Santiago, C. A. Reyes G., and M. P. Gómez G., "Conjuntos difusos tipo 2 aplicados a la comparación difusa de patrones para clasificación de llanto de infantes con riesgo neurológico," M.S. thesis, INAOE, Tonantzintla, Puebla, Mexico, 2009.
[17] A. L. Reyes, Un Método para la Identificación del Lenguaje Hablado utilizando Información Suprasegmental, Ph.D. thesis, INAOE, Tonantzintla, Puebla, Mexico, 2007.
[18] T. Dubuisson, T. Dutoit, B. Gosselin, and M. Remacle, "On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination," EURASIP Journal on Advances in Signal Processing, Analysis and Signal Processing of Oesophageal and Pathological Voices, doi:10.1155/2009/173967, 2009.
[19] C. T. Ishi, H. Ishiguro, and N. Hagita, "Proposal of acoustic measures for automatic detection of vocal fry," in Interspeech 2005, pp. 481-484, International Speech Communication Association, 2005.
[20] F. Eyben, M. Wöllmer, and B. Schuller, "openEAR - introducing the Munich open-source emotion and affect recognition toolkit," in Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2009, pp. 1-6, 2009.
[21] P. Pudil, J. Novovičová, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, pp. 1119-1125, 1994.
[22] R. Kehrein, "The prosody of authentic emotions," in Speech Prosody Conference 2002, pp. 423-426, 2002.