Djairhó Geuens

Non-intrusive Emotion Recognition using Computer Peripheral Input Analysis

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Supervisors: Prof. dr. ir. Sofie Van Hoecke, Prof. dr. ir. Rik Van de Walle
Counsellor: Olivier Janssens

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Academic year 2015-2016
Permission to Use
The author(s) gives (give) permission to make this master's dissertation available for consultation and to copy parts of this master's dissertation for personal use.
In the case of any other use, the copyright terms have to be respected, in particular with
regard to the obligation to state expressly the source when quoting results from this master’s
dissertation.
Ghent, 1 June 2016
Preface
For a long time I have been fascinated by the promising possibilities of computer systems that are emotionally aware. In these rapidly changing times, traditional ways of interacting with computers are outdated and the need for more human-like interaction grows. I am therefore very glad to have had the opportunity to explore one aspect of this research area. I would like to convey my sincere appreciation to my counsellor Olivier Janssens for his great guidance throughout the creation of this work. I would also like to express my gratitude to the people who participated in my study for being willing to provide their personal data in order for this research to succeed. Finally, I would like to thank Juanita Van Dam for the support provided during my entire time as a student and for reading and revising this work.
Djairho Geuens
Ghent, 1 June 2016
Non-intrusive Emotion Recognition using Computer Peripheral Input Analysis
by
Djairho Geuens
Master’s dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering
Academic year 2015–2016
Supervisors: Prof. dr. ir. Sofie Van Hoecke, Prof. dr. ir. Rik Van de Walle
Counsellor: Olivier Janssens
Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Ghent University
Abstract
Emotional intelligence is a crucial part of human communication and thus to build a truly
intelligent computer system, emotion recognition is required. A computer system that is able
to recognize emotions can use this information to acquire a much broader context to make
decisions, adapt itself or interact with the user. Current solutions for building artificial emo-
tional intelligence are limited by the fact that they require expensive and intrusive hardware.
Different studies have been conducted to overcome these issues using keystroke dynamics,
text content analysis, mouse movement analysis and other contextual factors. However, these
studies mainly use fixed patterns of text which again limits the real-world applicability. This
research combines and compares these techniques, without the aforementioned limitations, to build a system that can recognize emotions from normal computer interaction without influencing the user. A field study was conducted to collect interaction data and
emotional state information. The interaction data was collected during normal user activ-
ity and the emotional state information was collected using self-reports. Machine learning
models were built for different emotional dimensions. From the cross-validation results, some well-performing models that achieve accuracies up to 73% are presented; these can be used in real-world applications and further enhanced in future studies. We also provide our dataset, in addition to theoretical concepts and other suggestions that should be explored in future work.
Index Terms
emotion, recognition, non-intrusive, keystroke, mouse, context, machine learning
Non-intrusive Emotion Recognition using Computer Peripheral Input Analysis

Djairho Geuens
Supervisors: Sofie Van Hoecke, Rik Van de Walle
Counsellor: Olivier Janssens
Ghent University
Abstract—A computer system that is able to recognize emotions can use this information to acquire a much broader context to make decisions, adapt itself or interact with the user. Current solutions for building artificial emotional intelligence are limited by the fact that they require expensive and intrusive hardware. These issues can be overcome using keystroke dynamics, text content analysis, mouse movement analysis and/or other factors. However, most studies in this research area focus on only one of these techniques and only use fixed patterns of text. A system that can recognize emotions from normal computer interaction was built by combining the previously mentioned techniques, and by using non-fixed text the aforementioned limitation was negated. A field study was conducted to collect interaction data and emotional state information. The interaction data was collected during normal user activity and the emotional state information was collected using self-reports. From this data, features were extracted and different models were built based on these features. The best results in this research present classification accuracies up to 73%. We also provide our dataset, in addition to theoretical concepts and other suggestions that should be explored in future work.
Index Terms—emotion, recognition, non-intrusive, keystroke, mouse, context, machine learning.
I. INTRODUCTION
NOWADAYS, we live in an era in which one cannot imagine the absence of computers. They provide us daily with support in all kinds of branches of society. Recently, a lot of progress has been made in the area of artificially intelligent systems and all sorts of applications of machine learning. While the advantages of these techniques in everyday systems are countless and undeniable, they are limited by their lack of understanding and incorporation of human emotional context in different processes.
To realize any form of emotional intelligence, different requirements have to be met. First of all, there needs to be a method that computer systems can use to observe affective states in users. During the last few years, research regarding such methods has increased. Unfortunately, the solutions that have been proposed are often limited by different factors. A solution should be non-intrusive for the user. This is important because when a user knows that he is being observed, it is possible that he – whether deliberately or not – will change his affective state. This is of course undesirable.
D. Geuens is a graduate student at the Faculty of Engineering and Architecture, Ghent University, Ghent, Belgium; e-mail: [email protected].
S. Van Hoecke, R. Van de Walle and O. Janssens are with the Department of Electronics and Information Systems at Ghent University.
Furthermore, a minimum of required devices is desired because it is unlikely that real-world applications are going to make use of different kinds of additional sensor hardware. Often, such specialized equipment can be very expensive because it is not present in home or office environments by default or because it is medical equipment. This again limits the real-world application possibilities of the solution.
Systems that can automatically infer affective states by analyzing the user's typing behaviour and mouse movements as well as text content and other contextual information were investigated. The logging of keystrokes and mouse movements can be done using a specific piece of software running in the background on the computer, making it almost undetectable by the average user. Thus, the affective state is minimally influenced by the data collection. Mouse and keyboard are also standard equipment in normal home and office environments and are very inexpensive. This gives many possibilities for the application of affective computing solutions using keystroke dynamics and mouse movements in a real-world environment on a large scale.
A field study was conducted using experience sampling. Participants were recorded during their daily activities. The intention was to record emotions in the moment instead of retrospectively (at a later time or another place). To enforce this technique, software was installed on the participants' computers. The software ran as a background process on the computer. The participants were free to use their computers as usual in order to avoid external intrusiveness and influence. Subsequently, different features were extracted from the collected raw data. To build a model that can accurately predict a user's emotional state, the use of Random Forest regression models, Random Forest classification models, SVM regression models and SVM classification models was examined. Two main types of models were built: general models and individual models. A general model entails one model that can predict the emotional state regardless of the user. Such a model would be of great value if it can make highly accurate predictions. An individual model builds one model for each user using only data from this user. This concept is based on the assumption that every person is unique and thus will behave uniquely in different emotional states.
II. RELATED WORK
A. Emotion Theory
Before discussing techniques to assess emotions, one should have a good understanding of what is actually meant when using the words "affect", "mood" and "emotion". In this research the definitions given by Forgas [1] are used. Affect is used as a general term to refer to the combination of moods and emotions. Moods have a rather low intensity, a long duration and little cognitive content. Emotions are much more intense, of a shorter duration and clearer in terms of cognitive content. A lot of psychological research has been performed, which has resulted in a number of different models to explain human emotional behaviour. Two different models that have been developed will be discussed: categorical and dimensional models.
The categorical or discrete model is based on how emotions are described through language. Ekman presented six basic emotions [2]: anger, surprise, happiness, disgust, sadness and fear. He performed research on facial expressions and how people in different cultural environments recognize them. He found that the six proposed basic emotions were commonly recognizable in most cultures. The six basic emotions were later expanded to 15 basic emotions [3]. However, many other sets of categories have been proposed as well [4].
Another very popular model is the dimensional model, presented by Russell [5]. This model defines emotional states as points in a continuous dimensional space. It has been suggested that continuous space models perform better in out-of-lab applications than discrete models [6]. The dimensional space used can be either uni-dimensional or multi-dimensional. The PANAS (Positive And Negative Affect Scales) model is a popular uni-dimensional model [7]. The PAD (pleasure-arousal-dominance) model is a three-dimensional model that was developed by Mehrabian and Russell [8] and will also be used in this research. The pleasure dimension indicates the valence measure of an emotional state. The arousal dimension indicates the level of affective activation of an emotional state. The dominance dimension is used to indicate the amount of power or control a person experiences in an emotional state. Often, the PA model is used, which is a simpler version of the PAD model that leaves out the dominance dimension. However, this simpler PA model has been criticized [9] because it might not be possible to fully differentiate between several emotions (e.g. anger and fear). Furthermore, it was found after analyzing data from Bradley & Lang's experiment [10] that the emotional data scattered in a V-shape and showed some clear holes in the PA space. Recently, Lövheim [11] proposed a new three-dimensional emotion model that uses the monoamine neurotransmitters serotonin, dopamine and noradrenaline as dimensions instead of pleasure, arousal and dominance.
B. Emotional Experimentation Environment
There are two main approaches to collect emotion data. One approach is to induce moods in the participants in a laboratory setting and then collect the desired data. This requires a mood induction procedure (MIP). A number of MIP categories exist [12], including imagination, social interaction and using film or story. While MIPs have the advantage of being able to control the moods of the participant, the experiments often take a considerably long time and do not always guarantee successful induction. Furthermore, individuals may react differently in a laboratory setting compared to a more naturalistic setting because the experimental situation may influence the individual [13].
A naturalistic setting during an experiment aims to observe participants in their natural environment to have minimal influence on the participant's behaviour. The experience sampling methodology (ESM) [14] is such an approach. In this technique a participant's experiences are recorded as they occur at certain moments in time. This has the advantage that it can capture daily life from moment to moment without recall issues, and that it is much easier for participants to indicate their experiences at the moment they are experiencing them. A main disadvantage of using a naturalistic setting is that all techniques provide the experimenter with subjective information. This could lead to biased information, and individuals may repress certain information or change their responses to fit the norms of their culture.
C. Keystroke Dynamics
Keystroke dynamics is the study of the characteristics that are present in a user's typing rhythm when using a keyboard. An individual's typing behaviour can change when the individual experiences different emotions. This means that information about a user's typing behaviour may allow us to infer the user's emotional state. Two main approaches for keystroke dynamics analysis exist: fixed text analysis and free text analysis. The former usually requires participants to type one or more fixed pieces of text multiple times during the data collection process. Using fixed text to build a model implies that this model can only be used at those moments that the user types one or more of the fixed pieces of text on which the model was trained. The latter allows any sequence of text as input. This enables the model to be used during continuous monitoring, which is very desirable for affect recognition because it means that emotional information is available in the computer system at any time. Furthermore, using this approach it is also possible to obtain the keystroke data unobtrusively, which is also highly desirable. Typical keystroke dynamics features are keypress duration and latencies between multiple keystrokes.
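To make these features concrete, the following sketch computes keypress durations and down-to-down latencies from timestamped keydown/keyup events; the event format and function name are illustrative assumptions, not the thesis' actual logging format.

```python
def keystroke_timing_features(events):
    """Compute basic keystroke timing features from a list of
    (timestamp_ms, key, kind) events, kind being 'down' or 'up'.
    Event format and names are illustrative assumptions."""
    pending = {}     # key -> last unmatched keydown timestamp
    durations = []   # down-to-up duration per keypress
    down_times = []  # keydown timestamps, for latencies
    for t, key, kind in events:
        if kind == "down":
            pending[key] = t
            down_times.append(t)
        elif kind == "up" and key in pending:
            durations.append(t - pending.pop(key))
    # down-to-down latencies between consecutive keydowns
    dd_latencies = [b - a for a, b in zip(down_times, down_times[1:])]
    return durations, dd_latencies

events = [(0, "h", "down"), (80, "h", "up"),
          (120, "i", "down"), (210, "i", "up")]
durs, lats = keystroke_timing_features(events)
# durs == [80, 90], lats == [120]
```

Up-to-up and up-to-down latencies follow the same pattern using the keyup timestamps.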
D. Text Content Analysis
The actual content of text contains a lot of valuable information concerning affect. A possible starting point for a linguistic analysis of text to extract emotional information is to use specific affective lexicons. Examples of such lexicons are the Affective Norms for English Words (ANEW) [15], SentiWordNet [16] and WordNet Affect [17]. Clore et al. [18] argued that words need to be distinguished based on whether they directly refer to emotional states or contain an indirect reference that depends on the context. A more abstract analysis using specific textual features is also possible. Vizer et al. [19] used a number of features defined by Zhou et al. [20]. It is also possible to calculate features based on annotated datasets and to use these in a machine learning algorithm.
E. Mouse Behaviour
Besides text content and keystroke dynamics, mouse behaviour may also contain useful information on an individual's emotional state. One can distinguish single clicks and double clicks, using the left, right or possibly middle button. Furthermore, one can observe mouse movements and mouse wheel movements. From the mouse movement data, distance, angle and speed features between pairs of data points can be extracted.
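As a minimal sketch (the point format is an assumption, not the thesis' logging format), the pairwise distance, angle and speed features can be derived from timestamped cursor positions like this:

```python
import math

def mouse_features(points):
    """Distance, angle (radians) and speed between consecutive
    (timestamp_ms, x, y) samples; format is an illustrative assumption."""
    feats = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        dist = math.hypot(dx, dy)
        angle = math.atan2(dy, dx)
        speed = dist / (t1 - t0) if t1 > t0 else 0.0  # pixels per ms
        feats.append((dist, angle, speed))
    return feats

pts = [(0, 0, 0), (10, 3, 4), (30, 3, 4)]
f = mouse_features(pts)
# first pair: distance 5.0 px, speed 0.5 px/ms
```

The average mouse speed used later in this work is then simply the mean of the speed component over all pairs.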
F. Emotion Recognition
There have been numerous studies investigating the possible applications of keyboard and mouse behaviour in authentication and security [21–26], but for this work the applications regarding emotion recognition are more interesting.
Zimmerman et al. [27] described a method to correlate keyboard and mouse interaction with affective states using a MIP and the PA emotion model. Later, Vizer et al. [19] proposed a new way of assessing cognitive and physical stress by analyzing keystroke dynamics using both content and timing features. They achieved correct classification rates of 62.5% for physical stress and 75% for cognitive stress. Epp et al. [28] focused more on actual emotions and investigated the possibility of identifying different emotional states using keystroke analysis. They focused on gathering keystroke data in a natural context using the experience sampling methodology rather than a laboratory environment. They achieved reliable accuracy rates for fixed text ranging from 77.4% to 87.8% for confidence, hesitance, nervousness, relaxation, sadness and tiredness. Tsoulouhas et al. [29] introduced a method to detect student boredom during attendance of an online lesson in a laboratory setting and achieved a correct classification rate for fixed text above 90%. Besides analyzing keystroke and mouse features, it is also possible to use other additional sensors that are present in today's smartphones. LiKamWa et al. [30] performed an experimental study, which they extended in [31], to classify different moods using this kind of data. Continuing in the area of smartphones, Lee et al. [32] investigated the possibilities of automatically recognizing emotions in social network service posts. They achieved an average classification accuracy of 67.5%. Later, Tsui et al. [33] validated the hypothesis about the existence of differences in typing patterns between emotional states using the facial feedback hypothesis. Nahin et al. [34] focused on textual content as well as keystroke dynamics. They used WordNet and the ISEAR dataset in combination with a vector space model. Hernandez et al. [35] presented a method to evaluate a user's boredom and frustration in an intelligent learning environment but focused on free text, in contrast to [29]. Recently, Lee et al. conducted another study [36] that examined the variance in keystroke typing patterns caused by emotions, using visual stimuli to induce emotional states. They concluded that the effect of emotion is significant but small compared to the individual variability. Along with the increasing popularity of fuzzy logic, Shukla et al. [37] and Bakhtiyari et al. [38–40] experimented with fuzzy models. Finally, a number of meta-studies were conducted by Kolakowska [41], [42] and a number of interesting possible applications were presented in [43].
The best aspects of these studies are combined in order to present a methodology for this research.
III. METHODOLOGY
A. Data Collection
The first step in this work is the collection of data. A field study was performed in which participants' keystroke, mouse, location and weather data were gathered together with subjective indications of emotional states in the PAD emotion model. The study used an experience sampling method. The participants were periodically requested to indicate their emotional state while other data was continuously collected in the background by custom-built software that was installed on the participants' computers.
The field study was conducted from November 15th, 2015 until May 1st, 2016, with 14 participants contributing data for, on average, 22 weeks. There were no restrictions on their activities during the study. Upon signup, each participant provided their name, e-mail address, gender, birthdate, place of birth, occupation, education, nationality, first language, most used language on the computer, dominant hand, typing skills, computer skills, percentage of total computer time spent on the concerned computer, keyboard layout, mouse type and computer type. All participants used the Windows operating system (Windows 7 or higher).
B. Data Processing
The second step, before being able to build models, was the processing of the data to extract features.
1) Keystroke Features: The features that were extracted from the keydown and keyup events consist of timing features and frequency features.
To be able to calculate keystroke timing features, corresponding keydown and keyup events needed to be matched. Some special cases needed to be considered as well (e.g. multiple keydown events corresponding with one keyup event). The average typing speed and the mean, weighted mean, maximum, minimum, standard deviation, variance, mode, median, skew and kurtosis of the down-to-down latency, up-to-up latency, up-to-down latency and down-to-up duration were extracted. This results in 41 features. An outlier removal process was also applied to the latencies and durations using the interquartile range of the values, in order to deal with long periods of inactivity.
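The interquartile-range filter mentioned above can be sketched as follows. The 1.5×IQR fence is the common convention; the thesis does not state its exact multiplier, so treat it as an assumption.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Drop latencies/durations outside [Q1 - k*IQR, Q3 + k*IQR].
    k=1.5 is the conventional fence, assumed here."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# a long pause (60000 ms) among normal latencies is removed
lats = [120, 130, 125, 140, 118, 60000]
clean = remove_outliers_iqr(lats)
```

This discards the extreme inactivity gap while keeping all typical inter-key latencies.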
The frequency features that were extracted were the numerical character frequency, alphabetical character frequency, delete frequency, backspace frequency, error frequency (backspace or delete), shift frequency, space frequency, arrow frequency, caps lock frequency, return frequency, punctuation frequency, average word length and long pause frequency.
2) Textual Content Features: Before extracting textual content features, the pieces of text contained in each sample needed to be reconstructed from the keydown events. Next, the text pieces were converted to a matrix of token counts and this matrix was then transformed to a normalized term frequency representation. The number of features that are produced by this technique depends on the collection of text pieces that is used as input.
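The thesis does not name its implementation; scikit-learn's `CountVectorizer` and `TfidfTransformer` are one plausible realization of the counts-then-normalized-term-frequency pipeline, assumed here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["feeling great today", "great work today", "so tired"]

# token counts per reconstructed text piece
counts = CountVectorizer().fit_transform(texts)

# normalized term frequency (no IDF weighting, L2-normalized rows)
tf = TfidfTransformer(use_idf=False, norm="l2").fit_transform(counts)

# one row per text piece; the column count depends on the vocabulary
print(tf.shape)
```

Note how the feature dimensionality is determined by the vocabulary of the input collection, which is why the number of textual features varies with the dataset.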
3) Mouse Movement Features and Contextual Features: One mouse movement feature was extracted: the average mouse speed. The contextual features that were extracted are the temperature, humidity, pressure and the discomfort index. However, because some of the participants disabled the location services on their computer, not all samples contain weather data, as this depends on the location information.
C. Model Building
In this work, different types of machine learning models were built. A first subdivision is made based on whether they are regression or classification models. A second subdivision is made based on whether they are built using data of all participants or of individual participants. Finally, a third subdivision is made based on whether the models are built using keystroke dynamics, mouse dynamics and contextual data, or using textual content data.
1) Regression: For the general regression models, three degrees of freedom were used. The first is whether contextual data, i.e. weather data, is used. This yields different results as some participants did not have their location services enabled, which caused their data not to contain weather information. As the models that were used in this research are not capable of dealing with missing data, all samples that do not contain the weather information needed to be removed. The second degree of freedom is whether a model was built for each separate dimension of the PAD model or whether one model was built to predict the entire PAD-tuple (joint). The third degree of freedom is whether feature selection was performed or not. To perform feature selection, a random forest model was built using all features and then the 20 most important features were extracted. Afterwards, the actual regression (or classification) model was built using only these features. In total, 8 different random forest models for general dynamics regression were built and evaluated. Individual dynamics regression models were built for 8 participants, as only those participants who provided at least 59 samples (such that the number of samples was at least equal to the number of possible features) were included. The same degrees of freedom were used to build multiple variants of the individual models.
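The feature selection step described above can be sketched with scikit-learn (an assumed implementation; the thesis does not name its library, and the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 41))  # e.g. the 41 keystroke timing features
y = X[:, 0] * 3 + X[:, 1] + rng.normal(scale=0.1, size=200)

# rank features by random forest importance, keep the top 20
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
top20 = np.argsort(rf.feature_importances_)[::-1][:20]
X_selected = X[:, top20]

# the actual regression/classification model is then trained on
# X_selected only
```

The 59-sample threshold mentioned above matches this setup: with up to 59 candidate features, at least as many samples as features are required before selection.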
To build general text content regression models, the text content feature data of all participants was used to train an SVM for regression. There were no degrees of freedom in these models: the SVM implementation that was used is not capable of predicting the entire PAD-tuple, hence different models for each dimension were built. Furthermore, all textual content features are equally important by definition (no term importances were defined, so each term is equally important), so there can be no degrees of freedom there either. Individual models were built for 13 participants.

TABLE I
PAD MAPPING ACCORDING TO OCTANTS IN PAD SPACE

PAD octant   Emotion
P-A-D-       Bored
P-A-D+       Disdainful
P-A+D-       Anxious
P-A+D+       Hostile
P+A-D-       Docile
P+A-D+       Relaxed
P+A+D-       Dependent
P+A+D+       Exuberant
Predicted PAD axis values needed to be interpreted as an emotional state, or better yet, a weighted combination of multiple emotional states. This requires mapping discrete emotional states onto the PAD space. Different mappings have been proposed and empirical approaches to building such mappings have been taken [44]. If we assume that a person experiences a weighted combination of multiple emotions, fuzzy logic can be used to determine the extent to which each emotion is present, given PAD axis values. This can be done by defining a set of fuzzy rules that describe the relationship between the PAD axis values and each emotion. For example, if the mapping presented in Table I is used, fuzzy rules can be defined of the form: IF pleasure IS positive AND arousal IS negative AND dominance IS positive THEN relaxed IS present. The fuzzy terms "negative", "positive" and "present" can then be defined by membership functions that take values between 0 and 1. The IS-operator calculates the function value for the variable on its left-hand side using the membership function for the fuzzy term on its right-hand side. The AND-operator can be implemented using a so-called t-norm, and for combining multiple rules concerning the same fuzzy terms and for defuzzification there are also multiple possible approaches [45]. Applying such fuzzy rules to the predicted PAD axis values results in a set of membership values for each emotion that can be interpreted as the extent to which each of these emotions is present in the current emotional state of a person. Using fuzzy logic has the advantage of reducing the importance of the accuracy of predicted PAD axis values, as less accurate values can still result in a correct conclusion concerning the dominant emotion, because fuzzy logic has the capacity to take into account the inherent fuzziness of the emotion information. Furthermore, it is also possible to draw conclusions about which emotions are likely to occur at the same time.
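A minimal sketch of such a rule, assuming simple linear membership functions on a 0–100 PAD scale and the product t-norm (both are illustrative choices, not fixed by the rule form above):

```python
def positive(v):
    """Membership in 'positive' on a 0-100 axis (linear, assumed)."""
    return max(0.0, min(1.0, (v - 50) / 50))

def negative(v):
    """Membership in 'negative' on a 0-100 axis (linear, assumed)."""
    return max(0.0, min(1.0, (50 - v) / 50))

def relaxed(p, a, d):
    """IF pleasure IS positive AND arousal IS negative AND
    dominance IS positive THEN relaxed IS present,
    with AND implemented as the product t-norm."""
    return positive(p) * negative(a) * positive(d)

# high pleasure, low arousal, high dominance -> strongly relaxed
print(relaxed(90, 10, 80))  # 0.384
```

Evaluating one such rule per octant of Table I yields the membership vector over all eight emotions.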
2) Classification: To be able to perform classification, the PAD emotion model first needs to be divided into a number of different classes, i.e. discretized. Three different approaches were used to divide the PAD emotion model into classes. The first approach uses the k-means clustering algorithm to assign all samples that belong to the same cluster to the same class, so that k classes are obtained. The value of k was chosen to be 8 to divide the entire PAD space into different classes. This choice is based on the fact that the PAD space is three-dimensional and thus contains 8 octants. If only one PAD dimension needed to be divided into classes, the value of k was chosen to be 2. The second approach splits each PAD dimension into two parts (a negative and a positive part, respectively having values between 0 and 50 and between 50 and 100). Using this approach, two classes per dimension or 2³ = 8 classes for the entire PAD space are obtained. The third approach is similar to the second one but splits each PAD dimension into three parts (a negative, a neutral and a positive part, respectively having values between 0 and 40, between 40 and 60 and between 60 and 100). This yields three classes per dimension or 3³ = 27 classes for the entire PAD space. The goal of the classification models is then to predict the correct class for each sample.
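The positive/negative discretization can be sketched as follows; the octant-to-integer bit encoding is an illustrative choice, not specified in the text.

```python
def pad_octant(p, a, d, threshold=50):
    """Map a (pleasure, arousal, dominance) tuple on 0-100 scales
    to one of 2**3 = 8 classes by splitting each axis at 50.
    The bit encoding (P=4, A=2, D=1) is an assumption for illustration."""
    return ((p >= threshold) << 2) | ((a >= threshold) << 1) | (d >= threshold)

# P+A-D+, i.e. "relaxed" in Table I's mapping
print(pad_octant(80, 20, 70))  # 5
```

The three-way split works the same way with two thresholds (40 and 60) and base-3 digits per dimension.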
For the general dynamics classification models, six degrees of freedom were used. The first is whether contextual data, i.e. weather data, is used, as described above. The second degree of freedom is again whether a model was built for each separate dimension of the PAD model or whether one model was built to predict the entire PAD-tuple. The third degree of freedom is whether the class system is based on the k-means clusters, on positive/negative scales or on positive/neutral/negative scales. The fourth degree of freedom is whether no class weight balancing, general class weight balancing or subsample class weight balancing is performed. General class weight balancing associates weights with classes that are inversely proportional to the class frequencies in the input data. Subsample class weight balancing essentially does the same thing but adjusts the class weights according to the class frequencies in the bootstrap sample for each tree grown. The fifth degree of freedom is whether or not subsampling is performed in order to reduce bias due to class skew. Subsampling removes random samples from each class that contains more samples than the least frequent class, until every class contains an equal number of samples. The sixth degree of freedom is whether feature selection was performed or not. The feature selection process is the same as for the regression models. In total, 96 different random forest models for general dynamics classification were built and evaluated. For the individual dynamics classification models the same degrees of freedom as for the general dynamics classification models were used to build multiple variants of the models. They were built for the same 8 participants as in the individual dynamics regression models.
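The subsampling step can be sketched like so (a minimal version; the helper name and use of hashable class labels are assumptions for illustration):

```python
import random
from collections import defaultdict

def subsample_balance(samples, labels, seed=0):
    """Randomly drop samples from over-represented classes until all
    classes have as many samples as the least frequent one."""
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    n_min = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        for s in rng.sample(group, n_min):
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

X = list(range(10))
y = [0] * 7 + [1] * 3  # class 0 is over-represented
Xb, yb = subsample_balance(X, y)
# both classes now contribute 3 samples each
```

The trade-off named in the text applies here as well: subsampling reduces class skew at the cost of discarding data.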
To build general text content classification models, the text content feature data of all participants is used to train an SVM for classification. There are four degrees of freedom in these models. The first degree of freedom is again whether a model was built for each separate dimension of the PAD model or whether one model was built to predict the entire PAD-tuple. The second degree of freedom is the type of class system that is used. The third degree of freedom is the type of class weight balancing that is used. The fourth degree of freedom is whether or not subsampling is performed. In total, 18 different SVM models for general text classification were built and evaluated. For the individual text content classification models the same degrees of freedom as for the general text content classification models were used to build multiple variants of the models. They were built for the same 13 participants as in the individual text content regression models.
IV. RESULTS
The regression models were not satisfying: they did not perform much better than always predicting the mean, and their results will therefore not be presented here.
The highest ROC AUC score (area under the receiver operating characteristic curve) for general dynamics classification models is obtained with the combination of not including weather data, using the joint PAD space divided into classes using k-means clustering with k = 8, not using class weight balancing, using subsampling and using feature selection. This model was built using 704 labeled samples and achieves a ROC AUC score of 0.75. The corresponding confusion matrix is presented in Table II. However, all models built on separate PAD dimensions using classes determined by the k-means clusters (with k = 2) or by the positive/negative class definition, no class weight balancing and subsampling have similar ROC AUC scores to this model but also have much higher precision, recall and F1 scores (≈ 0.73). Hence, the latter models actually perform better. It was observed that the same types of models also perform best for individual dynamics classification, general text content classification and individual text content classification when considering precision, recall, F1 and ROC AUC scores. The confusion matrix of the best general text content classification model is presented in Table III.
V. DISCUSSION
A. Overall Results
The poor results for the regression models do not mean that they cannot be used for this application. The most plausible explanation for the poor performance is the limited size and bias of the dataset; a larger dataset containing more uniformly distributed samples may well yield better regression models. The regression models built using the text content features did not perform well either, which can again be explained by the limited size of the dataset: the number of text content features extracted is on average much higher than the number of available samples, resulting in a very sparse feature space.
The classification models presented much better results and a clear pattern of a family of models that perform very well. The fact that the models using k-means clustering to generate classes perform very similarly to the models using the classes generated according to the positive/negative class definition can be explained by the value distributions for each PAD axis. These distributions contain two clusters of values for each PAD axis, which are very likely the same clusters that the k-means clustering algorithm finds. Indeed, inspecting the classes generated by the k-means algorithm (with k = 2) confirms this presumption. Presumably, the presence of these two obvious clusters is caused by the tendency of the participants to always move the sliders while indicating their emotional state, so that each slider ends up either left or right of the center but never in the center. These two clusters align very closely with the positive/negative class definition, which means that the two ways of defining classes yield very similar classes, and therefore very similar results. One of the main reasons for the better performance of the classification models can probably be attributed to the subsampling technique, which solves the class skew problem; this was one of the main limitations for the regression models. A similarly intelligent technique could perhaps be applied to the continuous dataset to improve the regression models as well.

TABLE II
CONFUSION MATRIX FOR GENERAL DYNAMICS CLASSIFICATION MODEL WITH BEST ROC AUC SCORE

          1       2       3       4       5       6       7       8
  1   55.68%   5.68%   5.68%   7.92%   5.68%  14.80%   3.44%   1.12%
  2    7.92%  54.56%   4.56%   2.24%  10.24%   3.44%   6.80%  10.24%
  3    7.92%   7.92%  40.88%   5.68%  13.60%   3.44%   7.92%  12.48%
  4    4.56%   7.92%   1.12%  65.92%   6.80%   2.24%   5.68%   5.68%
  5    6.80%   5.68%   6.80%   3.44%  62.48%  10.24%   4.56%   0.00%
  6    3.44%   1.12%   3.44%   2.24%   6.80%  77.28%   2.24%   3.44%
  7    6.80%   9.12%   4.56%   7.92%   4.56%   6.80%  51.12%   9.12%
  8    5.68%   6.80%  12.48%   4.56%   3.44%   2.24%  12.48%  52.24%

TABLE III
CONFUSION MATRIX FOR BEST GENERAL TEXT CONTENT CLASSIFICATION MODEL

          1       2       3       4       5       6       7       8
  1   50.00%  17.04%   1.12%   9.12%  10.24%   4.56%   4.56%   3.44%
  2    4.72%  51.12%   2.24%   4.56%  13.60%   6.80%  11.36%   5.68%
  3    5.68%  11.36%  55.68%   2.24%   6.80%   2.24%  10.24%   5.68%
  4    7.92%  15.92%  11.36%  34.08%   5.68%   1.12%  13.60%  10.24%
  5   10.24%  21.60%  13.60%   2.24%  34.08%   5.68%   5.68%   6.80%
  6    5.68%  12.48%   4.56%   4.56%  13.60%  45.44%   7.92%   5.68%
  7    3.44%  18.16%   3.44%   9.12%   6.80%   4.56%  52.24%   2.24%
  8    9.12%  22.72%   6.80%  10.24%  17.04%   3.44%   3.44%  27.28%
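The observation that 2-means clustering on a slider axis reproduces the positive/negative split can be illustrated with a small sketch (synthetic bimodal slider values, not the study's dataset; scikit-learn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Bimodal slider values: participants push the slider left or right of center (0).
values = np.concatenate([rng.normal(-0.6, 0.15, 200),
                         rng.normal(0.6, 0.15, 200)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values.reshape(-1, 1))
kmeans_labels = km.labels_
sign_labels = (values > 0).astype(int)

# The two labelings should agree up to an arbitrary cluster numbering.
agreement = max(np.mean(kmeans_labels == sign_labels),
                np.mean(kmeans_labels != sign_labels))
print(round(float(agreement), 2))
```

With two well-separated modes, the agreement is (near-)perfect, mirroring the behaviour described above.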
B. Contextual Data and Feature Selection
Features concerning weather information (temperature, pressure, humidity and discomfort index) do not appear to make much difference to model performance. For regression models, however, there is a pattern indicating that weather features can be useful. One of the main problems is that including weather data causes many samples to be removed, which in turn decreases model performance due to the limited dataset. At first sight, one might conclude that weather data provides no discriminative information to the model. In classification tasks, however, this effect is much smaller and almost non-existent, indicating that weather features might provide useful data. Thus, due to the limited size of the dataset, the effects of using contextual data are obfuscated.
Finally, using feature selection does not have a significant impact on the results. This is consistent with the feature importances for the best general dynamics regression and classification models as provided by the Random Forest algorithm.
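Inspecting Random Forest feature importances, as done for the models above, can be sketched as follows (illustrative only, with synthetic data; scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: a few informative features among many weak ones.
X, y = make_classification(n_samples=300, n_features=15, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; ranking them shows which features a selector would keep.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:5], rf.feature_importances_.sum())
```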
VI. CONCLUSION
First, the pleasure-arousal-dominance (PAD) emotion model was used, providing a more discriminative way of defining emotions; this emotion model also allows regression techniques to be applied to the emotion recognition task. Second, as opposed to many other studies that use fixed text, free text data was used in this research; the potential for real-world application is much greater when free text is used. Third, models were built on a combination of dynamics features (both keystroke and mouse) and contextual features (i.e. weather data) and compared to investigate the added value of contextual features. Fourth, both general and individual models were built to investigate the possible advantages of using more user-specific models to make accurate predictions. Fifth, a methodology was presented to use regression models in more practical settings using fuzzy logic. Sixth, well-performing classification models were created that predict two class levels on each separate dimension of the PAD model based on dynamics features. Finally, an easy-to-use dataset with a hierarchical structure was constructed and prepared for public use. One of the main problems in this research area is that most studies use different datasets, metrics, features and/or emotion models, which makes comparing findings and results difficult. By allowing our dataset to be used in other studies, the problem of differing datasets and emotion models is addressed.
One of the limitations of this study was the reliance on the participants for establishing the ground truth. To overcome this, other techniques may be used to identify the participant's emotional state. The usage of textual content could also be improved by detecting emotionally charged words, text normalization, stemming techniques and language detection. Finally, clustered models could be built. This is based on the assumption that, although every person is unique, there may exist a finite number of clusters, or user profiles, comprising people who show similar computer behaviour for each emotional state. One model could then be built for each profile.
This study presents classification models that are suitable for real-world application, but it also raises ethical concerns regarding privacy, as the monitoring process used goes unnoticed by the user.
REFERENCES
[1] J. P. Forgas, “Mood and judgment: the affect infusion model (aim).”Psychological bulletin, vol. 117, no. 1, p. 39, 1995.
[2] P. Ekman and W. V. Friesen, “Constants across cultures in the face andemotion.” Journal of personality and social psychology, vol. 17, no. 2,p. 124, 1971.
[3] P. Ekman, “An argument for basic emotions,” Cognition & emotion,vol. 6, no. 3-4, pp. 169–200, 1992.
[4] P. N. Johnson-Laird and K. Oatley, “The language of emotions: Ananalysis of a semantic field,” Cognition and emotion, vol. 3, no. 2, pp.81–123, 1989.
[5] J. A. Russell, “A circumplex model of affect.” Journal of personalityand social psychology, vol. 39, no. 6, p. 1161, 1980.
[6] H. Gunes, B. Schuller, M. Pantic, and R. Cowie, “Emotion representa-tion, analysis and synthesis in continuous space: A survey,” in AutomaticFace & Gesture Recognition and Workshops (FG 2011), 2011 IEEEInternational Conference on. IEEE, 2011, pp. 827–834.
[7] D. Watson, L. A. Clark, and A. Tellegen, “Development and validation ofbrief measures of positive and negative affect: the panas scales.” Journalof personality and social psychology, vol. 54, no. 6, p. 1063, 1988.
[8] A. Mehrabian, “Basic dimensions for a general psychological theoryimplications for personality, social, environmental, and developmentalstudies,” 1980.
[9] C. Kaernbach, “On dimensions in emotion psychology,” in AutomaticFace & Gesture Recognition and Workshops (FG 2011), 2011 IEEEInternational Conference on. IEEE, 2011, pp. 792–796.
[10] P. J. Lang, M. K. Greenwald, M. M. Bradley, and A. O. Hamm, “Lookingat pictures: Affective, facial, visceral, and behavioral reactions,” Psy-chophysiology, vol. 30, pp. 261–261, 1993.
[11] H. Lovheim, “A new three-dimensional model for emotions andmonoamine neurotransmitters,” Medical hypotheses, vol. 78, no. 2, pp.341–348, 2012.
[12] R. Westermann, K. Spies, G. Stahl, and F. W. Hesse, “Relative effec-tiveness and validity of mood induction procedures: a meta-analysis,”European Journal of Social Psychology, vol. 26, no. 4, pp. 557–580,1996.
[13] M. Martin, “On the induction of mood,” Clinical Psychology Review,vol. 10, no. 6, pp. 669–697, 1990.
[14] J. M. Hektner, J. A. Schmidt, and M. Csikszentmihalyi, Experiencesampling method: Measuring the quality of everyday life. Sage, 2007.
[15] M. M. Bradley and P. J. Lang, “Affective norms for english words(anew): Instruction manual and affective ratings,” Technical Report C-1, The Center for Research in Psychophysiology, University of Florida,Tech. Rep., 1999.
[16] A. Esuli and F. Sebastiani, “Sentiwordnet: A publicly available lexicalresource for opinion mining,” in Proceedings of LREC, vol. 6. Citeseer,2006, pp. 417–422.
[17] C. Strapparava, A. Valitutti et al., “Wordnet affect: an affective extensionof wordnet.” in LREC, vol. 4, 2004, pp. 1083–1086.
[18] G. L. Clore, A. Ortony, and M. A. Foss, “The psychological foundationsof the affective lexicon.” Journal of personality and social psychology,vol. 53, no. 4, p. 751, 1987.
[19] L. M. Vizer, L. Zhou, and A. Sears, “Automated stress detection usingkeystroke and linguistic features: An exploratory study,” InternationalJournal of Human-Computer Studies, vol. 67, no. 10, pp. 870–886, 2009.
[20] L. Zhou, J. K. Burgoon, J. F. Nunamaker, and D. Twitchell, “Automatinglinguistics-based cues for detecting deception in text-based asynchronouscomputer-mediated communications,” Group decision and negotiation,vol. 13, no. 1, pp. 81–106, 2004.
[21] R. S. Gaines, W. Lisowski, S. J. Press, and N. Shapiro, “Authenticationby keystroke timing: Some preliminary results,” DTIC Document, Tech.Rep., 1980.
[22] F. Monrose and A. Rubin, “Authentication via keystroke dynamics,” inProceedings of the 4th ACM conference on Computer and communica-tions security. ACM, 1997, pp. 48–56.
[23] F. Monrose and A. D. Rubin, “Keystroke dynamics as a biometric forauthentication,” Future Generation computer systems, vol. 16, no. 4, pp.351–359, 2000.
[24] M. Pusara and C. E. Brodley, “User re-authentication via mouse move-ments,” in Proceedings of the 2004 ACM workshop on Visualization anddata mining for computer security. ACM, 2004, pp. 1–8.
[25] M. Fairhurst, D. Costa-Abreu et al., “Using keystroke dynamics forgender identification in social network environment,” in Imaging forCrime Detection and Prevention 2011 (ICDP 2011), 4th InternationalConference on. IET, 2011, pp. 1–6.
[26] I. Traore et al., “Biometric recognition based on free-text keystrokedynamics,” Cybernetics, IEEE Transactions on, vol. 44, no. 4, pp. 458–472, 2014.
[27] P. Zimmermann, S. Guttormsen, B. Danuser, and P. Gomez, “Affectivecomputing – a rationale for measuring mood with mouse and keyboard,”International journal of occupational safety and ergonomics, vol. 9,no. 4, pp. 539–551, 2003.
[28] C. Epp, M. Lippold, and R. L. Mandryk, “Identifying emotional statesusing keystroke dynamics,” in Proceedings of the SIGCHI Conferenceon Human Factors in Computing Systems. ACM, 2011, pp. 715–724.
[29] G. Tsoulouhas, D. Georgiou, and A. Karakos, “Detection of learner’saffective state based on mouse movements,” J. Comput, vol. 3, pp. 9–18,2011.
[30] R. LiKamWa, Y. Liu, N. D. Lane, and L. Zhong, “Can your smartphoneinfer your mood,” in PhoneSense workshop, 2011, pp. 1–5.
[31] ——, “Moodscope: Building a mood sensor from smartphone usagepatterns,” in Proceeding of the 11th annual international conference onMobile systems, applications, and services. ACM, 2013, pp. 389–402.
[32] H. Lee, Y. S. Choi, S. Lee, and I. Park, “Towards unobtrusive emotionrecognition for affective social communication,” in Consumer Communi-cations and Networking Conference (CCNC), 2012 IEEE. IEEE, 2012,pp. 260–264.
[33] W.-H. Tsui, P. Lee, and T.-C. Hsiao, “The effect of emotion onkeystroke: an experimental study using facial feedback hypothesis,”in Engineering in Medicine and Biology Society (EMBC), 2013 35thAnnual International Conference of the IEEE. IEEE, 2013, pp. 2870–2873.
[34] A. N. H. Nahin, J. M. Alam, H. Mahmud, and K. Hasan, “Identifyingemotion by keystroke dynamics and text pattern analysis,” Behaviour &Information Technology, vol. 33, no. 9, pp. 987–996, 2014.
[35] A. Hernandez-Aguila, M. Garcia-Valdez, and A. Mancilla, “Affectivestates in software programming: Classification of individuals based ontheir keystroke and mouse dynamics,” Intelligent Learning Environ-ments, p. 27, 2014.
[36] P.-M. Lee, W.-H. Tsui, and T.-C. Hsiao, “The influence of emotion onkeyboard typing: an experimental study using visual stimuli,” Biomedicalengineering online, vol. 13, no. 1, p. 81, 2014.
[37] P. Shukla and R. Solanki, “Web based keystroke dynamics application for identifying emotional state,” International Journal of Advanced Research in Computer Science and Communication Engineering, vol. 2, no. 11, pp. 4489–4493, 2013.
[38] K. Bakhtiyari and H. Husain, “Fuzzy model in human emotions recog-nition,” arXiv preprint arXiv:1407.1474, 2014.
[39] ——, “Fuzzy model of dominance emotions in affective computing,”Neural Computing and Applications, vol. 25, no. 6, pp. 1467–1477,2014.
[40] K. Bakhtiyari, M. Taghavi, and H. Husain, “Implementation ofemotional-aware computer systems using typical input devices,” inIntelligent Information and Database Systems. Springer, 2014, pp.364–374.
[41] A. Kolakowska, “A review of emotion recognition methods based onkeystroke dynamics and mouse movements,” in Human System Interac-tion (HSI), 2013 The 6th International Conference on. IEEE, 2013,pp. 548–555.
[42] ——, “Recognizing emotions on the basis of keystroke dynamics,” inHuman System Interactions (HSI), 2015 8th International Conferenceon. IEEE, 2015, pp. 291–297.
[43] A. Kołakowska, A. Landowska, M. Szwoch, W. Szwoch, and M. Wrobel,“Emotion recognition and its applications,” in Human-Computer SystemsInteraction: Backgrounds and Applications 3. Springer, 2014, pp. 51–62.
[44] H. Hoffmann, A. Scheck, T. Schuster, S. Walter, K. Limbrecht, H. C.Traue, and H. Kessler, “Mapping discrete emotions into the dimensionalspace: An empirical approach,” in Systems, Man, and Cybernetics(SMC), 2012 IEEE International Conference on. IEEE, 2012, pp. 3316–3320.
[45] L. A. Zadeh, “Soft computing and fuzzy logic,” IEEE software, vol. 11,no. 6, p. 48, 1994.
Contents
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Solution Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Parts of the Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Study Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Feature Extraction and Selection . . . . . . . . . . . . . . . . . . . . . 4
1.3.4 Model Building and Validation . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Work 6
2.1 Emotions Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Affect, Mood and Emotions . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Emotion Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Emotional Experimentation Environment . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Laboratory Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Naturalistic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.5 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.6 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.7 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.8 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.9 Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Keystroke Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Fixed Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Free Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.3 Keystroke Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Text Content Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Mouse Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Authentication and Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Emotion Recognition in Computer Systems . . . . . . . . . . . . . . . . . . . 25
3 Data Collection 32
3.1 Field Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.3 Privacy Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.4 Meantime Study Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.5 Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Participant Demographics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Installation, Automatic Updates and Heartbeat . . . . . . . . . . . . . 37
3.3.2 Capturing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.4 Server and Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Data Processing 48
4.1 Data Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Keystroke Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Textual Content Features . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.3 Mouse Movement Features . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.4 Contextual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5 Model Building 56
5.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.1 General Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.2 Individual Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . 58
5.1.3 General Text Content Models . . . . . . . . . . . . . . . . . . . . . . . 58
5.1.4 Individual Text Content Models . . . . . . . . . . . . . . . . . . . . . 59
5.1.5 Fuzzy Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1 General Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.2 Individual Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.3 General Text Content Models . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.4 Individual Text Content Models . . . . . . . . . . . . . . . . . . . . . 70
6 Discussion 74
6.1 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Contextual Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Prediction Performance for PAD-dimensions . . . . . . . . . . . . . . . . . . . 78
6.6 Number of Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.7 Model Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.8 Clustered Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.9 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7 Conclusion 84
7.1 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3 Potential for Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A Concept Matrix of Related Work 94
B Participant Registration Form 98
C Participant Consent Form 99
D Software Download Page 100
E Model Hyperparameters 101
E.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
E.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Abbreviations
k-NN k-Nearest Neighbors. 12, 16, 24, 28, 95–97
ANEW Affective Norms for English Words. 22
ANN Artificial Neural Networks. 12, 95–97
ANOVA Analysis Of Variance. 28
API Application Programming Interface. 43
BN Bayesian Network. 12
CART Classification And Regression Tree. 13
CSV Comma-separated Values. 48
CWB Class Weight Balancing. 65, 68, 71, 73
D2D Down-to-down. 50, 51
D2U Down-to-up. 50, 51
DCS-LA Dynamic Classifier Selection using Local Accuracy. 24
DT Decision Tree. 12
EER Equal Error Rate. 24
EM Expectation Maximization. 19, 20
ESM Experience Sampling Methodology. 11
EVS Explained Variance Score. 56–58, 60, 74, 79, 102
FAR False Acceptance Rate. 24
FRR False Rejection Rate. 24
GMM Gaussian Mixture Model. 12
IAPS International Affective Picture System. 28, 97
IP Internet Protocol. 34
IQR Interquartile Range. 51
ISEAR International Survey on Emotion Antecedents and Reactions. 22, 27
JSON JavaScript Object Notation. 46, 81
MAE Mean Absolute Error. 56–60, 74, 78, 79, 102
MIP Mood Induction Procedure. 11
MSE Mean Squared Error. 56–58, 60, 74, 79, 102
NB Naive Bayes. 12
NSIS Nullsoft Scriptable Install System. 47
PA pleasure-arousal. 8, 9, 25, 85, 94, 96, 97
PAD pleasure-arousal-dominance. xiv, 4, 8–10, 33, 36, 43, 44, 56–59, 61–65, 67, 68, 70–73,
75, 76, 78, 84–86, 101
PANAS Positive And Negative Affect Scales. 8, 28, 85, 97
PHP Hypertext Preprocessor. 34, 45
RF Random Forest. 12
ROC AUC Receiver Operating Characteristic Area Under Curve. 63, 64, 67, 68, 70, 74,
78–80, 86, 103
SQL Structured Query Language. 46, 48
SS Subsampling. 65, 68, 71, 73
SVM Support Vector Machine. 4, 12, 16, 30, 56, 58, 70, 95, 97
TF-IDF Term Frequency - Inverse Document Frequency. 23, 53
TLS/SSL Transport Layer Security/Secure Sockets Layer. 34
U2D Up-to-down. 50–52
U2U Up-to-up. 50, 51
XML Extensible Markup Language. 41
Chapter 1
Introduction
1.1 Problem Definition
Nowadays, we live in an era in which one cannot imagine the absence of computers. They
support us daily in all kinds of branches of society. Recently, a lot of progress
has been made in the area of artificially intelligent systems and all sorts of applications of
machine learning. While the advantages of these techniques in everyday systems are countless
and undeniable, they are limited by their lack of understanding and incorporation of human
emotional context in different processes.
If a computer system would possess some form of emotional intelligence, it would have a
much broader view on contextual aspects during decision making. This context can be used
to dynamically adapt applications in order to enhance productivity, effectiveness and user-
friendliness. For example, an operating system could use emotional state information of its
user to better assist during interaction. Operating system developers are working very hard
to incorporate an artificially intelligent assistant that can be used by users to facilitate their
day-to-day activities. However, as these assistants require a certain amount of time to get to
know a user, there has to be some way to indicate whether or not the assistant is displaying
unwanted behaviour so that the assistant can improve itself over time using this information.
The operating system could also assist during text communication (such as e-mails and instant
messaging) to avoid ambiguity and misunderstandings between correspondents. It could use
both the content and the emotional tone of the message to enhance the structure and content
of the message for emotion expression. Monitoring of the emotional state could also have its
application in health care. For example a system could detect whether a user is close to a
burn-out or is in a depressed state for a long period of time and take appropriate actions.
Furthermore, mission-critical systems could detect when a user is in a fatigued, stressed or
distracted state and avoid catastrophic mistakes.
These are all situations in which a system takes actions as a response to certain emotions.
However, a computer system could also try to infer the cause of certain emotions through
different situation variables and respond to the emotion by adapting these variables. For
example, while playing a video game, when the system detects that the player is bored or
frustrated it can alter the course of the game to make it more challenging or relaxing (much
like feedback loops in control systems). This indicates that it is not only possible for an
emotionally intelligent system to observe emotional states but influence them as well. This
is an important thought as it means that such a system considerably approaches human
emotional capabilities. Hence, it cannot be ruled out that it is possible for
a computer system to learn and simulate emotions of its own in response to observation of
human emotions. This is studied in affective computing [53]. It is thus clear that emotion
awareness is a crucial component of any system that can justifiably be called truly artificially
intelligent. Thinking of such systems and machines, one can ask oneself a rather philosophical
question: What separates humans from such truly artificially intelligent machines?
To realize any form of emotional intelligence discussed here, different requirements have to
be met. First of all, there needs to be a method that computer systems can use to observe
affective¹ states in users. During the last few years, research regarding such methods has
increased. Unfortunately, the solutions that have been found are often limited by different
factors. A solution is desired to be non-intrusive for the user. This means that the solution
should not invade the personal space of the user or become too noticeably involved in the
person’s life without being invited. The non-intrusiveness of the solution is important because
when a user knows that he is being observed, it is possible that he – whether deliberately or
not – will change his affective state. This is of course undesirable. Furthermore, a minimal
number of required devices is desirable, because real-world applications are unlikely to make
use of various kinds of extra sensor hardware. Often, such specialized equipment can be very
expensive because it is not present in home or office environments by default or because it is
medical equipment. This again limits the real-world application possibilities of the solution.
Besides these requirements, some other aspects of a solution need to be taken into account
to judge its applicability in a real-world environment. For example, it is typically desired for
a solution to improve itself over time. When two humans initially get to know each other,
it will be harder for one person to judge the affective state of the other person compared
to when these two persons have known each other for a longer period of time. The same
principle can be expected in a solution for emotion recognition systems. When a system
initially observes a user, it will only have basic emotion recognition capabilities. But as the
observation advances in time, the system can gradually adapt itself to the user and be able
to increase its recognition accuracy.
¹ The term affection can be defined as the whole of emotions and moods that a person experiences.
1.2 Solution Proposal
Taking the above requirements into account, a system that can automatically infer affec-
tive state by analyzing the user’s typing behaviour and mouse movements as well as some
contextual information can be a very promising solution.
Analysis of keystroke dynamics has been proposed as a solution for improved security in
authentication systems. Here, the goal is to identify a user based on his typing behaviour
so that a compromised password does not imply a compromised system anymore. However,
Monrose and Rubin [51] observed that a user’s typing rhythm changed from time to time.
This was attributed to a changing affective state, implying that the affective state can be
derived from keystroke dynamics.
The use of mouse movement analysis is encouraged by Zimmermann [69], who observed that
affect impacts motor-behaviour of computer users. It has also been used as a way to improve
authentication systems just like keystroke dynamics [54].
Automatic emotion recognition using keystroke dynamics and mouse movements meets several
of the requirements set above, such as non-intrusiveness and limited use of extra
sensor hardware. The logging of keystrokes and mouse movements can be done using a specific
piece of software running in the background on the computer making it almost undetectable
by the average user. Thus, the affective state is minimally influenced by the data collection.
Mouse and keyboard are also standard equipment in normal home and office environments and
are very inexpensive. This gives many possibilities for the application of affective computing
solutions using keystroke dynamics and mouse movements in a real-world environment on a
large scale.
1.3 Parts of the Solution
The development of the proposed recognition system can be split into different parts. First,
a lot of data from different users needs to be collected and labeled with accurate emotional
information. Next, relevant features need to be extracted from the data. All of these features
need to be evaluated and the best features need to be selected to reduce the dimensionality of
the feature space and avoid sparsity of the dataset. Using the selected set of features, models
have to be built and validated to obtain a powerful model that can predict a user’s emotional
state with high accuracy.
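Under the assumption of a scikit-learn-style toolchain (the text does not prescribe one, and the data below is a placeholder), the feature selection and model building stages can be sketched as:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))       # extracted features (placeholder)
y = rng.integers(0, 2, size=200)     # emotion labels (placeholder)

# Feature selection reduces dimensionality before the classifier sees the data.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC()),
])
model.fit(X, y)
print(model.named_steps["select"].get_support().sum())
```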
1.3.1 Study Environment
To obtain a dataset containing keystroke information and mouse movements during different
emotional states, one needs to make sure that participants experience these emotions. Different
techniques exist to induce particular emotional states. When such a state is induced, the
participant can be asked to perform some actions on a computer that require using the mouse
and keyboard. The mouse and keyboard can then be monitored and information can be
collected. This technique is used in many studies, but it is very intrusive. Since the focus
of this research is on developing a system that can be used in a real-world environment, such
intrusiveness needs to be avoided.
Another technique, which is used in this research, is called experience sampling [29]. Using
experience sampling, participants are recorded during their daily activities. The intention is
to record emotions while in the moment instead of retrospectively (at a later time or another
place). To apply this technique, software was installed on the participants’ computers. The
software ran as a background process on the computer and was not noticeable except for a
small tray icon. The participants were free to use their computers as usual and once in a
while the software would request them to fill out a small emotional state questionnaire. This
questionnaire required the participants to indicate their emotional state using the pleasure-
arousal-dominance (PAD) emotion model.
1.3.2 Data Collection
As mentioned before, data collection was done using a piece of custom developed software that
ran as a background process when the computer was in use. The software recorded keystrokes,
mouse movements and location information, regardless of the program that was being used
at that moment. Thus, the data collection process was barely noticeable for the participants.
Once the software determined that enough data was collected, the user was asked to fill out
an emotional questionnaire. All collected data and the answers to the questionnaire were then
submitted to a server. A number of settings and features had to be taken into account in the
software to make sure that the answers to the questionnaire had a good correspondence with
the collected data and to make sure that no data was lost or submitted multiple times.
1.3.3 Feature Extraction and Selection
After the field study was completed, all data was collected from the server and needed to be
processed. Different features were extracted from the raw keystroke, raw mouse and raw
location data. Once a complete set of features was obtained, a selection was made to
reduce the dimensionality and facilitate the machine learning process.
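As a concrete illustration of this step, two classic keystroke features, dwell time (key press to release) and flight time (release to next press), can be computed from timestamped key events. The event format and the values below are hypothetical and only sketch the idea, not the actual log format of this work:

```python
# Sketch of keystroke feature extraction: dwell time (press to release)
# and flight time (release to next press). Event tuples and timestamps
# are invented for illustration.
def keystroke_features(events):
    """events: list of (key, down_ms, up_ms) tuples, in typing order."""
    dwell = [up - down for _, down, up in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    mean = lambda xs: sum(xs) / len(xs)
    return {"mean_dwell": mean(dwell), "mean_flight": mean(flight)}

events = [("h", 0, 95), ("i", 180, 260)]
print(keystroke_features(events))  # {'mean_dwell': 87.5, 'mean_flight': 85.0}
```

Many more statistics (variances, digraph latencies, error-key rates) can be derived in the same way from the raw event stream.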
1.3.4 Model Building and Validation
To build a model that can predict a user’s emotional state with high accuracy, the use of Ran-
dom Forest regression models, Random Forest classification models, SVM regression models
and SVM classification models was examined. Two main types of models were built: general
models and individual models. A general model entails one model that can predict the emo-
tional state regardless of the user. Such a model would be of great value if it can make highly
accurate predictions. An individual model will build one model for each user using only data
from this user. This concept is based on the assumption that every person is unique and thus
will behave uniquely in different emotional states. To evaluate the models that were built,
10-fold cross-validation was used on the dataset.
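The fold construction behind 10-fold cross-validation can be sketched as follows; this is a minimal interleaved splitter for illustration, not the evaluation code actually used in this work:

```python
# Minimal sketch of 10-fold cross-validation index splitting: every sample
# serves as test data exactly once across the ten folds.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(20, k=10))
# every sample is used for testing exactly once across the 10 folds
assert sorted(i for _, test in splits for i in test) == list(range(20))
```

Each model is then trained on the nine training folds, evaluated on the held-out fold, and the ten scores are averaged.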
1.4 Thesis Outline
Chapter 2 presents a summary of related literature that forms the basis for this research.
Topics that will be discussed are research in emotion assessment, emotion recognition tech-
nology and different environments for emotional experimentation. Next, research in keystroke
dynamics and mouse movements is discussed for both emotion recognition and authentication
purposes.
Chapter 3 presents the data collection process in detail. Both the application of the experience
sampling methodology and the software that was developed are discussed in depth. Different
settings and features that have been used in the software are explained.
Chapter 4 discusses the entire flow from raw data processing to feature extraction and feature
selection.
Chapter 5 describes the different models that were built. This includes the different machine
learning techniques that were used and the measures that have been taken to avoid overfitting
and bias. Finally, all results of the analysis are presented.
Chapter 6 evaluates the outcomes of the results from Chapter 5 and draws different conclusions
from these results.
Chapter 7 finalizes this research by giving a summary, assessing the possibilities for real-
world application of this research, discussing the lessons that were learned and identifying
the contributions that were made as well as the possibilities for future work.
Chapter 2
Related Work
In this chapter, the literature that forms the basis for this research is discussed. First, some
common terms in the field of emotion assessment are defined and different techniques that are
used in the research on affect and emotions are explained. Next, some remarks on the use
of keystroke dynamics and mouse movements for authentication and security purposes are
presented. The chapter finishes with a discussion of previous research on keystroke dynamics
and mouse movements and how it is put into the context of emotion recognition research.
2.1 Emotions Theory
2.1.1 Affect, Mood and Emotions
Before discussing techniques to assess emotions, one should have a good understanding of what
is actually meant by the words "affect", "mood" and "emotion". Although defining these terms
seems easy, there is little general agreement on their actual definitions. In this research the
definition given by Forgas [26] will be followed. Affect is used as a general term to refer to
the combination of moods and emotions. Moods have a rather low intensity, a long duration
and little cognitive content. Most of the time the subtlety of moods causes persons not to
realize they are experiencing them until it is brought to their attention.
As opposed to moods, emotions are much more intense, of a shorter duration and more clear
concerning cognitive content. Emotions can be traced back to a particular cause much easier
and the individual is aware of the presence of an emotion.
2.1.2 Emotion Modelling
To be able to analyze the emotional state of an individual, emotions need to be made mea-
surable. A lot of psychological research has been performed which resulted in a number of
different models to explain human emotional behaviour. Five different models that have been
developed will be discussed: categorical, appraisal, dimensional, circuit and componential
models.
Categorical Model
The categorical or discrete model is based on how emotions are described through language.
When explaining emotions, we typically attach specific labels to different emotional experi-
ences. Typical examples are: ”I feel stressed” and ”I feel happy”.
Following this approach, Ekman presented six basic emotions [22]: anger, surprise, happiness,
disgust, sadness and fear. He performed research on emotional expressions of the face and how people in
different cultural environments recognize facial expressions. He found that the six proposed
basic emotions were commonly recognizable in most cultures. The six basic emotions were
later expanded to 15 basic emotions [21]: amusement, anger, contempt, contentment, disgust,
embarrassment, excitement, fear, guilt, pride in achievement, relief, sadness, satisfaction,
sensory pleasure and shame.
Determining which emotions are so-called "basic" emotions is difficult. Emotions can be
primary, but they can also be combinations of primary emotions. Defining the characteristics of
such primary emotions is thus of critical importance in the categorical model. There is no real
consensus in research literature when it comes to these characteristics and as a consequence,
a number of different sets of categories have been proposed as alternatives for the six (or 15)
basic emotions of Ekman [34].
The fact that emotions are described through language immediately poses an important
problem that needs to be taken into account. As language differs in different parts of the world
this means that emotions will also be described using different languages. Furthermore, it is
possible that one language contains words to describe an emotional experience that are absent
in another language and vice versa. This can be due to the fact that a word can describe
a more specific feeling. An emotion that is described by one word in a language could thus
have multiple more specific words in another language. The categorization of feelings and
emotions is thus strongly influenced by language and culture.
The discrete model clearly has a number of limitations but nonetheless it is still widely used
in psychology and affective computing research. The reason for the popularity of this model
is its simplicity. Classifying emotions using a set of labels is much easier than using one of
the models that will be discussed next. Almost all existing datasets contain language-defined
class labels. Although the discrete model is very popular in affective computing research, it
is difficult to compare different research results because very often, a different set of emotion
classes is used.
Appraisal Model
The appraisal model states that emotions are caused by the dynamic evaluation or appraisal
of events, situations and the environment in general [55]. The relationship of an individual
with its environment is assessed against a number of criteria. Emotions are then felt based
on these assessments. This means that emotions are modelled using the underlying cognitive
processes that precede them. For example, a hostile environment will cause a person to
evaluate the situation as being dangerous and therefore the person will experience an anxious
feeling based on this assessment. The appraisal model accounts for several phenomena that
cause problems for other models. For example, it can account for individual variances in
emotional reaction to the same situation. While the appraisal model can indeed account
for several phenomena, there also exist some issues. For example, it may be possible that
appraisal not only causes emotions but that emotions also cause appraisal. Other research [33]
indicates that emotions may also be caused by processes other than appraisals implying that
appraisals are not necessary causes of emotions.
Dimensional Model
Another very popular model is the dimensional model (also called the circumplex model),
presented by Russell [56]. This model defines emotional states using points in a continuous
dimensional space. It was suggested that continuous space models perform better in out-
of-lab application than discrete models [28]. The used dimensional space can be either uni-
dimensional or multi-dimensional. The PANAS (Positive And Negative Affect Scales) model
is a popular uni-dimensional model [65]. One clear disadvantage of the PANAS model is that
it represents a mixture of emotions, moods and affect.
The PAD (pleasure-arousal-dominance) model is a three-dimensional model that was devel-
oped by Mehrabian and Russell [48] and will also be used in this research. The pleasure
dimension indicates the valence measure of an emotional state. The arousal dimension indi-
cates the level of affective activation of an emotional state. The dominance dimension is used
to indicate the amount of power or control a person experiences in an emotional state. Often,
the PA model is used which is a simpler version of the PAD model leaving out the dominance
dimension. However, this simpler PA model has been criticized [35] because it might not
be possible to fully differentiate between several emotions. Furthermore, it was found after
analyzing data from Bradley & Lang’s experiment [40] that emotional data scattered like a
V-shape and showed some clear holes in the PA space.
Figure 2.1: PAD emotion model
Figure 2.2: PA emotion model
Recently, Lovheim [46] proposed a new three-dimensional emotion model that uses the
monoamine neurotransmitters serotonin, dopamine and noradrenaline as dimensions instead of
pleasure, arousal and dominance. Serotonin is closely related to obsession and compulsion,
noradrenaline is related to alertness and concentration and dopamine is related to motivation.
This way, these neurotransmitters are closely related to human emotion and measuring them
and plotting them into this model could yield an almost direct representation of human emo-
tions.
Figure 2.3: Lovheim emotion model [8]
While the dominance dimension in the PAD model may not correspond to an actual
physiological system, it does provide the possibility to differentiate among emotions that have
similar pleasure and arousal values (e.g. anger and fear) [19].
Circuit Model
LeDoux [41] proposed an alternative emotion model that states that emotions are caused by
different neural circuits in the brain. These circuits are determined by evolution. Neuropsy-
chologists have found several so-called survival circuits that cause primitive emotions such as
fear. This way, the circuit model can be related to the categorical model when the activated
circuits are used as labels for the corresponding emotions. The circuit model is less known
and less used than the other models outlined above. It is still limited to explain only primitive
emotions. At the same time however, this model has promising properties because unlike the
previously discussed models, this model is based on objective observations. Activation of
neural circuits can be monitored using biomedical imaging techniques.
2.2 Emotional Experimentation Environment
Different approaches can be chosen to collect the emotion data that is needed. An important
choice is the environment in which the experiment is organized. Two main approaches are
either a laboratory setting or a more naturalistic setting. In this section, both approaches
and their advantages are discussed in detail.
2.2.1 Laboratory Setting
In a laboratory setting, moods are induced in the participants and then the desired data is
collected and studied. To induce moods in a participant, one needs a mood induction proce-
dure (MIP). This is an experimental technique to establish a particular mood in a subject.
Westermann et al. [66] listed nine categories of MIPs: imagination, Velten, film/story, music,
feedback, social interaction, gift, facial expression and combined MIPs. The film/story
technique and the facial expression technique are discussed next in more detail.
The film/story MIP presents some narrative or descriptive material to the participants to
stimulate their imagination. The participants may identify themselves with certain protago-
nists. The material can be either an elaborate story, a short scene from a film or a description
of scenarios. The material is explicitly selected according to the desired mood to be induced.
Furthermore, this MIP is employed either with or without explicit instruction. When using
explicit instructions, the participant is asked to imagine how it feels to be involved in the
presented situation. This type of MIP is one of the most effective techniques for inducing
both positive and negative moods into participants.
The facial expression MIP is based on the facial feedback hypothesis, proposed by Laird [39].
The expression of the participant’s face is manipulated in order to induce a certain mood.
The participants are instructed on how to contract and relax different facial muscles in order
to produce a frown, a smile, etc. Very often, extra material like a pen is used to enforce the
muscle positioning. The facial expression MIP was found to have a success rate of 50%.
Using MIPs has the advantage of being able to control the moods of the participants. However,
experiments take a considerably long time depending on the technique that is used and it is
not always guaranteed that the induction of the desired mood has succeeded. Furthermore,
individuals may react differently in a laboratory setting compared to a more naturalistic
setting because the experimental situation may influence the individual. Participants may
guess what type of mood the experiment is designed to elicit and adjust their reaction towards
that mood to please the experimenter [47]. This may cause the results obtained by these
experiments not to be useful in real-world environments.
2.2.2 Naturalistic Setting
A naturalistic experiment aims to observe participants in their natural environment so as to
have minimal influence on their behaviour. A self-report recall survey is one of the possible
approaches that can be taken. This requires the participants to record their experiences after
they have occurred. However, participants very often suffer from recall issues, which results
in possibly inaccurate recordings.
Another approach is the experience sampling methodology (ESM) [29]. In this technique a
participant’s experiences are recorded as they occur at certain moments in time. This is often
done using some kind of notification that requests the participant to provide responses to
questionnaires at these moments. The experience sampling methodology has the advantage
that it can capture daily life from moment to moment without the problem of recall issues.
It is much easier for participants to indicate their experiences at the moment that they are
experiencing them. However, a high sample frequency causes the participant to be requested
to provide responses very often and could be burdensome and lead to selective non-compliance.
A main disadvantage of using a naturalistic setting is that all techniques provide the
experimenter with subjective information. This could lead to biased information: individuals
may repress certain information or change their responses to fit the norms of their culture.
Despite these disadvantages, the experience sampling methodology will be used in this research
because the obtained results are much more useful for real-world application; a laboratory
setting is therefore not suitable for this research.
2.3 Machine Learning
From the gathered data, a set of features is selected. In the next step, the emotions are
automatically identified by applying machine learning to these features. Machine learning is
a branch of artificial intelligence that evolved from pattern recognition and computer learning
theory [9]. It studies the construction of algorithms that offer the ability for computers to
learn without explicitly being programmed. It takes data as input and learns from this data
to make predictions.
A clear distinction is made between three main categories: supervised, unsupervised and
reinforcement learning algorithms. Algorithms in the first category are also commonly called
classification or regression algorithms for discrete or continuous data respectively. They take
labeled data as input, which means that the input data exists of pairs of input vectors and
corresponding output vectors. The goal is to build a model that explains the correspondence
between these input and output vectors. Based on this model, the algorithm can then predict
outputs for new unseen input data. Based on previous research in affect recognition (see
Appendix A), the supervised learning algorithms discussed here are: decision tree (DT),
Bayesian network (BN), naive Bayes (NB), k-nearest neighbors (k-NN), support vector machine
(SVM), random forest (RF) and artificial neural networks (ANN).
The second category contains clustering algorithms and dimensionality reduction algorithms.
They take unlabeled data as input and their goal is to find an underlying structure
in this data. Unsupervised learning is mainly used in pattern recognition and computer
vision. The following well-known unsupervised learning algorithms used for clustering will be
discussed here too: k-means clustering and Gaussian mixture model (GMM).
The last category, reinforcement learning, works a little differently. According to the theory
of operant conditioning in psychology, a human can learn by being rewarded for taking
certain desirable actions in specific environments and being punished for taking undesirable
actions. This concept is also used in reinforcement learning: the agent tries to find a sequence
of actions that leads to the greatest accumulated reward. This technique is not discussed here
as it is not applicable to the domain.
2.3.1 Decision Tree
The decision tree is one of the oldest machine learning techniques and is relatively simple to
understand. It tries to find appropriate split values for the different features under observation,
such that an optimal set of splits is determined to explain the output vectors. The advantage
of decision trees is that the solution is easy to understand, as it is represented by a tree in
which each node represents a feature split and each edge represents a decision based on this
split (see Figure 2.4). Thus, each data point ends up in one of the leaf nodes of the decision
tree, and output vectors for new data points are predicted by passing these new data points
through the tree to one of the leaf nodes. The disadvantage of decision trees is that they are
relatively slow to train and risk overfitting (when too many splits are made). A solution for
the overfitting problem is pruning the tree. Specific decision tree algorithms are Classification
And Regression Tree (CART), ID3 and C4.5.
Figure 2.4: Decision tree: (a) example of a decision tree; (b) graphical representation of the solution generated by the decision tree [3]
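The split search described above can be sketched for a single feature using the Gini impurity criterion employed by CART; the feature values and labels below are invented for illustration:

```python
# Sketch: choosing a single-feature split threshold by minimising the
# weighted Gini impurity of the child nodes, as a CART-style tree does
# at each node. Values and labels are illustrative, not thesis data.
def gini(labels):
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return the threshold that minimises weighted child impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical mean dwell times (ms) with a made-up binary label:
xs = [80, 85, 90, 140, 150, 160]
ys = ["low", "low", "low", "high", "high", "high"]
print(best_split(xs, ys))  # -> 90: this threshold separates the classes perfectly
```

In a full tree this search is repeated recursively on each child node until a stopping (or pruning) criterion is met.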
2.3.2 Bayesian Network
A Bayesian network is a probabilistic graphical model that uses a directed acyclic graph in
which the nodes represent random variables and edges represent conditional dependencies.
Each node has a probability function that takes a set of values for its parent variables as input
and gives the probability of the variable represented by the node as output. For discrete parent
variables, this function can be stored as a table (see Figure 2.5).
Figure 2.5: Bayesian network [2]
The main advantage of Bayesian networks is that they provide a way to make use of the
conditional independencies of the network to save a lot of storage space and calculation time.
For example, in Figure 2.5, the variable S does not depend on R and W. Thus, instead of
storing the full table P(S | C, R, W), which would contain 2^4 = 16 values, we can just store
P(S | C), which contains only 2^2 = 4 values.
In the simplest case, the Bayesian network is created based on the knowledge about relation-
ships between different variables. Otherwise, techniques exist to learn the network structure
based on data. Also, the parameter values need to be available to know how exactly variables
influence each other. When not available, techniques exist to learn these parameter values.
Bayesian networks can then be used to infer information about unobserved variables based
on observed variables.
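The storage saving from the conditional independencies mentioned above can be made concrete for Figure 2.5: counting full-table entries over binary variables, the factor P(S | C) needs only four entries (the probability values themselves are invented for illustration):

```python
# Sketch of the storage saving from conditional independence in Fig. 2.5:
# with binary variables, a full table over S and its three would-be parents
# C, R, W holds 2**4 = 16 entries, while P(S | C) holds only 2**2 = 4.
# The probabilities below are made up for illustration.
p_s_given_c = {                       # keys: (S, C)
    (True, True): 0.1, (False, True): 0.9,
    (True, False): 0.5, (False, False): 0.5,
}
full_entries = 2 ** 4                 # S plus three parents C, R, W
reduced_entries = len(p_s_given_c)
print(full_entries, reduced_entries)  # 16 4
```

For networks with many variables, exploiting such independencies is what makes storage and inference tractable at all.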
2.3.3 Naive Bayes
A classifier gets a vector of feature values as input and yields the corresponding output based
on the training data. To work in a more probabilistic context, a classifier can yield the
probabilities for each possible output instead of just the most likely output. In this case,
the goal is to calculate P(Ck | x1, x2, ..., xn) for each class Ck, where k indexes the possible
outputs and n is the number of features. Using Bayes' theorem, this can be rewritten as
P(Ck, x1, x2, ..., xn) / P(x1, x2, ..., xn). The denominator does not depend on Ck and thus,
since the feature values are given, is constant. Using the chain rule for conditional probability,
the numerator can be rewritten as P(Ck) P(x1 | Ck) P(x2 | Ck, x1) ... P(xn | Ck, x1, x2, ..., xn−1).
It is clear that for large n or many possible feature values, this calculation becomes very
difficult.
Naive Bayes makes the naive assumption that all features are conditionally independent of
each other given the output variable. This means that the naive Bayes technique takes every
feature into account as having an effect on the output variable but not on any other feature.
This can be illustrated with a simple example. Consider a medical cancer study with two
features: smoker and pneumonia. Both of these features can be indicators for the presence of
the output variable cancer. We know that someone who smokes is more likely to have
pneumonia. If we want to know the probability of a person having cancer, we should calculate
P(C | S, P) and thus, more specifically, P(C) P(S | C) P(P | C, S). However, naive Bayes
makes the assumption that smoking does not influence pneumonia and thus rewrites this as
P(C) P(S | C) P(P | C).
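With invented numbers, this factorisation can be evaluated directly (all probabilities below are hypothetical and chosen only to illustrate the computation):

```python
# Numeric illustration of the naive Bayes factorisation from the cancer
# example above; every probability here is invented for illustration.
p_c = 0.01                  # P(cancer)
p_s_given_c = 0.60          # P(smoker | cancer)
p_p_given_c = 0.20          # P(pneumonia | cancer)

# The full chain rule would need P(pneumonia | cancer, smoker); naive
# Bayes drops the smoker dependency and uses P(pneumonia | cancer):
joint_naive = p_c * p_s_given_c * p_p_given_c
print(round(joint_naive, 5))  # 0.0012
```

Dividing such joint terms by their sum over all classes yields the class posteriors, since the denominator is the same constant for each class.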
The naive assumption of conditionally independent features makes the calculations simpler
and thus the algorithms much quicker. Despite the apparently oversimplified model, naive
Bayes works quite well in many complex situations. However, it is still outperformed by some
other techniques.
2.3.4 k-Nearest Neighbors
The k-Nearest Neighbors algorithm saves all training data, i.e. all feature vectors and their
corresponding outputs. When a new instance is presented for which a prediction should
be made, the algorithm calculates the distances from this new instance to all training data
points and selects the k nearest points. Then, for classification, a majority voting is calculated
over these k nearest points to obtain a class prediction for the new instance. For regression,
the average is calculated over these k nearest points.
Figure 2.6: k-Nearest Neighbors: for k=3, the prediction for the new instance is class B; for k=6,
the prediction is class A [6]
The algorithm has no parameters except for the value of k. However, choosing the value of k
is important, as illustrated in Figure 2.6. A k-value that is too high is not sensitive enough to
detail, while a value that is too low is overly sensitive to the local neighbourhood of the new
instance. Furthermore, instead of just taking the k nearest training data points, one can also
use a weight function that gives data points closer to the new instance more influence than
points that are far away. Combining this with a higher value for k can compensate for the
disadvantages discussed earlier.
During training, the algorithm only saves the training data points but does not perform
any calculation. All calculations are performed during classification. Therefore, the k -NN
algorithm is called an instance-based or lazy learning algorithm.
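The distance computation and majority vote can be sketched in a few lines; the training points are invented so as to reproduce the behaviour of Figure 2.6 (class B at k = 3, class A at k = 6):

```python
# Pure-Python sketch of k-NN majority voting. The points are invented to
# reproduce the behaviour shown in Figure 2.6 (k=3 gives B, k=6 gives A).
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((x, y), label); query: an (x, y) point."""
    # squared Euclidean distance (same ordering as Euclidean distance)
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 0), "B"), ((0, 1), "B"),
         ((1, 1), "A"), ((1.5, 0), "A"), ((0, 1.5), "A"), ((2, 0), "A")]
print(knn_predict(train, (0, 0), k=3))  # B  (2 of the 3 nearest are B)
print(knn_predict(train, (0, 0), k=6))  # A  (4 of the 6 nearest are A)
```

The weighted variant mentioned above would replace the plain vote count with a sum of weights that decrease with distance.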
2.3.5 Support Vector Machine
The Support Vector Machine algorithm analyzes multi-dimensional training vectors that con-
tain feature values and tries to construct a set of hyperplanes that can be used for classification
of the vectors. A good set of hyperplanes is achieved when the hyperplanes separate the data
of different classes (each class on its own side of the hyperplanes) with a maximal margin, so
that the generalization error is minimal. This principle is illustrated in Figure 2.7.
Figure 2.7: Support Vector Machine [12]
The general SVM concept as explained until now is very interesting but cannot handle data
that is not linearly separable. To cope with this problem, there exists a technique called the
kernel trick. This technique transforms the data to a higher dimensional space where the
data might be linearly separable (see Figure 2.8a). Applying the normal SVM concept in this
space then yields a set of hyperplanes that can be transformed back to the original space (see
Figure 2.8b). Thus, using the kernel trick it is possible to learn a nonlinear SVM while still
using the linear formulation of SVM.
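The intuition behind the kernel trick can be shown with an explicit feature map on one-dimensional data; the points and the separating threshold below are invented for illustration (a real SVM never computes the map explicitly, which is precisely the point of the trick):

```python
# Sketch of the idea behind the kernel trick: 1-D data that is not
# linearly separable becomes separable after mapping x -> (x, x**2),
# the explicit feature map of a simple polynomial kernel.
# Points and the separating threshold are invented for illustration.
inner = [-0.5, 0.0, 0.5]         # class 0: clustered around the origin
outer = [-2.0, -1.5, 1.5, 2.0]   # class 1: on both sides of class 0

phi = lambda x: (x, x * x)       # explicit feature map

# No single threshold on x separates the classes, but in the (x, x**2)
# plane the horizontal line x**2 = 1 does:
assert all(phi(x)[1] < 1 for x in inner)
assert all(phi(x)[1] > 1 for x in outer)
```

Kernels such as the RBF kernel correspond to far higher-dimensional (even infinite-dimensional) maps, yet only inner products between mapped points are ever needed.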
2.3.6 Random Forest
Random Forest can be considered an ensemble algorithm. It uses many decision trees for
training and combines their outputs. More specifically, it divides the training data into
many subsets. To construct each subset, a number of samples are randomly selected with
replacement. This technique is called tree bagging and is also used by Random Decision Trees.
Furthermore, during the learning process of each tree, a random subset of features is selected
at each candidate split. The outputs of all trees are then combined using a majority vote
and presented as the output of the Random Forest algorithm. This algorithm is illustrated in
Figure 2.9. It usually has a higher accuracy and robustness than its individual constituent
classifiers, can be highly parallelized and can be used for very large datasets and a large
number of features. Random Forests correct the tendency of decision trees to overfit and can
be used to estimate relative feature importances.
Figure 2.9: Random Forest [11]
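The two ingredients named above, bootstrap sampling and majority voting, can be sketched with trivial one-threshold "stumps" standing in for full decision trees; the data and the stump training rule are invented to keep the example short:

```python
# Sketch of two Random Forest ingredients: bootstrap sampling (bagging)
# and majority voting. Full trees are replaced by trivial one-threshold
# stumps, and the data and training rule are invented for illustration.
import random

def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]   # sample with replacement

def stump_predict(threshold, x):
    return "high" if x > threshold else "low"

data = [(80, "low"), (90, "low"), (100, "low"),
        (140, "high"), (150, "high"), (160, "high")]

rng = random.Random(0)
# "Train" each stump on its own bootstrap sample: threshold = sample mean.
thresholds = [sum(x for x, _ in bootstrap(data, rng)) / len(data)
              for _ in range(25)]

def forest_predict(x):
    votes = [stump_predict(t, x) for t in thresholds]
    return max(set(votes), key=votes.count)   # majority vote

print(forest_predict(85), forest_predict(155))  # low high
```

A real Random Forest additionally draws a random feature subset at every split, which decorrelates the trees and is what makes the averaging effective.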
2.3.7 Artificial Neural Network
Artificial neural networks are a family of algorithms that try to estimate or approximate
functions that are usually unknown and often depend on a large number of inputs. They
are inspired by biological neural networks in the brain and try to mimic their functionality.
An artificial neural network consists of interconnected neurons that communicate through
these connections. A network has a number of input neurons and one or more output neurons.
In between, there can be so-called hidden nodes; a layer of hidden nodes is called a hidden
layer. This structure is illustrated in Figure 2.10a. Each neuron thus has a number of inputs
and an output. The inputs of each neuron can be weighted based on knowledge and
experience. The output of each neuron is calculated by a function using the
inputs of the neuron. Different types of neuron functions exist, but one that is used very often
is the sigmoid function. A neuron is illustrated in Figure 2.10b.
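A single sigmoid neuron as just described computes a weighted sum of its inputs plus a bias and passes it through the sigmoid; the weights and inputs below are illustrative:

```python
# A single sigmoid neuron: weighted inputs summed with a bias, then
# passed through the sigmoid activation. Weights, inputs and bias are
# illustrative values, not learned ones.
import math

def neuron(inputs, weights, bias):
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid activation

out = neuron([1.0, 0.5], [0.8, -0.4], bias=0.1)  # s = 0.8 - 0.2 + 0.1 = 0.7
print(round(out, 3))  # 0.668
```

A full network chains many such neurons layer by layer, and training consists of adjusting the weights and biases to reduce the prediction error.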
An artificial neural network is very interesting because of its flexibility. It can deal with
high-dimensional data of both continuous and discrete nature and is capable of non-linear
classification. It can be used for both classification and regression and has a relatively low
algorithmic complexity. A disadvantage is that there is no real guideline for determining the
number of neurons and layers that should be used. Using too many neurons and layers has a
negative effect on the performance, while using too few causes a decrease in accuracy. The
classifier that is generated by an artificial neural network is also very hard for humans to
interpret.
Some of the disadvantages of classic artificial neural networks can be countered by using deep
learning techniques. Deep learning algorithms use artificial neural network architectures with
many hidden layers. Deep learning algorithms assume that data is generated by many different
underlying factors on different levels and that these factors are organized into multiple levels of
abstraction. They offer the possibility to replace manually crafted features by an unsupervised
way of feature extraction that discovers the best network structure for the given data.
2.3.8 k-Means Clustering
The k-means clustering algorithm is an unsupervised learning algorithm that tries to partition
unlabeled data vectors into k different clusters, in which each cluster has a mean vector. The
algorithm is initialized with k initial mean vectors, selected from the dataset, forming k
clusters. This initial set of mean vectors is often chosen randomly. Next, each remaining data
vector in the dataset is assigned to the cluster of the nearest mean vector. Then, for each
cluster, the new mean vector is calculated and the assignment step is repeated. These two steps
are repeated for a finite number of times or until the cluster means no longer change
and the algorithm converges. The algorithm is illustrated in Figure 2.11. It is clear that
the quality of the solution of this algorithm depends a lot on the choice of the initial set
of mean vectors. Therefore, the algorithm is often run multiple times, each run having a
different initial set of mean vectors; at the end, a majority vote over the different solutions is
performed for each data vector. Another property that influences the quality of the obtained
solutions is the choice of k. This value should match the data on which clustering is
performed. Furthermore, k-means clustering works well when the different clusters are well
separated from each other; otherwise it will probably not be able to distinguish them. It is
also not robust against noise and outliers and fails for non-linearly separable datasets.
Chapter 2. Related Work
Figure 2.11: k-Means Clustering [5]
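The two alternating steps described above can be sketched in a few lines. This is a minimal 2-D illustration; for determinism it takes the first k points as initial means, whereas in practice the initialization is typically random:

```python
def kmeans(points, k, iters=100):
    """Partition 2-D points into k clusters; returns (means, assignments)."""
    # initial means selected from the dataset (often chosen at random;
    # here the first k points, to keep the sketch deterministic)
    means = list(points[:k])
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: each vector joins the cluster of the nearest mean
        assign = [min(range(k),
                      key=lambda c: (p[0] - means[c][0]) ** 2
                                  + (p[1] - means[c][1]) ** 2)
                  for p in points]
        # update step: recompute each cluster's mean vector
        new_means = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_means.append((sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members)))
            else:                      # empty cluster: keep the old mean
                new_means.append(means[c])
        if new_means == means:         # means no longer change: converged
            break
        means = new_means
    return means, assign
```

Running several restarts with different initial means and taking a majority vote, as described above, would wrap this function in an outer loop.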
2.3.9 Gaussian Mixture Model
The Gaussian Mixture Model is a probabilistic model that assumes that all data points are
generated by a weighted mixture of a finite number of underlying Gaussian distributions.
The parameters of these distributions are unknown and thus need to be learned from training
data. It can be viewed as a soft version of the k-means clustering algorithm. The algorithm
that is used to train Gaussian Mixture Models is called the EM-algorithm (Expectation-
Maximization). This algorithm works in about the same way as the k-means clustering
algorithm. It starts with a set of initial parameters and creates a function for the expectation
of the log-likelihood given these parameters. Then it adapts the parameters to maximize the
expected log-likelihood found in the expectation step. These two steps are then repeated for
a finite number of times or until the algorithm converges to a certain solution. This algorithm
is illustrated in Figure 2.12. The same problem of choosing the right initial parameters can
be found here. Very often, the k-means clustering algorithm is used several times first as
described above and its result is used for the initial parameters of the EM-algorithm.
Figure 2.12: EM-algorithm [4]
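The alternating expectation and maximization steps can be illustrated with a toy one-dimensional, two-component mixture. This is a sketch only: the initialization here is a crude sorted split of the data rather than the k-means-based initialization described above:

```python
import math

def em_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM (a toy sketch)."""
    # crude initialization: split the sorted data in half
    xs = sorted(xs)
    half = len(xs) // 2
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point,
        # i.e. the expectation given the current parameters
        resp = []
        for x in xs:
            dens = [w[c] / math.sqrt(2 * math.pi * var[c])
                    * math.exp(-(x - mu[c]) ** 2 / (2 * var[c]))
                    for c in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances to maximize
        # the expected log-likelihood
        for c in range(2):
            n_c = sum(r[c] for r in resp)
            w[c] = n_c / len(xs)
            mu[c] = sum(r[c] * x for r, x in zip(resp, xs)) / n_c
            var[c] = max(sum(r[c] * (x - mu[c]) ** 2
                             for r, x in zip(resp, xs)) / n_c, 1e-6)
    return w, mu, var
```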
2.4 Keystroke Dynamics
Keystroke dynamics is the study of the characteristics that are present in a user’s typing
rhythm when using a keyboard. It typically focuses on timing characteristics of typing to
identify patterns in the data. Most often this includes the analysis of duration of a keypress
or group of keys and the latency between consecutive keys. This classic set of features can
be expanded with other features that give information not only about "how" the user is
typing but also about "what" the user is typing. The way humans type is influenced
by many factors. Apart from individual differences, typing behaviour is also influenced by
the context in which the user is typing. This can be used as a feature because the typing
rhythm may highly depend on whether the user is typing in an instant messaging application
or a word processing application for example. Furthermore, an individual’s typing behaviour
can change when the individual experiences different emotions. This means that information
about a user’s typing behaviour may allow the inference of the user’s emotional state. This
is exactly the objective of this research. In this section some terminology is presented that
is used in keystroke dynamics literature and different approaches that can be followed in this
area are examined. Furthermore, some of the most common features that are used in related
literature are discussed in more detail.
2.4.1 Fixed Text Analysis
A first approach, called fixed or static text analysis, usually requires participants to type one
or more fixed pieces of text multiple times during the data collection process. Using fixed
text to build a model implies that this model can only be used at those moments that the
user types one or more of the fixed pieces of text on which the model was trained. Fixed
text studies typically require participants to enter the fixed text in some predefined text box
while the keystrokes are monitored. This approach is typically useful in authentication and
security applications as it can be used on passwords.
2.4.2 Free Text Analysis
Another approach, called free or dynamic text analysis, does not require the user to type
the same piece of text each time during data collection but allows any sequence of text as
input. Using a dynamic approach enables the model to be used during continuous monitoring,
which is very desirable for affect recognition because it means that emotional information is
available in the computer system at any time. Free text studies typically involve participants
entering any text they like in a predefined text box while the keystrokes are monitored.
However, this is not necessarily the case. It is also possible to use software that monitors all
keystrokes in any application. This approach has the benefit that the participant does not
have to think about what he will type first before actually typing. Another advantage is that
this way, the application context can be monitored too. Furthermore, using this approach
it is also possible to obtain the keystroke data unobtrusively, which is also highly desirable
for affect recognition. These benefits form the motivation for using free text analysis in this
research instead of fixed text analysis.
2.4.3 Keystroke Features
The most commonly used features in related literature are timing features. These include
features that are calculated on both individual keys as well as multiple keys. A first typical
feature is the key duration. This is the time elapsed from the moment that the key was
pressed by the user (keydown event) until the time that the key was released (keyup event).
A keyup event of a certain key does not necessarily follow the corresponding keydown event
directly. For example, when typing an uppercase character the first event is a keydown event
for the Shift key, followed by the keydown event of the character key, then followed by the
keyup events of both keys in any order. Another case in which this is possible is when a
user is typing very fast. In the case of typing two characters very fast, it is possible that the
second character key is pressed before the first character key is released. These cases must
be taken into consideration when analyzing the collected keystroke data.
The key duration feature has also been used on multiple consecutive keys or graphs. A digraph
contains two consecutive keystrokes, whereas a trigraph contains three; this continues for any
number of keystrokes, which creates n-graphs.
A second typical feature is the digraph latency. This is the time elapsed from the keyup
event of a key to the keydown event of the next key. Taking into account the considerations
mentioned above, the latency can be negative when the keyup event of a key comes after the
keydown event of the consecutive key.
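Given these caveats, the duration and latency features can be extracted from a raw event stream. The `(timestamp, key, kind)` tuple format below is a hypothetical one, not that of any particular logging library:

```python
def timing_features(events):
    """Extract per-key durations and digraph latencies from a stream of
    (timestamp_ms, key, kind) tuples, where kind is 'down' or 'up'.
    Handles interleaved events (e.g. Shift held across another key,
    or overlapping keys during fast typing)."""
    pending = {}   # key -> index into `presses` of its still-open keydown
    presses = []   # [down_time, up_time or None, key], in keydown order
    for t, key, kind in events:
        if kind == 'down':
            pending[key] = len(presses)
            presses.append([t, None, key])
        elif key in pending:           # match the keyup to its open keydown
            presses[pending.pop(key)][1] = t
    durations = [(p[2], p[1] - p[0]) for p in presses if p[1] is not None]
    # digraph latency: keyup of press i to keydown of press i+1; negative
    # when the next key goes down before the previous one is released
    latencies = [presses[i + 1][0] - presses[i][1]
                 for i in range(len(presses) - 1)
                 if presses[i][1] is not None]
    return durations, latencies
```

For the uppercase-character example above (Shift down, key down, key up, Shift up), the Shift-to-key latency comes out negative, exactly the case the text warns about.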
When deciding which features to use, it is important to know that larger n-graphs are less
likely to occur in free text and thus cause sparsity when used as features. Using only timing
features has the benefit of preserving privacy, as the actual content of the text is not used.
However, the text content very likely contains valuable information and will therefore be used
in this research.
2.5 Text Content Analysis
The actual content of text contains a lot of valuable information concerning affect. The
starting point of a linguistic analysis of text to extract emotional information from it is the
use of specific affective lexicons. An interesting lexicon is the Affective Norms for English
Words (ANEW) [18]. This lexicon presents a set of verbal materials that have been rated
in terms of pleasure, arousal and dominance. SentiWordnet [24] is another lexical resource
that keeps information on the polarity of subjective terms. For each term three scores are
calculated: objectivity O, positivity P and negativity N . The scores are related to each other
as follows: O = 1 − (P + N). WordNet Affect [60] is a third lexicon that was motivated
by the need for a lexical resource containing explicit fine-grained emotional annotations.
Clore et al. [20] argued that words need to be distinguished based on the fact whether they
directly refer to emotional states or contain an indirect reference that depends on the context.
WordNet Affect is an extension of the WordNet database [49].
A more abstract analysis using specific textual features is also possible. Vizer et al. [64]
used a number of features defined by Zhou et al. [68]. However, some of these features are
fairly straightforward to use (e.g. lexical diversity, content diversity, average word length,
average sentence length) while others still require some extra (possibly manual) semantical
information on word types. For example, a possible feature would be the rate of self-reference
words (e.g. me and I) but this assumes that the system knows that these words are self-
references, which is not trivial as some mapping will be needed. Other possible features are
special punctuation and uppercase word rate [14].
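The straightforward features among these (lexical diversity, average word length, average sentence length, uppercase word rate) can be computed with simple tokenization; this sketch makes no attempt at the semantic features such as the self-reference rate:

```python
import re

def text_features(text):
    """Compute a few of the straightforward textual features mentioned
    above; the tokenization here is deliberately simplistic."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return {
        # lexical diversity: distinct words relative to total words
        'lexical_diversity': len({w.lower() for w in words}) / len(words),
        'avg_word_length': sum(len(w) for w in words) / len(words),
        'avg_sentence_length': len(words) / len(sentences),
        # rate of fully uppercase words (a possible emphasis signal)
        'uppercase_word_rate': sum(w.isupper() for w in words) / len(words),
    }
```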
It is also possible to make use of the ISEAR (International Survey on Emotion Antecedents
and Reactions) dataset, which contains reports of situations accompanied by the corresponding
emotions that were experienced in these situations. The possible emotions are joy, fear, anger,
sadness, disgust, shame and guilt. The ISEAR project was directed by Klaus R. Scherer and
Harald Wallbott. This dataset was used to train a classifier that can then be used on unseen
pieces of text. The training process of this classifier was done using a vector space model.
Such a model assigns a weight to every term in a document. To calculate these weights,
several schemes exist. A commonly used scheme is TF-IDF weighting (Term Frequency -
Inverse Document Frequency). To calculate this weight for a term in a particular document,
the term frequency for that document is calculated and then the inverse document frequency
is calculated by taking the logarithm of the total number of documents in the dataset divided
by the number of documents containing the term for which the weight is being calculated.
After normalization, the formula for the weight becomes

w_i = (tf_i · log(N/d_i)) / √( Σ_{j=1}^{n} (tf_j · log(N/d_j))² )

where w_i is the weight of term i, tf_i is the term frequency of term i, N is the total number
of documents in the dataset, d_i is the number of documents containing term i and n is the
total number of unique terms in the dataset. All term weights can then be calculated and all
documents can be converted to vectors in the vector space model. This way, each emotion
has several corresponding vectors and a classifier can be trained on this data. This classifier
can use several similarity measures. Nahin et al. [52] achieved better accuracy using Jaccard
similarity than using cosine similarity.
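The weighting scheme and the two similarity measures just mentioned can be sketched as follows (a minimal illustration, not the exact pipeline of [52]):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one normalized TF-IDF weight
    vector (term -> w_i) per document, following the formula above."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # d_i per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(N / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors

def cosine(u, v):
    # vectors are already unit-norm, so the dot product suffices
    return sum(u[t] * v[t] for t in u if t in v)

def jaccard(u, v):
    # set-based Jaccard similarity over the terms present in each vector
    return len(set(u) & set(v)) / len(set(u) | set(v)) if u or v else 0.0
```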
2.6 Mouse Behaviour
Besides text content and keystroke dynamics, mouse behaviour may also contain useful
information on an individual’s emotional state. Modelling mouse behaviour requires capturing
mouse events. One can distinguish single and double clicks, performed with the left, right or
middle button. Furthermore, one can observe mouse movements and mouse wheel
movements. From the mouse movement data, the distance, angle and speed features between
pairs of data points can be extracted. Usually, it does not make much sense to calculate these
features for each pair of consecutive data points. Instead, a frequency parameter is used that
defines the separation between observed pairs of data points. Also the time that the mouse
is inactive can be used as a feature. From these features, one can calculate means, standard
deviations, moment values and frequencies over windows of N data points.
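A possible sketch of these mouse features, with `step` playing the role of the frequency parameter; the `(timestamp, x, y)` sample format is hypothetical:

```python
import math

def mouse_features(points, step=5):
    """points: list of (t_ms, x, y) samples. Computes distance, angle and
    speed between pairs of points separated by `step` samples."""
    feats = []
    for i in range(0, len(points) - step, step):
        t0, x0, y0 = points[i]
        t1, x1, y1 = points[i + step]
        dx, dy = x1 - x0, y1 - y0
        dist = math.hypot(dx, dy)
        angle = math.atan2(dy, dx)
        speed = dist / (t1 - t0) if t1 > t0 else 0.0   # pixels per ms
        feats.append((dist, angle, speed))
    return feats

def summarize(values):
    """Mean and standard deviation over a window of feature values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)
```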
2.7 Authentication and Security
The interest in keystroke dynamics originated after a study by Gaines et al. [27] who observed
that individuals seem to have unique typing behaviours. It was suggested to use this infor-
mation in authentication systems to provide an extra security measure. In these applications,
the goal is to identify a user by analyzing his typing pattern. This principle is very analogous
to the usage of handwritten letters and signatures. This goal can be achieved by asking a
user to type a fixed sequence of keys (possibly multiple times) and analyzing the same timing
features discussed above. This fixed sequence of keys could be a password for example. When
a user wants to authenticate himself, he then needs to supply the correct password but also
supply it in the correct way. However, it is also possible to use a free text approach. This
is especially interesting because a system could be used that continuously monitors the user
during an authenticated session and detects when a possible intruder starts using the system.
One of the first studies in the area of authentication using keystroke dynamics was performed
by Monrose and Rubin [50]. They collected keystroke data from several participants that
were asked to type some fixed sentences as well as some free text. They tried different
distance metrics using the k-nearest neighbor algorithm and achieved a correct identification
rate of approximately 90% using a weighted probabilistic classifier. They observed that free
text classification did not perform as well as fixed text classification and attributed this to
variations in operational conditions in which the user may be absorbed (e.g. emotionally
charged situations) [51].
Fairhurst et al. [25] studied the effect of multiclassifier systems on the accuracy of identity
prediction based on keystroke dynamics. They used a k -NN classifier, a decision tree algo-
rithm and a naive Bayes classifier. They combined these classifiers using DCS-LA (Dynamic
Classifier Selection using Local Accuracy), majority voting and summation. These multiclas-
sifier systems yield much higher accuracies than the single classifiers do. Furthermore, they
achieved accuracies of more than 95% for gender prediction. This is an important observation
for this research as this indicates that it is possible to distinguish between men and women,
implying that this classification will make emotion classification easier too, assuming that
emotional behaviour is also related to gender.
Some interesting work in the area of identification using free text keystroke dynamics has
been done by Traore et al. [61]. They propose a new approach that combines monograph and
digraph analysis and uses a neural network to predict missing digraphs based on the relation
between the monitored keystrokes. They achieved a false acceptance rate (FAR) of only
0.0152%, a false rejection rate (FRR) of 4.82% and an equal error rate (EER) of 2.46% in
a heterogeneous environment. The results for a homogeneous environment are even better
but are less reliable due to a low number of samples that were used.
An important observation in keystroke dynamics research for authentication purposes is that it
is possible that an individual’s typing rhythm changes throughout the course of his life. There
have also been many reports of non-negligible variance in an individual’s typing behaviour.
This can be attributed to both physiological and psychological factors. This last factor is
especially of great importance in this research.
Also mouse behaviour can be used in authentication systems. Pusara et al. [54] presented an
approach to user re-authentication based on only mouse behaviour data. This is an example
of continuous analysis and the detection of anomalies, leading to a re-authentication request.
They used the C5.0 decision tree algorithm using the mouse features discussed above and
have achieved a FAR of 0.43% and a FRR of 1.75%.
2.8 Emotion Recognition in Computer Systems
In this section, related work in the specific domain of emotion recognition in computer systems
will be discussed. A summary of the methodologies, techniques and results of some of this
previous work will be given. This information is also put in a concept matrix in Appendix A.
Some of the first work that was done in this area can be attributed to Zimmerman et al. [70].
They described a method to correlate human keyboard and mouse interaction with affec-
tive states. They used movie clips to induce different affective states in the PA emotion
model. Participants had to shop on an e-commerce website for office supplies while several
physiological parameters were measured such as respiration, pulse, skin conductance level
and corrugator activity. During the experiment all mouse and keyboard actions were regis-
tered too. Later [69], they were able to show, using this methodology, that affect impacts
motor-behaviour of computer users.
Vizer et al. [64] proposed a new way of assessing human cognitive and physical stress by
analyzing keystroke dynamics using both content and timing features. This is based on the
observation of variability and drift in an individual’s typing pattern which has been attributed
to situational factors as well as stress and fatigue and thus to changes in cognitive and physical
function. They collected free text keystroke samples over multiple sessions for each participant
consisting of baseline and control samples under no stress, samples under cognitive stress
and samples under physical stress. Participants had to describe their stress level after each
session using an 11-point Likert scale. Cognitive stress conditions were induced using mental
multiplication and three-back number recall. Physical stress was induced by walking on a
treadmill and performing biceps curls. They used different machine learning algorithms and
achieved correct classification rates of 62.5% for the physical stress condition and 75% for the
cognitive stress condition. They stated that their results are comparable to other affective
computing techniques for stress detection.
Epp et al. [23] investigated the possibility to identify different emotional states using keystroke
analysis. They focused on gathering keystroke data in a natural context using the experience
sampling methodology rather than a laboratory environment. Each participant installed a
program running in the background to collect free text. Based on the participant’s activity, the
program prompted the participant to fill out a questionnaire that contained 15 5-point Likert
scale questions, each one regarding one of the following emotional states: anger, boredom,
confidence, distraction, excitement, focused, frustration, happiness, hesitance, nervousness,
overwhelmed, relaxation, sadness, stress and tired. After filling out each questionnaire, the
participant had to enter a piece of fixed text. A large number of features were extracted from
this data: duration times for single keys, digraphs and trigraphs; latency times for digraphs
and trigraphs; times between pressing two successive keys in graphs; number of events in
graphs; number of characters, numbers, punctuation marks and uppercase characters. They
also used outlier removal, feature selection and undersampling to avoid class skew. The C4.5
decision tree algorithm was used to build a binary classifier for each of the 15 mentioned
emotional states. The classifiers for free text did not perform well so only the results for fixed
text were included. They achieved reliable accuracy rates ranging from 77.4% to 87.8% for
the confidence, hesitance, nervousness, relaxation, sadness and tiredness states.
Alhothali [13] used a dialogue-based tutoring system to spontaneously induce affective states
that relate to learning: delighted, neutral, confused, bored and frustrated. The experiment
was performed in a laboratory setting and collected timing, typing and response features
during the participant’s interaction. The participants had to determine their emotion after
each response using 5 statements on a 5-point Likert scale. Two external judges also observed
the participants and provided their responses for the participants. The classification was
divided in two stages. First, classification of the emotional valence was performed. This
classification method yielded an accuracy of 82.82%, 72.02% and 77.2% for the user-labeled,
judge1 and judge2 datasets respectively using an artificial neural network. The classification
for the specific emotions only reached an accuracy of 53.59%, 45.6% and 53.89%
for the user-labeled, judge1 and judge2 datasets respectively, also using an artificial neural
network.
Tsoulouhas et al. [62] introduced a method to detect student boredom during the attendance
in an online lesson that was followed in a laboratory setting. They extracted features from the
movements of the user’s mouse such as inactivity timings, speeds and direction of movements.
These features were then used as input to the C4.5 decision tree algorithm. They also used
the content features of the different learning objects. Each time after a certain period of
inactivity the user is prompted with a pop up dialog which asks the user to indicate whether
he is bored with a single "Yes" or "No". The dialog appears only when the mouse is moved
again so that the duration of the pause is not influenced by the system. They achieved a
correct classification rate above 90%.
Besides keystroke and mouse features, it is also possible to use other additional
sensors that are present in today’s smartphones. LiKamWa et al. [44] performed an
experimental study, which they extended in [45], to classify different moods using this kind of
data. As noted above, moods are less intense and last longer than emotions. They use the
circumplex emotion model to quantify moods and use frequency features from application
usage, phone calls, SMSes, emails, web browsing history and location. All data is collected in
a natural environment. A very high accuracy rate is achieved for individual models and they
observed that when using some of the chosen features individually, a similar or slightly worse
accuracy is achieved compared to when all features are used. This indicates that only a subset
of features is responsible for the high accuracy. However, this specific subset is dependent on
the specific user that is observed. Just like in the research of Epp et al. [23], the acquired
dataset suffers from class skew.
Continuing in the area of smartphones, Lee et al. [42] investigated the possibilities to auto-
matically recognize emotion in social network service posts so that correct emoticons could
be automatically added to it. Participants were asked to write short messages in a natural
environment reporting their emotional state. Seven emotions were predefined: happiness,
surprise, anger, disgust, sadness, fear and neutral. Initially 14 features were used. Some of
them were keystroke features such as typing speed and frequencies of pressing backspace, en-
ter and special symbols. Other features described the content of the messages that were sent.
Furthermore, additional features were used such as touch count, long touch count, device
shake count, illuminance, discomfort index, location, time and weather. Of these 14 features
10 were selected that correlated the most with the emotion information to build an inference
model. A Bayesian network was used as classifier. After the emotion of a text message was
recognized, the user decided whether the emotion was accurate or not and provided feedback.
The model was then updated immediately. An average classification accuracy of 67.52% was
achieved but this strongly depended on the type of emotion. The best recognition rates were
achieved for happiness, surprise and neutral. However, they have found a general correlation
between the recognition accuracy and the amount of observation cases for each emotion. This
indicates that more data is necessary for emotional states for which the accuracies are lower.
Tsui et al. [63] validated the hypothesis about the existence of the difference in typing patterns
between different emotional states using the facial feedback hypothesis. They performed an
experiment in a laboratory setting in which they asked participants to type a fixed number
sequence under different emotional states induced by the facial feedback. One state induced
positive emotion and another one induced negative emotion. During the experiment, the
keystroke data was recorded. They used features such as duration and latency. The results
supported the initial hypothesis about the relation between typing patterns and emotional
states.
Nahin et al. [52] tried to detect emotions by analyzing a combination of keystroke dynamics
and textual contents typed by a user. They distinguished seven emotional classes and used
both fixed text analysis and free text analysis. Two additional classes were used (neutral and
tired) in case the user is not in any of the seven classes. They extracted 19 keystroke features
and 7 of them were actually selected. For textual content analysis the ISEAR dataset and
WordNet was used for text pattern analysis. The text is converted to a vector using the
vector space model. Different algorithms were used to train models for each emotional class
and relatively high accuracies were achieved for fixed text analysis. The results for free text
were less satisfying but still better than chance.
Hernandez et al. [30] present a method to evaluate a user’s boredom and frustration in an
intelligent learning environment. In contrast to [62] they mainly focus on free text analysis
and perform experiments in a natural environment. They extract commonly used keystroke
features based on previous research and also take mouse dynamics into account as extra
features. Furthermore, a genetic feature selection algorithm is used to obtain a good feature
subset and the k -NN algorithm is used for classification. This improves the classifier results
and obtains a correct classification rate of about 83% and 74% for boredom and frustration
respectively. These rates are similar to fixed text analysis classification rates obtained in
other research.
Another study, conducted by Lee et al. [43], examines the source of variance in keystroke
typing patterns caused by emotions using visual stimuli to induce emotional states. They
used the International Affective Picture System (IAPS) in a laboratory setting and asked the
participants to type a fixed number sequence. The keystroke data was recorded during the
typing task and afterwards a self-assessment manikin had to be submitted. The results of
this manikin were translated into three levels of the ANOVA factors valence and arousal. The
results of the experiment indicate that the effect of emotion is significant in the keystroke
duration, latency and accuracy rate of the keyboard typing. However, the size of the emotional
effect is small compared to the individual variability.
Shukla et al. [58] did some research that is comparable to previous experiments but proposed
the usage of fuzzy logic. Bakhtiyari et al. [15, 16, 17] describe a fuzzy model for multi-level
human emotion recognition through keyboard keystrokes, mouse and touch-screen interac-
tions. The reason for using a fuzzy model is based on the assumption that emotions have a
fuzzy basis and that human beings may have different emotions at the same time in different
levels. They classified emotions into five different levels and used an experience sampling
methodology to collect their data. Participants were asked to indicate their levels of different
emotions every 4 hours. One important observation was that the emotions indicated by men
were stronger on average than the emotions indicated by women. They used the support
vector machine algorithm to classify emotions and they were able to increase the detection
accuracy up to 5% compared to non-fuzzy models.
Salmeron et al. [57] compared different emotion labeling methods and combinations of these
methods to see how accuracies can be improved. They compared taking a self-assessment
manikin provided by the user, a manikin provided by expert psychologists, a categorical ap-
proach using PANAS and a combination of the manikin provided by the user and the manikin
provided by the psychologists. Furthermore, they compare different data sources: keystroke
analysis, mouse analysis, physiological analysis and sentiment analysis. Data collection was
performed in a laboratory setting using free text. Their results indicate that sentiment anal-
ysis yields the highest accuracy. The algorithm and labeling methods that yield the best
results depend on the data source that is used.
Kolakowska recently reviewed a lot of the research that has been done so far in the area of
emotion recognition in computer systems [37, 38]. The most important conclusions for future
work are that the data collection stage takes a very long time and that this should be taken
into account during the experiment design. Furthermore, it seems that individual models
will perform much better than general models. It is often very hard to compare different
studies because they use different datasets, emotion models, feature sets and algorithms.
The used datasets are usually not made publicly available which prevents the improvement
of existing research results. Moreover, recognition accuracy seems to vary among different
emotional states. It is crucial to obtain enough data to obtain reliable results. Especially
in the case of individual models, data of different users cannot be combined and thus each
user needs to provide a large amount of data. A number of interesting possible applications are
presented [36], including possibilities for optimizing software usability, improving development
processes, education, websites and video games. For each of these possible applications,
various research methods are proposed, as well as the possible challenges that need to be
overcome.
(a) Transform the data to a higher dimensional space so that it becomes linearly separable
(b) Apply the SVM technique and transform the solution back to the original space
Figure 2.8: Kernel trick [7]
(a) General structure [1]
(b) Neuron [10]
Figure 2.10: Artificial neural network
Chapter 3
Data Collection
This chapter discusses the first step of the research. Before being able to analyze data and
start building models for emotion recognition, data needs to be obtained. A field study was
performed using custom built software. In the first part of this chapter the details of this
field study are presented. Then the actual software that was built is discussed.
The field study was designed to gather participants’ keystroke, mouse, location and weather
data together with subjective indications of emotional states in the pleasure-arousal-dominance
model. The study is conducted in a naturalistic setting while participants perform their daily
tasks using an experience sampling method. This approach is more representative of a
real-world context and allows mostly uninfluenced data to be captured. The participants are
periodically requested to indicate their emotional state while other data is continuously collected
in the background. The data collection process and scheduling of emotional state requests is
done by a piece of software that is installed on the participant’s computer.
This approach has some disadvantages compared to a more controlled approach. It is
almost impossible to control emotional states, as this requires eliciting emotional responses,
which is hard to achieve in an uncontrolled environment. As a result, there are
no real guarantees of clean and correctly labeled data. Therefore, to be able to create models
with sufficient predictive power, more data needs to be gathered when using an uncontrolled
approach. Again, this is difficult to enforce in an uncontrolled environment as participants
are not obligated to spend certain amounts of time on the computer and provide a sufficient
amount of data. This requires the study to run over a relatively long period of time, preferably
using participants that spend a lot of time on the computer.
Controlled lab studies have the disadvantage that they can be very time-consuming, which
means that to collect a sufficient amount of data, participants need to attend multiple
sessions. Thus, it will be more difficult to find participants who are willing to partake in the
study. Additionally, this can be very costly as it may require more administration and
compensation. Furthermore, often only a small set of variables is used, which may be a
limiting factor for exploratory research aimed at determining which variables are of interest.
An experience sampling approach was chosen over a controlled laboratory approach after
weighing the relative advantages and disadvantages of each.
In this research, the pleasure-arousal-dominance (PAD) model was chosen for emotion
indication, as it is a general emotion model with few downsides. It allows determining which
emotional states are most suited for detection using this approach, while still letting
participants express a wide range of emotional states. Furthermore, this model lends itself
well to mapping onto other models, such as a categorical model. Participants were periodically
asked to indicate their emotional state using three scales (one for each axis of the PAD model)
ranging from 0 to 100, where 0 stands for the most negative value on the axis and 100 for
the most positive. Note that these numbers were not shown on the scales, so a participant
would not take any numerical value into account. This enforces a more intuitive way of
emotion indication.
Participants were not obligated to respond when asked to indicate their emotional state.
When a request was ignored, the collected data was transmitted anyway, without a
corresponding emotional state label. Unlabeled data can still be used to create a profile for
each participant that captures their average behaviour.
The data gathered in this study contained many raw elements that need further processing.
For example, in theory the implemented emotion model allows for 101^3 (over one million)
different emotional states. It is possible to perform regression, or to perform discretization
by reducing the many possible values to a smaller number of classes. For example, values at
the outer ends of each scale could be grouped with more moderate values on the same scale
if it is observed that participants are less inclined to choose extreme values. Furthermore,
keystroke and mouse data are collected with corresponding timestamps, but the time
differences between the keystrokes and mouse movements might be more interesting. Many
such processing steps on the raw data will be discussed in Chapter 4.
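As a minimal sketch of such a discretization, each 0-100 axis value could be reduced to a small number of classes. The bin edges below (33/67) are illustrative assumptions, not the ones used in this work:

```python
def discretize_axis(value, low=33, high=67):
    """Reduce a 0-100 PAD axis value to one of three classes.

    The bin edges (33/67) are illustrative; grouping extreme values
    with more moderate ones would simply shift these edges.
    """
    if not 0 <= value <= 100:
        raise ValueError("PAD axis values lie in [0, 100]")
    if value < low:
        return "negative"
    if value < high:
        return "neutral"
    return "positive"

# A full PAD label becomes a 3-class triple instead of one of 101**3 states.
label = tuple(discretize_axis(v) for v in (12, 50, 95))
# -> ("negative", "neutral", "positive")
```

Regression would instead keep the raw 0-100 values as continuous targets; the choice between the two is revisited in Chapter 4.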
3.1 Field Study
The field study was conducted from November 15th, 2015 until May 1st, 2016, with 14
participants contributing data for, on average, 22 weeks. All participants were recruited at
our personal request; no incentives were offered.
3.1.1 Set-up
Upon signing up for the study, each participant had to provide some registration informa-
tion (including some profiling information). This was done through a website that was pro-
grammed in PHP1, which facilitated automated processing and the remote administration of
the study. The registration information included the participant’s first and last name, e-mail
address, gender, birthdate, place of birth, occupation, education, nationality, first language,
most used language on the computer, dominant hand, typing skills (low, intermediate or
high), computer skills (low, intermediate or high), percentage of total computer time spent
on the concerned computer, keyboard layout, mouse type and computer type. After submit-
ting this information, the participant was directed to a consent form that needed to be agreed
upon by the participant. The agreement was enforced by an explicit check box that needed
to be checked before being able to continue the signup. The IP address of the participant’s
computer and the timestamp when the user was registered was collected and a unique user
id was generated using the MD52 hashing scheme. All participant information was sent over
a secured TLS/SSL3 connection and kept securely in a MySQL4 database on the server that
hosted the website. After having agreed with the consent form, the participant was directed
to a download page where a simple installer file could be downloaded and clear installation
instructions could be found. All participants were guided through this process as much as
needed. The three pages described above can be found in Appendix B, C and D.
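The id generation step can be sketched with Python's hashlib. The exact inputs to the hash are not described above, so hashing the e-mail address together with the registration timestamp is purely an illustrative assumption:

```python
import hashlib

def make_user_id(email: str, registered_at: str) -> str:
    """Derive an anonymous 32-character participant id with MD5.

    Hashing the e-mail plus registration timestamp is an assumption for
    illustration; the study only specifies that MD5 was used.
    """
    digest = hashlib.md5(f"{email}|{registered_at}".encode("utf-8"))
    return digest.hexdigest()

uid = make_user_id("participant@example.com", "2015-11-15T10:00:00")
```

Note that MD5 is used here only as a compact identifier scheme, not for cryptographic protection; the anonymity guarantee comes from not storing the link between id and name after the study.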
Furthermore, each participant was added manually to a dedicated university mailing list which
was used to be able to quickly send out communication to all participants at once.
3.1.2 Restrictions
The participants were not restricted in their computer usage during the entire course of the
study. There were also no restrictions on the language that could be typed, as most
participants often need to use both Dutch and English. The operating systems that
participants could use were, however, restricted to different versions of Windows, because
the participants were required to install platform-dependent data collection software on their
computers. This restricted the study to participants who use the Windows operating system.
The versions of Windows that were used are Windows 7, Windows 8, Windows 8.1 and
Windows 10.
Both laptop and desktop users could partake in the study and external devices could also be
used. This implies that keystrokes may originate from either a laptop keyboard or an external
1 PHP: Hypertext Preprocessor https://www.php.net/
2 MD5: https://en.wikipedia.org/wiki/MD5
3 TLS: Transport Layer Security https://en.wikipedia.org/wiki/Transport_Layer_Security
4 MySQL: https://www.mysql.com/
keyboard. The same holds for the pointing device movements. These may originate from a
trackpad or an external mouse. However, upon signing up for the study, participants had to
indicate which sources they would be using as well as the keyboard layout that they were
using (AZERTY/QWERTY/...).
3.1.3 Privacy Issues
The sensitivity of the collected data initially made most participants apprehensive about
partaking in the study; most were not eager to be monitored continuously. Because the data
collection software is essentially a very advanced key logger, participants often associate it
with malware and privacy invasion. To make sure participants were comfortable with
running the software on their computers, a detailed explanation of the data collection process
and the data storage principles was given individually to each participant, and they had the
opportunity to ask questions. A number of guidelines from an ethics committee were also
taken into account. Accordingly, all data was collected anonymously using a code that
cannot be linked back to the name of the participant after the study. Every participant also
needed to agree with an informed consent form, as mentioned before. Furthermore, certain
guarantees were made to assure safe transmission of the data to the server.
However, not all persons who were asked to participate felt comfortable even after these
extra safety measures and explanations, and they decided not to partake in the study. A
laboratory study in which participants are aware that they are being monitored would
probably not pose such problems, as the participant is able to control his behaviour and can
avoid passing sensitive information. However, the 'hidden' aspect of advanced key logger
software is exactly what makes keystroke dynamics and mouse movements so interesting for
emotion recognition, as the user will not really think about the fact that he is being
monitored. Therefore, the participant's emotional state will only be minimally influenced by
the knowledge of being monitored.
3.1.4 Meantime Study Evaluation
On December 26th, 2015 all participants received an email (through the mailing list in which
they had been included upon signup) asking them to fill out a short study evaluation form
that investigated whether participants experienced problems using the software and what the
causes were. The participants were asked how often they filled out an emotional questionnaire
and, if this was less than once a day, what the reason was for not doing so more often. The
participants were also asked to indicate how comfortable they were with the study, how often
they used their computers and for which periods of time. At the end of the form, participants
could also add extra comments.
This evaluation revealed some software issues and also indicated what caused some
participants to provide small amounts of data. The two main causes were software issues
(which will be discussed later) and laziness or inappropriate timing of the notification.
Almost all participants felt comfortable with providing the information, assuming it
remained confidential and was only used for academic research purposes; some participants
were more sceptical. Most participants also had no trouble indicating their emotional state
using the software, though some mentioned that they found the scales unclear or did not
understand the PAD emotion model very well. Most participants spent on average 3 hours a
day on their computer, divided over different sessions of more than an hour each on average.
3.1.5 Completion
On May 1st, 2016 all participants received an email (through the mailing list) that notified
them that the data collection process was finished and that they could remove the software
from their computer. Manual check-ups were also performed to make sure that they had
successfully removed the software. The participants were also notified that they could request
the data that was collected about them, if desired.
3.2 Participant Demographics
As mentioned before, some additional information was obtained from the participant pop-
ulation to be able to create a profile for each participant and possibly use this information
for improving the predictive models. This additional information was obtained during the
registration phase when the participant signed up for the study. A relevant list of profiling
questions can be found in Table 3.1. For question 8 the participants could choose between
left or right. For question 9 the participants could choose either low (typing at a rate less
than 180 keystrokes per minute), intermediate (typing at a rate higher than 180 keystrokes
per minute and less than 300 keystrokes per minute) or high (typing at a rate higher than 300
keystrokes per minute). For question 10 the participants could choose either low (elementary
use), intermediate (experience with a large set of applications and a basic understanding of the
operating system in use) or high (very experienced in programming and a good understanding
of computer logic).
Originally, 22 participants partook in the study, but only 14 of them were sufficiently active
(>10 labeled samples). The data of the less active participants was removed, and only the
responses of the sufficiently active participants to the questions of Table 3.1 are presented
below.
Table 3.1: Demographic questions during registration

1 Gender
2 Birthdate
3 Occupation
4 Education
5 Nationality
6 First language
7 Most used language on this computer
8 Dominant hand
9 Typing skills
10 Computer skills
11 Percentage of total computer time that is spent on this computer
12 Keyboard layout
13 Mouse type
14 Computer type

Of the 14 active participants, 12 were male and 2 were female. The age distribution can be
found in Figure 3.1. All participants were students and the specific education distribution can
be found in Figure 3.2. All participants were Belgian and had Dutch as their first language.
All participants mostly used Dutch on their computers except for 1 participant who used
English as much as Dutch. 12 participants were right handed while 2 participants were left
handed. The distributions for typing skills, computer skills and percentage of total computer
time that is spent on the computer used for the study can be found in Figure 3.3. All
participants used the AZERTY keyboard layout. The distribution of the mouse type can be
found in Figure 3.4. Note that it is possible that a participant uses multiple pointing devices.
One participant used a desktop computer while the other 13 participants used laptops.
3.3 Software
The data collection software was a Windows application called Behaviour Analyst, developed
in C# using the .NET 4.5 framework. It was tested on Windows 10 but should work in any
Windows environment that supports .NET 4.5.
3.3.1 Installation, Automatic Updates and Heartbeat
The software was distributed through a website where participants could download an in-
staller package and view an easy and detailed installation manual (see Appendix D). The
installer package was developed using the Nullsoft Scriptable Install System5 and presented
5 NSIS: http://nsis.sourceforge.net/
Figure 3.1: Age distribution of participants

Figure 3.2: Education distribution of participants

Figure 3.3: Distributions of skills and time spent on the computer of participants:
(a) typing skills and computer skills; (b) time spent on the computer

Figure 3.4: Mouse type distribution of participants
the participant with a short license agreement before installing the software on the system,
adding it to the startup programs in the operating system so that it would also run after
a reboot and adding configuration settings to the Windows registry. Screenshots of the in-
stallation process are shown in Figure 3.5. After the installation finished, the software was
automatically started in the background. The software is only visible as a small icon in the
system tray (see Figure 3.6). The participant had to double-click this icon to make a control
panel pop up, as shown in Figure 3.7. This control panel contained some configuration
settings, including the participant's id. Initially, this field was empty and had to be filled in
manually by each participant. The id to fill in was automatically generated by the download
page and included in the installation manual. When the participant had filled in the correct
id and clicked the Apply-button, the control panel disappeared and the installation was
finished.
(a) Introduction page (b) License agreement
(c) Installation destination (d) Finish and startup
Figure 3.5: Software installation
Figure 3.6: Icon in the system tray
Figure 3.7: Control panel
The software checked for available updates each time it was started and every 30 minutes
thereafter. This was done by downloading an XML file from an update server that contained
information on the most recent version of the software. If the software detected that it was
not up-to-date, it would immediately download the installer package and run it in silent
mode. The installer package made sure that the software was shut down properly before
updating the necessary files and restarting the software. The participants did not notice
anything of the whole updating process.
While performing a check for available updates, a heartbeat was also automatically sent to
a heartbeat server. This heartbeat contained the participant’s id and the software version
being used.
3.3.2 Capturing Data
Three types of data were captured: keystrokes, mouse movements and location data. Weather
information was derived from the location data.
Keystrokes
The data collection software used a low-level Windows function, accessed through unmanaged
code in C#, allowing it to process each keystroke before it is forwarded to the intended
application. All capturing of keystrokes was handled by a monitor class. This class also
makes use of a time window to make sure that the participant is not bothered with emotional
questionnaires too often, or when keystroke data is no longer relevant. It does this by
keeping separate log files for each window and keeping track of the current window. The
window is always named after the Unix timestamp6 (in milliseconds) of the first keystroke
record it contains, and the corresponding log file is called k[first window timestamp].log.
This window can expire when there have been no keystrokes for a long time. This is called
the expiration time in the software and can be set in the control panel. In this research, a
value of 1 hour (or 3600000 ms) was used. When a keystroke is captured, the monitor first
checks whether the
time that has passed since the last keystroke is smaller than the expiration time. If this is not
the case, the active window is terminated and the data contained by it is sent to the server,
using a secured connection, without asking the participant to indicate his emotional state,
and a new window is started. Captured data without emotion information is called unlabeled
data. When the active window is not expired, the monitor checks whether the keystroke is
either a keydown or keyup event. Next, a line containing information about the keystroke
(timestamp, keydown/keyup, key value, virtual key code and the title of the active window)
is added to the log file of the window.
Then, the monitor will check whether the active window contains a certain amount of keystrokes
that are keydown events. This is called the keystroke threshold and can also be set in the
control panel. In this research, a value of 1000 keydowns was used. The monitor will also
check whether the time that has passed since the start of the window is larger than the so-
called time threshold. The time threshold makes sure that participants are not bothered too
often. It can be set in the control panel and in this research a value of 1 hour was used.
Only when both of these conditions are met, the participant is asked to fill out an emotional
questionnaire. This is done using a native Windows notification (see Figure 3.8). When this
notification is ignored, the data contained by the window is again sent to the server, using a
secure connection, as unlabeled data and a new window is started.
Figure 3.8: Questionnaire request notification
6 Unix time: https://en.wikipedia.org/wiki/Unix_time
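The windowing rules above can be sketched as follows. This is a simplified model (the real implementation is a C# monitor class); the constants mirror the settings used in this research, and the returned action names are illustrative:

```python
EXPIRATION_MS = 3_600_000      # 1 hour without keystrokes expires the window
KEYSTROKE_THRESHOLD = 1000     # keydown events needed before asking
TIME_THRESHOLD_MS = 3_600_000  # minimum window age before asking

def handle_keystroke(window, now_ms, is_keydown):
    """Return the action the monitor would take for one captured keystroke.

    `window` is a dict with 'start_ms', 'last_ms' and 'keydowns'.
    """
    # Expired window: ship its data unlabeled, start fresh.
    if now_ms - window["last_ms"] >= EXPIRATION_MS:
        return "send_unlabeled_and_start_new_window"
    window["last_ms"] = now_ms
    if is_keydown:
        window["keydowns"] += 1
    # Only ask when BOTH the keystroke and the time threshold are met.
    if (window["keydowns"] >= KEYSTROKE_THRESHOLD
            and now_ms - window["start_ms"] >= TIME_THRESHOLD_MS):
        return "show_questionnaire_notification"
    return "log_keystroke"
```

Requiring both conditions is what keeps questionnaire requests infrequent even during long, intensive typing sessions.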
Mouse movements
Mouse movements are also captured by the same monitor class using a low-level Windows
function accessed by unmanaged code in C#. However, not all movements are logged. The
monitor uses a so-called mouse frequency which is expressed in Hz (1/s) and can be set in
the control panel. This frequency determines how often mouse movements should be logged.
Each time the monitor captures a mouse movement, it checks whether the time (in ms) that
has passed since the last mouse movement in the active window is greater than or equal to
1000 / (mouse frequency) ms. If this is the case, the coordinates of the mouse are logged together with a
timestamp. In this research, the mouse frequency was set to 1 Hz. The log files for the
mouse movements are stored separately from the keystroke log files. However, they are also
split according to the keystroke windows, have a similar naming structure (m[first window
timestamp].log) and are also sent to the server when the window is terminated (for both the
unlabeled and labeled cases).
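The sampling rule amounts to a simple throttle, sketched below (function and parameter names are illustrative; the default mirrors the 1 Hz setting used in this research):

```python
def should_log(last_logged_ms, now_ms, mouse_frequency_hz=1.0):
    """Log a mouse position only if at least 1000/frequency ms have passed."""
    min_interval_ms = 1000.0 / mouse_frequency_hz
    return now_ms - last_logged_ms >= min_interval_ms
```

At 1 Hz this caps the log at one coordinate pair per second, regardless of how many raw movement events Windows delivers.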
Location and Weather
For capturing the participant’s location information, the software uses the GeoCoordinate-
Watcher class of the .NET framework. The location of the participant is kept up-to-date
asynchronously and each time a location update is performed, the weather information for
the new location is also requested. Weather information is obtained through the OpenWeath-
erMap7 API. This API takes coordinates in the form of (latitude, longitude)-pairs and returns
weather information such as temperature, humidity, air pressure and more.
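A request against the current-weather endpoint might be built as follows. The endpoint path and the lat/lon/appid parameter names reflect the public OpenWeatherMap API; the API key and the helper itself are placeholders:

```python
from urllib.parse import urlencode

def weather_url(latitude, longitude, api_key):
    """Build an OpenWeatherMap current-weather request for a coordinate pair."""
    params = urlencode({"lat": latitude, "lon": longitude, "appid": api_key})
    return f"https://api.openweathermap.org/data/2.5/weather?{params}"

url = weather_url(51.05, 3.72, "YOUR_API_KEY")  # roughly Ghent
```

The JSON body returned by this endpoint (temperature, humidity, pressure, and so on) was stored as-is, as described in Section 3.3.4.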
3.3.3 Questionnaire
When the participant receives a notification that requests to indicate his emotional state and
clicks this notification, the participant is presented with a small questionnaire form. This form
is shown in Figure 3.9. It contains three sliders, one for each dimension of the PAD emotion
model. These can be used by the participant to indicate his emotional state. The participant
can also view a window containing extra explanation on how to indicate emotions in the PAD
model (the Tips window, see Figure 3.10), when clicking on the button with the question mark
on it. This Tips window is also shown automatically when the participant clicks a notification
for the first time before he is presented with the questionnaire form. When the participant has
indicated his emotional state, the Submit-button can be pressed and the window containing
the keystroke and mouse data is packed into a data package together with the location and
weather information at that moment. Furthermore, the data package contains a timestamp
of the moment when the Submit-button is clicked as well as the time that has passed between
clicking the notification and clicking the Submit-button (called delay, to indicate how much
7 OpenWeatherMap: http://www.openweathermap.org
time the participant needed to indicate his emotional state). This data package is sent to the
server using a secured connection.
Figure 3.9: Emotion questionnaire form
Figure 3.10: Tips window
When a participant clicks the notification, but then decides to close the questionnaire form
before submitting any emotional state information, the data package is still created but will
not contain timing information. It will then be sent to the server as unlabeled data, again
using a secured connection. An unlabeled data package has the values -1 for each PAD emotion
dimension and 0 as questionnaire delay value. When the participant is not connected to the
internet or when the software for some reason fails to transmit data to the server, the data that
needs to be sent is saved in a buffer file. This happens for all cases of data transmission (both
unlabeled and labeled data). Each time the software is started, it will first check whether
there is still data in the buffer file and will try to resend it if this is the case. This is why it
is interesting to keep the timestamp of the moment when the Submit-button is clicked. In
case the data is sent at a later time than when the data was submitted, the time of the data
package arriving at the server will be different from the time when it was submitted.
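The buffer-and-retry behaviour can be sketched as follows (the file format and function names are illustrative, not those of the actual C# implementation):

```python
import json
import os

BUFFER_FILE = "buffer.jsonl"  # illustrative name: one JSON package per line

def send_or_buffer(package, transmit):
    """Try to transmit a data package; on failure, append it to the buffer."""
    try:
        transmit(package)
    except OSError:  # e.g. no internet connection
        with open(BUFFER_FILE, "a", encoding="utf-8") as f:
            f.write(json.dumps(package) + "\n")

def flush_buffer(transmit):
    """On startup: resend buffered packages, re-buffering any that fail again."""
    if not os.path.exists(BUFFER_FILE):
        return
    with open(BUFFER_FILE, encoding="utf-8") as f:
        pending = [json.loads(line) for line in f if line.strip()]
    os.remove(BUFFER_FILE)
    for package in pending:
        send_or_buffer(package, transmit)
```

Because each package carries its own submission timestamp, a delayed retransmission does not corrupt the timing information, exactly as the paragraph above explains.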
All log files are also deleted each time the software is started so no unnecessary space is used.
3.3.4 Server and Database
The entire server program was programmed using PHP. All data is stored in a MySQL
database on the server. A complete database schema can be found in Figure 3.11. For
all server requests that require database operations, these operations were contained in a
database transaction to make sure that requests were either completely executed (commit)
or failed without affecting the database otherwise (rollback). There are two main processing
parts: the data processor and the heartbeat processor.
Figure 3.11: Database schema
The heartbeat processor accepts POST requests containing a user and version field. Such
a request is called a heartbeat. On reception of each heartbeat, the server updated the
timestamp of the last heartbeat received of the corresponding participant, as well as the
software version the participant was using. This way, it could easily be seen if software
problems occurred at the user’s side. This will be discussed in more detail below. The
data processor processes all data packages (either labeled or unlabeled) that are sent by
the software. It accepts POST requests containing user, pleasure, arousal, dominance,
time, delay, latitude, longitude, keystrokeData, mouseData and weatherData fields. The
keystrokeData and mouseData fields contain JSON8-encoded data representing collections of
keystroke and mouse movement objects respectively. The weatherData field contains the
JSON-encoded data that is received from the OpenWeatherMap API. First, the request fields
are parsed and a questionnaire record is inserted into the database. The JSON-encoded data
from the weatherData field is directly inserted into the corresponding record field without
further parsing. All data that is inserted into the database fields is escaped to prevent SQL
injection9. Next, the id of the newly inserted record is used as the reference (ref) value for
inserting keystroke and mouse records into the database for each object that is contained in
the corresponding JSON collection.
The server responds to each request by printing a JSON-encoded array that contains a field
with an error code and a corresponding error description. If no errors occurred while process-
ing the request, the error code is 0 and the description will be empty. This server response
is also used by the software to provide feedback to the participants when server errors oc-
cur and to determine whether the data package needs to be stored in the buffer for later
retransmission.
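On the client side, handling such a response could look like the sketch below. The field names "error" and "description" are assumptions matching the description above, and the helper is illustrative:

```python
import json

def handle_server_response(body: str) -> bool:
    """Return True if the package was accepted, False if it must be buffered.

    Assumes the server replies with a JSON object such as
    {"error": 0, "description": ""} on success.
    """
    response = json.loads(body)
    if response["error"] == 0:
        return True
    # Non-zero code: surface the description and signal retransmission.
    print(f"Server error {response['error']}: {response['description']}")
    return False
```

A False return value here is what would trigger writing the package to the local buffer file for later retransmission.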
3.3.5 Problems
During the entire course of the study, some problems occurred with the software. These
were mainly small bugs that were quickly fixed. Most of these small bugs only occurred
in very specific cases which resulted in most participants not experiencing any trouble. A
disadvantage in the beginning of the study was the fact that for each software update, all
participants had to be contacted individually for a manual update. This was the main reason
for adding the automatic updates to the software. Using this feature, bug fixes could be
deployed very quickly and efficiently without having to bother the participants.
Another problem arose at the beginning of the study when the software was rolled
out: a simple installation process did not yet exist. Originally, a zip10-file,
containing the necessary software files and a command prompt file that opened the Windows
startup folder, needed to be downloaded. Each participant then had to manually extract the
files to a desired destination, create a shortcut to the software in the Windows startup folder
and run the software. This was difficult for many less experienced participants and required
a lot of assistance from the study administrator. Also, the Windows registry was not used
to store configuration settings. Instead, these were stored in a configuration file which made
8 JavaScript Object Notation http://www.json.org/
9 SQL injection http://www.w3schools.com/sql/sql_injection.asp
10 Zip file format https://en.wikipedia.org/wiki/Zip_(file_format)
it harder to deploy specific updates regarding configuration settings. Both of these problems
resulted in the creation of a simpler installation process using an NSIS installer and by using
the Windows registry for storing configuration settings.
Participants who had antivirus software installed on their computers often experienced the
problem that the data collection software was marked as malware and automatically removed.
The heartbeat feature offered a convenient solution here, as it made it easy to detect when a
long time had passed since the last heartbeat was received from a participant. Indeed, many
cases of automatic software removal by antivirus software were detected this way, after which
the participant was asked to reinstall the software and continue the data collection process.
Chapter 4
Data Processing
Before models could be built to recognize emotions based on the data that was gathered, this
data needed to be processed and features had to be extracted. This chapter first presents
an overview of specific considerations made before the feature extraction process. Next, the
particular keystroke, mouse and context features that were extracted are discussed as well as
the process of handling the data labels.
4.1 Data Formatting
As mentioned before, all data was persisted in a MySQL database on a data collection server.
It can be seen from Figure 3.11 that this data is structured in a hierarchical fashion. Each
subject record has a number of associated questionnaire records, which in turn have a number
of associated keystroke and mouse records. To be able to work with the data, an efficient way
of navigating through this structure is required. Converting the data to an object-oriented
structure provides an ideal solution. For each participant, a user object can be created that
contains their personal data as its attributes. Furthermore, each user object contains a set of
questionnaire objects that they have submitted. In turn, each questionnaire object contains
the specific questionnaire data, a set of keystroke objects and a set of mouse movement
objects. Python1 was chosen as the main programming language. To convert the data from
the database to such an object-oriented structure, all data can be retrieved from the server
by issuing SQL requests, creating the corresponding objects and linking them to each other.
However, this is a very time-consuming process due to the delay caused by issuing the SQL
requests. Instead, all data was downloaded from the server in a CSV-format2. These CSV-
files could then be read very quickly using Python's csv module3. Next, after having defined
the proper Python classes, the corresponding objects were created and linked in such a way
1 Python: https://www.python.org/
2 Comma-separated values: https://en.wikipedia.org/wiki/Comma-separated_values
3 Python csv module: https://docs.python.org/2/library/csv.html
that a structured set of objects, which could be easily navigated, was obtained. To make the
process of loading the data even faster, the cPickle module4 was used to serialize the data
structure and persist it to a file that could be easily deserialized as well.
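The resulting hierarchy can be sketched with minimal classes. The attribute names are illustrative (the thesis does not list them), and the Python 2 cPickle module is shown here via the interface-compatible pickle module:

```python
import pickle

class Keystroke:
    def __init__(self, timestamp, event, vkcode):
        self.timestamp, self.event, self.vkcode = timestamp, event, vkcode

class Questionnaire:
    def __init__(self, pleasure, arousal, dominance):
        self.pleasure, self.arousal, self.dominance = pleasure, arousal, dominance
        self.keystrokes, self.mouse_moves = [], []  # child records

class User:
    def __init__(self, user_id):
        self.user_id = user_id
        self.questionnaires = []  # child records

# Build the hierarchy, then serialize and restore it in one step.
user = User("ab12")
q = Questionnaire(70, 40, 55)
q.keystrokes.append(Keystroke(1_447_577_000_000, "keydown", 65))
user.questionnaires.append(q)
restored = pickle.loads(pickle.dumps(user))
```

Navigating from a user down to individual keystrokes then becomes simple attribute access, which is exactly what makes this structure convenient for feature extraction.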
Some special cases needed to be considered during the processing of the data. To be able to
calculate keystroke timing features, corresponding keydown and keyup events needed to be
matched. However, some keydown events did not have a corresponding keyup event, which
caused problems for extracting timing features. This problem is the result of a participant
holding down a key for an extended period of time causing multiple keydown events and only
one keyup event. This is often seen when a user wants to type the same character a lot of times
or when holding down the shift key. This pattern was resolved by not taking into account the
intermediate keydown events between the initial keydown event and the keyup event. It could
also happen that a keydown event does not have a corresponding keyup event or vice versa
as a result of the keystroke window that is used during monitoring. For example, when a
keydown event is received by the software and the conditions for closing up the current window
are met, the keydown event will still be in the current window that is being closed while the
corresponding keyup event will be in the new window. Such events were also ignored by the
data processing scripts. Modifier keys and toggle keys also presented some challenges. Each
key on the keyboard has a unique virtual keycode (the vkcode) associated with it. However,
the exact character that a key represents when pressed depends on the state of the modifier
and the toggle keys. For example: the A-key represents a lowercase ’a’ in normal cases but
an uppercase ’A’ when the shift key is pressed at the same time or when the caps lock key
is toggled. Both modifier keys and toggle keys are handled by the data processing scripts
by defining state variables. When a keydown event for a modifier key or toggle key is seen,
the corresponding state changes. For modifier keys, the corresponding state changes back when a
keyup event is seen. This is not the case for toggle keys.
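The matching logic described above can be sketched as follows; the (type, vkcode, timestamp) event format and the function name are hypothetical.

```python
def match_keystrokes(events):
    """Pair keydown/keyup events per vkcode, ignoring the auto-repeat
    keydown events generated while a key is held down.
    Hypothetical event format: (type, vkcode, timestamp)."""
    pending = {}  # vkcode -> timestamp of the initial keydown
    pairs = []
    for etype, vkcode, t in events:
        if etype == "keydown":
            # Ignore intermediate keydowns while the key is already down.
            if vkcode not in pending:
                pending[vkcode] = t
        elif etype == "keyup":
            if vkcode in pending:
                pairs.append((vkcode, pending.pop(vkcode), t))
            # A keyup without a matching keydown (window boundary) is ignored.
    return pairs

events = [("keydown", 16, 0.00),  # shift pressed...
          ("keydown", 16, 0.50),  # ...auto-repeat keydown while held
          ("keydown", 65, 0.60),
          ("keyup", 65, 0.72),
          ("keyup", 16, 0.80)]
print(match_keystrokes(events))
# [(65, 0.6, 0.72), (16, 0.0, 0.8)]
```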
4.2 Feature Extraction
The features that were extracted from the entire dataset can be divided into four categories:
keystroke features, textual content features, mouse movement features and contextual fea-
tures. This section discusses each of these categories in detail.
4.2.1 Keystroke Features
The keystroke data obtained during this study consisted of keydown and keyup
events, accompanied by other characteristics, as explained in Chapter 3. The features
extracted from this data consist of timing features and frequency features.
4Python cPickle module: https://docs.python.org/2/library/pickle.html
Timing Features
A list of all timing features that were extracted from the keystroke data can be found in
Table 4.1. These features were chosen based on related work.
Table 4.1: Keystroke timing features
Name Description
AvgTypingSpeed The number of keystrokes in the sample per second.
D2D latency The duration from a keydown event to the next keydown
event.
U2U latency The duration from a keyup event to the next keyup event.
U2D latency The duration from a keyup event to the next keydown
event.
D2U duration The duration from a keydown event to the corresponding
keyup event of that key.
WeightedMean D2D latency The weighted mean of the D2D latencies.
WeightedMean U2U latency The weighted mean of the U2U latencies.
WeightedMean U2D latency The weighted mean of the U2D latencies.
WeightedMean D2U duration The weighted mean of the D2U durations.
Notice that the D2D latency, U2U latency, U2D latency and D2U duration features are actu-
ally lists of values. In these lists, there is a value for each consecutive pair of keydown events,
keyup events, keyup/keydown events and keydown/keyup events respectively. U2U and U2D
latencies can be negative. For example, when a participant is typing quickly, the second key
in a pair of two keypresses may be released before the first key. This can happen when a
participant presses each key with a different finger or hand. It is also possible that the second
key in a pair of two keypresses may be pressed before the first key is released. The negative
values were kept in the lists. Furthermore, as mentioned before, extra keydown events re-
sulting from holding down a key for an extended period of time were not taken into account
while calculating these features. For each of these value lists, different summary statistics
were used to obtain single feature values. The summary statistics that were used were: mean,
maximum, minimum, standard deviation, variance, mode, median, skew and kurtosis. A
number of these statistics have also been used in related work and others were used because
they describe the distribution of the value lists (e.g. skew, kurtosis, standard deviation). For
the weighted mean features, the weighting was performed according to the inverse of the time
that had passed between the keystroke event and the submission of the sample. The usage of
weighted means as features is motivated by the possibility that more recent keystroke data,
relative to the moment that the sample was submitted, is more indicative for the sample label
compared to older keystroke data.
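The weighted mean computation can be sketched as follows, assuming each value is weighted by the inverse of the time elapsed between its keystroke event and the submission of the sample; the function name and data are hypothetical.

```python
def weighted_mean(values, event_times, submit_time):
    """Weighted mean where more recent events (relative to the moment the
    sample was submitted) weigh more: weight = 1 / elapsed time.
    Assumes all event_times lie strictly before submit_time."""
    weights = [1.0 / (submit_time - t) for t in event_times]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical D2U durations with the timestamps of their keydown events:
durations = [0.12, 0.10, 0.20]
times = [10.0, 50.0, 90.0]
print(weighted_mean(durations, times, 100.0))
```

The most recent duration (0.20 at t = 90) dominates the result, pulling the weighted mean well above the plain mean of 0.14.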
An outlier removal process was also performed for the calculation of the D2D latency, U2U latency,
U2D latency and D2U duration features by using the interquartile range (IQR) of the values
in the lists. This was necessary so that extremely long latencies and durations, caused by
long pauses or incorrectly matched keystrokes, would not be taken into account. For each
list of values, the IQR was calculated and all values higher than the third quartile value +
1.5 times the IQR were removed from the list and not taken into account while calculating
the summary statistics. This is a commonly used technique for outlier removal [59]. The
parameter of 1.5 times the IQR could be easily chosen in the scripts.
In total, 41 keystroke timing features were extracted. Their box-and-whisker plots can be
found in Figure 4.1.
Figure 4.1: Box-and-whisker plots for keystroke timing features
Frequency Features
A list of all frequency features that were extracted from the keystroke data can be found in
Table 4.2. These features have been chosen based on related work.
Note that the NumChar freq, AlphabetChar freq, Space freq, Return freq, Punctuation freq
and AvgWordLength features only give an indication of the text that was actually typed with
the keyboard and not of all the text that was used. For example, it is possible that a partic-
ipant copy-pastes text from a source and does not actually type any of the used characters.
These pieces of copy-pasted text are not included in the features. Also the Errors freq feature
does not include all forms of error correction. A participant may select a piece of text that
is longer than one character and press the delete or backspace key only once to remove the
Table 4.2: Keystroke frequency features
Name Description
NumChar freq The frequency of numerical characters in the sample.
AlphabetChar freq The frequency of alphabetical characters in the sample.
Del freq The frequency of the usage of the delete key in the sample.
Backspace freq The frequency of the usage of the backspace key in the sample.
Errors freq The frequency of errors (sum of deletes and backspaces) in the sample.
Shift freq The frequency of the usage of one of the shift keys in the sample.
Space freq The frequency of the usage of the space key in the sample.
Arrow freq The frequency of the usage of one of the arrow keys in the sample.
CapsLock freq The frequency of the usage of the caps lock key in the sample.
Return freq The frequency of the usage of one of the return keys in the sample.
Punctuation freq The frequency of the usage of a punctuation symbol (’.’, ’,’, ’?’, ’ !’,
’:’ or ’;’) in the sample.
AvgWordLength The average length of the words in the sample.
LongPause freq The frequency of long periods without typing in the sample.
entire piece of selected text instead of pressing the delete or backspace key once for each
character that needs to be removed. A participant may even select a piece of text and start
writing other text to correct an error without ever pressing the backspace or delete key. The
LongPause freq feature is calculated by computing all U2D latencies and then determining the
frequency of values that are higher than a certain threshold. This threshold was set at
60 seconds, as this seemed like a realistic value for detecting non-continuous typing,
and can easily be changed in the script.
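The LongPause freq computation translates directly into code; the function name is hypothetical and the 60-second threshold is the default from the text.

```python
def long_pause_freq(u2d_latencies, threshold=60.0):
    """Fraction of U2D latencies exceeding the pause threshold
    (60 seconds by default, as in the text; easily adjustable)."""
    if not u2d_latencies:
        return 0.0
    return sum(1 for l in u2d_latencies if l > threshold) / len(u2d_latencies)

# Hypothetical U2D latencies in seconds; one long pause out of four:
print(long_pause_freq([0.2, 0.3, 75.0, 0.25]))  # 0.25
```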
In total, 13 keystroke frequency features were extracted. Their box-and-whisker plots can be
found in Figure 4.2. The box-and-whisker plot for the CapsLock freq feature was left out, as
these feature values were non-zero only 7.9% of the time, so its box-and-whisker plot would
obfuscate the figure.
Figure 4.2: Box-and-whisker plots for keystroke frequency features
4.2.2 Textual Content Features
Before extracting textual content features, the pieces of text contained in each sample needed
to be reconstructed. Remember that only a set of keystroke events accompanied by other
characteristics is available. To obtain the entire text content, only the keydown events were
considered as these are the ones that give rise to the production of characters. The states
for the modifier and toggle keys also needed to be kept, as explained in the previous section.
These states were taken into account when determining which exact character is represented
by each keystroke event. Special characters that are not alphanumeric (except for spaces and
new lines) are ignored and not included in the reconstructed text. Backspaces are handled by
removing the last character of the text string at the moment the backspace keystroke occurs.
Once the pieces of text have been reconstructed for each sample, textual content features
can be extracted. The sklearn package5 for Python has a module containing feature
extraction methods for text data. The first step is to convert the text pieces to a matrix of
token counts, i.e. bag of words, which produces a sparse representation of the counts. This
matrix can then be transformed to a normalized term frequency or TF-IDF representation.
The number of features that is produced by this technique depends on the collection of text
pieces that is used as input.
5. Scikit-learn: http://scikit-learn.org/
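The first step can be illustrated with a minimal standard-library sketch that mirrors what sklearn's CountVectorizer produces (the actual implementation used sklearn; the function name and whitespace tokenization here are simplifying assumptions).

```python
from collections import Counter

def bag_of_words(texts):
    """Minimal bag-of-words sketch: build a vocabulary over all
    reconstructed text pieces and turn each piece into a vector of
    token counts (a dense version of CountVectorizer's sparse output)."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["happy happy day", "sad day"])
print(vocab)   # ['day', 'happy', 'sad']
print(matrix)  # [[1, 2, 0], [1, 0, 1]]
```

The resulting count matrix is what gets rescaled into the normalized term frequency or TF-IDF representation; the number of columns grows with the vocabulary of the input collection, as noted above.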
4.2.3 Mouse Movement Features
One mouse movement feature was extracted, based on related work: the average mouse speed.
To obtain this feature, the Euclidean distance between each two consecutive mouse movement records in
a sample was calculated and divided by the time that had passed between moving from the
first position to the second position. The average of these speeds is the average mouse speed.
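This computation can be sketched as follows; the (x, y, timestamp) record format and the function name are assumptions.

```python
import math

def average_mouse_speed(records):
    """Average of the segment speeds between consecutive mouse positions.
    Hypothetical record format: (x, y, timestamp)."""
    speeds = []
    for (x1, y1, t1), (x2, y2, t2) in zip(records, records[1:]):
        dist = math.hypot(x2 - x1, y2 - y1)  # Euclidean distance
        dt = t2 - t1
        if dt > 0:
            speeds.append(dist / dt)
    return sum(speeds) / len(speeds) if speeds else 0.0

records = [(0, 0, 0.0), (3, 4, 1.0), (3, 10, 3.0)]
print(average_mouse_speed(records))  # (5.0 + 3.0) / 2 = 4.0
```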
4.2.4 Contextual Features
As some contextual information, such as location and weather data, was also collected during
the study, this data could be used as well. The actual coordinates of participants are not
particularly interesting as these will probably be different for each participant. Weather
data is however less variable and can be used to infer a participant’s mood [32]. More
specifically, the temperature, pressure and humidity from this weather data were extracted.
The discomfort index was also used as a contextual feature. The formula for this index is as
follows: DI = T − 0.55 · (1 − 0.01 · H) · (T − 14.5), where T is the temperature expressed
in °C and H is the relative humidity expressed as a percentage. The box-and-whisker plots can
be found in Figure 4.3.
Figure 4.3: Box-and-whisker plots for contextual features
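The discomfort index formula translates directly into code:

```python
def discomfort_index(temp_c, humidity_pct):
    """DI = T - 0.55 * (1 - 0.01 * H) * (T - 14.5),
    with T in degrees Celsius and H the relative humidity in percent."""
    return temp_c - 0.55 * (1 - 0.01 * humidity_pct) * (temp_c - 14.5)

# A warm day (30 degrees C) at 60% relative humidity:
print(round(discomfort_index(30.0, 60.0), 2))  # 26.59
```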
Because some of the participants had disabled the location services on their
computer, not all samples contain weather data as this depends on the location of the partic-
ipant. Thus, these features contain missing data and the models need to be able to deal with
this in order to use them.
In total, 1177 labeled samples were collected and 59 features were extracted. These can be
used to build machine learning models for emotion recognition.
Chapter 5
Model Building
In this chapter the different models that were built are discussed. A first subdivision is made
based on whether they are a regression or classification model. Next, a second subdivision is
made based on whether the model is built using data of all participants or whether separate
models are built, one per participant. Finally, a third subdivision is made based on whether
the model is built using keystroke dynamics data, mouse dynamics data and contextual data
or if the model is built using textual content data. For the former models, i.e. the dynamics
models, random forest models were used with 500 trees and with the maximum number of features
used per tree set to the square root of the total number of available features. The
random forest algorithm is a robust algorithm that is not prone to overfitting. It has been
reported to outperform the SVM algorithm and requires little preprocessing. For the latter models,
i.e. the models using textual content data, SVMs with a linear kernel were used. SVMs have the
advantage of being able to deal with high-dimensional datasets, as is the case for textual
content data. The optimization of the model hyperparameters will be discussed later. For each
model type, different approaches were used to obtain a good model. For the model evaluation,
10-fold cross-validation was used.
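As a minimal sketch of how a 10-fold split partitions the sample indices: the actual evaluation used library cross-validation utilities, and the interleaved fold assignment shown here is just one simple choice (scikit-learn uses contiguous or shuffled folds).

```python
def kfold_indices(n_samples, k=10):
    """Sketch of a k-fold cross-validation split: every sample appears in
    exactly one test fold, and the remaining folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((sorted(train), test))
    return splits

splits = kfold_indices(20, k=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 18 2
```

Each of the 10 models is trained on 90% of the samples and evaluated on the held-out 10%, and the reported metrics are averaged over the folds.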
5.1 Regression
This section describes the approaches that were used to build regression models and is orga-
nized according to the second and third model type subdivisions, as described above. The
goal of regression models is to predict one or more axis values in the PAD emotion model. To
evaluate the different models, a number of error measures were used. These are the R2 score,
the mean absolute error (MAE), the mean squared error (MSE) and the explained variance
score (EVS). The R2 score is the most important metric, as it indicates how much better the
model performs compared to always predicting the mean value. The EVS is very similar to the
R2 score, while the MAE and MSE present a more natural interpretation of the model
performance.
5.1.1 General Dynamics Models
First, the general dynamics regression models are described. These are models that are built
using keystroke dynamics data, mouse dynamics data and contextual data of all participants.
Different approaches were used to build multiple models. Three degrees of freedom were used.
The first is whether contextual data, i.e. weather data, is used. This yields different results
as some participants did not have their location services enabled which caused their data not
to contain weather information. As the random forest implementation that is used is not
capable of dealing with missing data, this means that all samples that do not contain the
weather information need to be removed if this information needs to be taken into account.
The second degree of freedom is whether a model is built for each separate dimension of the
PAD model or whether one model is built to predict the entire PAD-tuple (joint). The third
degree of freedom that was used is whether feature selection was performed or not. To perform
feature selection, a random forest model was built using all features and then the 20 most
important features were extracted. Then the model for the actual regression tasks was built
using only these features. In total, 8 different random forest models for regression1 were built
and evaluated. The results can be found in Table 5.1. The best result is obtained with the
combination of including weather data, using the joint PAD space and using feature selection
(bold). This model was built using 630 labeled samples. The metrics that are presented were
calculated for each dimension of the PAD space and then averaged to obtain a result for the
entire model. Note that the best model obtains an R2 score of 0.1766, which is only slightly
better than always predicting the mean. The average MAE is 17.66 and the average of the
standard deviation on these MAE over the different dimensions is 11.767. This indicates that
this model does not perform very well.
1. Random Forest Regressor: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
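The feature selection step can be sketched as follows, assuming the importance scores come from a fitted random forest's feature_importances_ attribute; the function name and example values are hypothetical.

```python
def top_k_features(importances, k=20):
    """Indices of the k most important features, as read from a fitted
    random forest's feature_importances_ array; the regression model is
    then retrained on these columns only."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])

# Five hypothetical importance scores, keep the best two:
importances = [0.01, 0.20, 0.05, 0.30, 0.02]
print(top_k_features(importances, k=2))  # [1, 3]
```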
Table 5.1: General dynamics regression models results
Weather | PAD | Features | R2 | MAE | MSE | EVS
No | Joint | All | 0.143689528 | 17.78790994 | 453.023691 | 0.144182514
No | Joint | Best 20 | 0.161537472 | 17.54150949 | 443.7466007 | 0.161826172
No | Separate | All | 0.144214673 | 17.729484 | 452.6435168 | 0.144586658
No | Separate | Best 20 | 0.146617402 | 17.66438006 | 451.4649129 | 0.146919884
Yes | Joint | All | 0.14692657 | 18.16206667 | 469.8446163 | 0.14715312
Yes | Joint | Best 20 | 0.176618405 | 17.66490053 | 453.380924 | 0.17674802
Yes | Separate | All | 0.153372875 | 18.01380741 | 466.3389916 | 0.153531582
Yes | Separate | Best 20 | 0.163933007 | 17.85844233 | 460.7601332 | 0.164102493
5.1.2 Individual Dynamics Models
The individual dynamics regression models are very similar to the general dynamics regression
models except that they are built using data of only one participant at a time instead of
using data of all participants. Individual models were built for 8 participants, as only
participants who provided at least 59 samples were included, so that the number of samples
was at least equal to the number of possible features. The same degrees of freedom were used
to build multiple variants of the models and the evaluation results can be found in Table 5.2.
Again, the best result on average is obtained with the combination using the joint PAD space
and using feature selection, but now without including weather data. The metrics that are
presented were calculated for each participant and for each dimension of the PAD space and
then averaged to obtain a result for the entire model. Note that the best averaged model
(bold) obtains an R2 score of 0.048, which is not significantly better than always predicting the
mean. The average MAE is 16.19 and the average of the standard deviation on these MAE
over the different participants and dimensions is 10.55. This again indicates that this model
does not perform very well. The fact that not including weather data now yields a better
result can probably be explained by the fact that only 3 participants provided more than 59
samples that contained weather information, as can be seen in Table 5.3.
5.1.3 General Text Content Models
To build general text content regression models, the text content feature data of all partici-
pants is used to train a SVM for regression2. There were no degrees of freedom in these models
as the SVM implementation that was used is not capable of predicting the entire PAD-tuple,
so different models for each dimension were built. Furthermore, all textual content features
are equally important by definition so there can be no degrees of freedom here. The detailed
evaluation result can be found in Table 5.4. This model was built using 1177 labeled samples.
2. Epsilon-Support Vector Regression: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
Table 5.2: Individual dynamics regression models results
Weather | PAD | Features | R2 | MAE | MSE | EVS
No | Joint | All | 0.008913228 | 16.65004196 | 411.5193209 | 0.009823612
No | Joint | Best 20 | 0.048052021 | 16.18678528 | 394.3835932 | 0.048783084
No | Separate | All | 0.003724599 | 16.63586158 | 414.2228234 | 0.004583854
No | Separate | Best 20 | 0.027230397 | 16.34792232 | 404.0040023 | 0.027999946
Yes | Joint | All | -0.046333662 | 17.7818626 | 478.6445918 | -0.045494928
Yes | Joint | Best 20 | -0.011577862 | 17.28880378 | 459.9817373 | -0.011153212
Yes | Separate | All | -0.05383741 | 17.87304252 | 482.1243925 | -0.052686205
Yes | Separate | Best 20 | -0.023891601 | 17.42099593 | 464.6577326 | -0.022783861
Table 5.3: Number of samples per participant in individual dynamics models (both regression and
classification)
User | Samples (weather data excluded) | Samples (weather data included)
1 71 0
2 79 57
3 82 82
4 249 157
5 118 39
6 66 0
7 93 0
8 230 156
Total 988 491
The table presents the metrics that were calculated for each dimension of the PAD space and
also the averaged metrics that are used to obtain a result for the entire model. Note that an
R2 score of only 0.036 is achieved, which is only slightly better than always predicting the
mean. The average MAE is 19.69 and the average of the standard deviation on these MAE
over the different dimensions is 10.89. This indicates that this model does not perform very
well.
5.1.4 Individual Text Content Models
The individual text content regression models are very similar to the general text content
regression models but are built using data of only one participant at a time instead of using
data of all participants. Individual models were built for 13 participants. The evaluation
results can be found in Table 5.5. The table presents the metrics that were calculated for
each participant and for each dimension of the PAD space and also the averaged metrics
that were used to obtain a result for the entire model. Note that an averaged R2-score of
-0.05 is obtained, which is worse than always predicting the mean. The average MAE is
17.63 and the average of the standard deviation on these MAE over the different participants
and dimensions is 10.71. This indicates that this model performs very badly. The number of
samples used for each participant can be seen in Table 5.6.
5.1.5 Fuzzy Logic
Predicted PAD axis values need to be interpreted as an emotional state, or better yet, a
weighted combination of multiple emotional states. This requires mapping discrete emotional
states onto the PAD space. Different mappings have been proposed and empirical approaches
to building such mappings have been taken [31]. When we assume that a person experiences
a weighted combination of multiple emotions, fuzzy logic can be used to determine the extent
Table 5.4: General text content regression models results
R2 0.035591173
MAE 19.69351149
MSE 510.2788305
EVS 0.04181899
P-R2 0.012589376
P-MAE 19.66656626
P-MSE 484.3517047
P-EVS 0.026537171
A-R2 0.025445702
A-MAE 21.57589911
A-MSE 568.0424353
A-EVS 0.029893784
D-R2 0.06873844
D-MAE 17.83806909
D-MSE 478.4423514
D-EVS 0.069026015
Table 5.5: Individual text content regression models results
R2 -0.050807597
MAE 17.63898515
MSE 456.5681924
EVS -0.034470865
P-R2 -0.037283712
P-MAE 17.88843121
P-MSE 455.8425914
P-EVS -0.026363772
A-R2 -0.057245424
A-MAE 19.62208148
A-MSE 516.8555083
A-EVS -0.040458861
D-R2 -0.057893653
D-MAE 15.40644276
D-MSE 397.0064775
D-EVS -0.036589962
Table 5.6: Number of samples per participant in individual text content models (both regression and
classification)
User Amount of samples
1 31
2 18
3 249
4 93
5 33
6 230
7 118
8 41
9 66
10 79
11 82
12 71
13 41
Total 1152
to which each emotion is present, given PAD axis values. This can be done by defining a set
of fuzzy rules that define the relationship between the PAD axis values and each emotion.
For example, if the mapping, presented in Table 5.7, is used, the following fuzzy rules can be
defined:
1. IF pleasure IS negative AND arousal IS negative AND dominance IS negative
THEN bored IS present
2. IF pleasure IS negative AND arousal IS negative AND dominance IS positive
THEN disdainful IS present
3. IF pleasure IS negative AND arousal IS positive AND dominance IS negative
THEN anxious IS present
4. IF pleasure IS negative AND arousal IS positive AND dominance IS positive
THEN hostile IS present
5. IF pleasure IS positive AND arousal IS negative AND dominance IS negative
THEN docile IS present
6. IF pleasure IS positive AND arousal IS negative AND dominance IS positive
THEN relaxed IS present
7. IF pleasure IS positive AND arousal IS positive AND dominance IS negative
THEN dependent IS present
8. IF pleasure IS positive AND arousal IS positive AND dominance IS positive
THEN exuberant IS present
The fuzzy terms ’negative’, ’positive’ and ’present’ can then be defined by membership
functions that take values between 0 and 1. Example membership functions for these terms
are presented in Figure 5.1. The IS-operator calculates the function value for the variable
on its left-hand side using the membership function for the fuzzy term on its right-hand
side. The AND-operator can be implemented using a so-called t-norm; for combining
multiple rules concerning the same fuzzy terms, as well as for defuzzification, multiple
approaches are possible [67]. Applying such fuzzy rules to the predicted PAD axis values results in a set
of membership values for each emotion that can be interpreted as the extent to which each of
these emotions are present in the current emotional state of a person. Using fuzzy logic has
the advantage of reducing the importance of the accuracy of the predicted PAD axis values:
less accurate values can still result in a correct conclusion concerning the dominant emotion,
because fuzzy logic can take into account the inherent fuzziness
of the emotion information. Furthermore, it is also possible to draw conclusions about which
emotions are likely to occur at the same time.
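The rule evaluation can be sketched as follows for rule 6. The piecewise-linear membership shapes and the 30/70 breakpoints are assumptions based on Figure 5.1, and the minimum is used as t-norm, which is one common choice.

```python
def membership_negative(x):
    """Piecewise-linear membership for 'negative' on a 0-100 PAD axis;
    the 30/70 breakpoints are an assumed reading of Figure 5.1."""
    if x <= 30:
        return 1.0
    if x >= 70:
        return 0.0
    return (70 - x) / 40.0

def membership_positive(x):
    # Complementary shape for 'positive'.
    return 1.0 - membership_negative(x)

def rule_and(*degrees):
    """Minimum t-norm for the AND-operator (one common choice)."""
    return min(degrees)

# Rule 6: IF pleasure IS positive AND arousal IS negative
#         AND dominance IS positive THEN relaxed IS present
p, a, d = 80.0, 20.0, 60.0
relaxed = rule_and(membership_positive(p), membership_negative(a),
                   membership_positive(d))
print(relaxed)  # 0.75
```

Evaluating all eight rules in the same way yields a membership value per emotion, interpretable as the extent to which each emotion is present.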
5.2 Classification
This section describes the approaches that were used to build classification models and is again
organized according to the second and third model type subdivisions, as described above. To
be able to perform classification, the PAD emotion model first needs to be divided into a
number of different classes. Three different approaches were used to divide the PAD emotion
model in different classes. The first approach uses the k-means clustering algorithm to assign
all samples that belong to the same cluster to the same class so that k classes are obtained.
The value of k was chosen to be 8 to divide the entire PAD space into different classes. This
choice was made based on the fact that the PAD space is three-dimensional and thus contains
8 octants. The value of k was chosen to be 2 to divide only one PAD dimension into different
Table 5.7: PAD mapping according to octants
PAD octant Emotion
P-A-D- Bored
P-A-D+ Disdainful
P-A+D- Anxious
P-A+D+ Hostile
P+A-D- Docile
P+A-D+ Relaxed
P+A+D- Dependent
P+A+D+ Exuberant
Figure 5.1: Membership functions ((a) terms ’negative’ and ’positive’; (b) term ’present’)
classes. The second approach splits each PAD dimension into two parts (a negative and a
positive part, respectively having values between 0 and 50 and between 50 and 100). This
way, two classes are obtained per dimension or 2³ = 8 classes for the entire PAD space. The
third approach is similar to the second one but splits each PAD dimension into three parts (a
negative, a neutral and a positive part, respectively having values between 0 and 40, between
40 and 60 and between 60 and 100). This yields three classes per dimension or 3³ = 27
classes for the entire PAD space. The goal of classification models is then to predict the
correct class for each sample. To evaluate the different models, a number of error measures
were used. These are the weighted precision, weighted recall, weighted F1-score (incorporates
both precision and recall) and weighted ROC AUC score (area under the receiver operating
characteristic curve). Since the classes are highly imbalanced in both label sets, the accuracy
score is not included. For example, if the classifier labels all the samples as the majority class,
a good accuracy score would still be achieved, though in fact the classifier was not able to
learn the underlying pattern of the data. The confusion matrices will also be presented for
the best results.
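The second and third class-division approaches translate into simple threshold functions; how boundary values such as exactly 50 are assigned is an assumption here, as the text leaves it open, and the function names are hypothetical.

```python
def pad_octant_class(p, a, d, split=50.0):
    """Second approach: split each PAD axis at 50 into negative/positive,
    giving 2**3 = 8 classes for the joint space (values of exactly 50 are
    assigned to the positive side here; the text leaves this open)."""
    bits = [v >= split for v in (p, a, d)]
    return sum(b << i for i, b in enumerate(bits))

def pad_ternary_class(p, a, d):
    """Third approach: negative (< 40), neutral (40-60) and positive (> 60)
    per axis, giving 3**3 = 27 classes for the joint space."""
    def part(v):
        return 0 if v < 40 else (1 if v <= 60 else 2)
    return part(p) * 9 + part(a) * 3 + part(d)

print(pad_octant_class(80, 20, 60))   # P positive, A negative, D positive
print(pad_ternary_class(80, 20, 50))  # P positive, A negative, D neutral
```

The first approach, by contrast, derives the classes from k-means cluster assignments instead of fixed thresholds.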
5.2.1 General Dynamics Models
Different approaches were used to build multiple general dynamics classification models. Six
degrees of freedom were used. The first is whether contextual data, i.e. weather data, is
used, as described above. The second degree of freedom is again whether a model is built
for each separate dimension of the PAD model or whether one model is built to predict the
entire PAD-tuple. The third degree of freedom is whether the class system is based on the k-
means clusters, on positive/negative scales or on positive/neutral/negative scales.
The fourth degree of freedom is whether no class
weight balancing is performed, general class weight balancing is performed or subsample
class weight balancing is performed. General class weight balancing associates weights with
classes that are inversely proportional to class frequencies in the input data. Subsample
class weight balancing essentially does the same thing but adjusts the class weights according
to the class frequencies in the bootstrap sample for each tree grown. The fifth degree of
freedom is whether or not subsampling is performed in order to reduce bias due to class skew.
Subsampling removes random samples from each class that contains more samples than the
class that occurs least frequently until every class contains an equal number of samples. The
sixth degree of freedom that was used is whether feature selection was performed or not.
The feature selection process is again done using the feature importance deduced from a
random forest model. In total, 96 different random forest models for classification3 were
built and evaluated. The results can be found in Table 5.8. The metrics that are presented
were calculated for the entire PAD-tuple when it concerns models that use the joint PAD
space. When it concerns models that use separated PAD dimensions, they were calculated
for each dimension of the PAD space and then averaged to obtain a result for the entire
model. The ROC AUC score indicates that the best result is obtained with the combination
of not including weather data, using the joint PAD space that is divided into classes using
k-means clustering (with k = 8), not using class weight balancing, using subsampling and
using feature selection (bold). This model was built using 704 labeled samples and achieves a
ROC AUC score of 0.75 and an F1 score of 0.57 (± 0.058). The corresponding confusion matrix
is presented in Table 5.9. However, note that all models built on separate PAD dimensions
using classes determined by the k-means clusters (with k = 2) or by the positive/negative
class definition, no class weight balancing and subsampling (underlined) have similar ROC
AUC scores to this model but also have much higher precision, recall and F1 scores. So these
models actually perform better.
3. Random Forest Classifier: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
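The subsampling step described above can be sketched as follows; the function name and the fixed random seed are hypothetical.

```python
import random

def subsample_balance(samples, labels, seed=42):
    """Randomly drop samples from over-represented classes until every
    class has as many samples as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    n_min = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        for s in rng.sample(group, n_min):  # keep n_min random samples
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

samples = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
bal_s, bal_y = subsample_balance(samples, labels)
print(sorted(bal_y))  # ['a', 'a', 'a', 'b', 'b', 'b']
```

Class weight balancing, by contrast, keeps all samples and instead reweights the classes inversely to their frequencies.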
Table 5.8: General dynamics classification models results (CWB = class weight balancing, SS =
subsampling)
Weather | PAD | Classes | CWB | SS | Features | Precision | Recall | F1 | ROC AUC
Weather = No, PAD = Joint, Classes = K-means (k=8)
General | No | All | 0.32 | 0.336 | 0.309 | 0.602
General | No | Best 20 | 0.301 | 0.318 | 0.292 | 0.591
Subsample | No | All | 0.312 | 0.333 | 0.305 | 0.6
Subsample | No | Best 20 | 0.293 | 0.308 | 0.284 | 0.586
No | No | All | 0.318 | 0.335 | 0.304 | 0.6
No | No | Best 20 | 0.31 | 0.329 | 0.302 | 0.597
No | Yes | All | 0.566 | 0.561 | 0.562 | 0.749
No | Yes | Best 20 | 0.573 | 0.568 | 0.57 | 0.753
Weather = No, PAD = Joint, Classes = +/-
General | No | All | 0.367 | 0.415 | 0.327 | 0.559
General | No | Best 20 | 0.394 | 0.424 | 0.333 | 0.565
Subsample | No | All | 0.362 | 0.41 | 0.321 | 0.555
Subsample | No | Best 20 | 0.397 | 0.42 | 0.329 | 0.561
No | No | All | 0.35 | 0.412 | 0.334 | 0.565
No | No | Best 20 | 0.362 | 0.41 | 0.33 | 0.563
No | Yes | All | 0.4 | 0.409 | 0.401 | 0.662
No | Yes | Best 20 | 0.4 | 0.412 | 0.404 | 0.664
Weather = No, PAD = Joint, Classes = +/0/-
General | No | All | 0.236 | 0.344 | 0.219 | 0.534
General | No | Best 20 | 0.239 | 0.343 | 0.22 | 0.534
Subsample | No | All | 0.223 | 0.341 | 0.215 | 0.531
Subsample | No | Best 20 | 0.238 | 0.342 | 0.22 | 0.533
No | No | All | 0.261 | 0.346 | 0.234 | 0.544
No | No | Best 20 | 0.239 | 0.347 | 0.236 | 0.544
No | Yes | All | 0.031 | 0.074 | 0.043 | 0.519
No | Yes | Best 20 | 0.033 | 0.074 | 0.046 | 0.519
Weather = No, PAD = Separate, Classes = K-means (k=2)
General | No | All | 0.665 | 0.649 | 0.633 | 0.606
General | No | Best 20 | 0.68 | 0.665 | 0.648 | 0.617
Subsample | No | All | 0.669 | 0.655 | 0.639 | 0.611
Subsample | No | Best 20 | 0.677 | 0.674 | 0.653 | 0.617
No | No | All | 0.671 | 0.656 | 0.645 | 0.614
No | No | Best 20 | 0.671 | 0.67 | 0.653 | 0.618
No | Yes | All | 0.722 | 0.735 | 0.728 | 0.726
No | Yes | Best 20 | 0.702 | 0.718 | 0.71 | 0.706
Weather = No, PAD = Separate, Classes = +/-
General | No | All | 0.694 | 0.858 | 0.766 | 0.612
General | No | Best 20 | 0.696 | 0.863 | 0.77 | 0.616
Subsample | No | All | 0.695 | 0.857 | 0.767 | 0.614
Subsample | No | Best 20 | 0.696 | 0.866 | 0.772 | 0.617
No | No | All | 0.696 | 0.84 | 0.761 | 0.613
No | No | Best 20 | 0.697 | 0.846 | 0.764 | 0.616
No | Yes | All | 0.721 | 0.731 | 0.726 | 0.724
No | Yes | Best 20 | 0.716 | 0.728 | 0.722 | 0.719
Weather = No, PAD = Separate, Classes = +/0/-
General | No | All | 0.583 | 0.627 | 0.555 | 0.565
General | No | Best 20 | 0.594 | 0.634 | 0.563 | 0.571
Subsample | No | All | 0.592 | 0.632 | 0.561 | 0.571
Subsample | No | Best 20 | 0.61 | 0.637 | 0.568 | 0.575
No | No | All | 0.586 | 0.628 | 0.566 | 0.575
No | No | Best 20 | 0.606 | 0.639 | 0.579 | 0.586
No | Yes | All | 0.586 | 0.586 | 0.586 | 0.69
No | Yes | Best 20 | 0.54 | 0.541 | 0.54 | 0.656
Weather = Yes, PAD = Joint, Classes = K-means (k=8)
General | No | All | 0.311 | 0.319 | 0.303 | 0.597
General | No | Best 20 | 0.319 | 0.325 | 0.314 | 0.602
Subsample | No | All | 0.326 | 0.335 | 0.322 | 0.608
Subsample | No | Best 20 | 0.294 | 0.308 | 0.293 | 0.592
No | No | All | 0.332 | 0.333 | 0.317 | 0.605
No | No | Best 20 | 0.347 | 0.359 | 0.341 | 0.621
No | Yes | All | 0.517 | 0.518 | 0.515 | 0.725
No | Yes | Best 20 | 0.522 | 0.527 | 0.52 | 0.73
Weather = Yes, PAD = Joint, Classes = +/-
General | No | All | 0.397 | 0.41 | 0.324 | 0.553
General | No | Best 20 | 0.35 | 0.397 | 0.315 | 0.551
Subsample | No | All | 0.395 | 0.422 | 0.332 | 0.561
Subsample | No | Best 20 | 0.374 | 0.413 | 0.333 | 0.562
No | No | All | 0.369 | 0.403 | 0.329 | 0.557
No | No | Best 20 | 0.407 | 0.419 | 0.353 | 0.572
No | Yes | All | 0.266 | 0.288 | 0.27 | 0.593
No | Yes | Best 20 | 0.321 | 0.327 | 0.317 | 0.615
Weather = Yes, PAD = Joint, Classes = +/0/-
General | No | All | 0.281 | 0.351 | 0.221 | 0.532
General | No | Best 20 | 0.236 | 0.343 | 0.218 | 0.529
Subsample | No | All | 0.266 | 0.344 | 0.215 | 0.527
Subsample | No | Best 20 | 0.262 | 0.34 | 0.217 | 0.526
No | No | All | 0.24 | 0.341 | 0.231 | 0.535
No | No | Best 20 | 0.243 | 0.333 | 0.229 | 0.53
No | Yes | All | 0 | 0 | 0 | 0.481
No | Yes | Best 20 | 0 | 0 | 0 | 0.481
Weather = Yes, PAD = Separate, Classes = K-means (k=2)
General | No | All | 0.613 | 0.425 | 0.475 | 0.597
General | No | Best 20 | 0.629 | 0.458 | 0.509 | 0.618
Subsample | No | All | 0.603 | 0.416 | 0.465 | 0.591
Subsample | No | Best 20 | 0.638 | 0.449 | 0.503 | 0.611
No | No | All | 0.595 | 0.448 | 0.493 | 0.602
No | No | Best 20 | 0.621 | 0.472 | 0.521 | 0.621
No | Yes | All | 0.719 | 0.704 | 0.711 | 0.714
No | Yes | Best 20 | 0.735 | 0.711 | 0.722 | 0.727
Weather = Yes, PAD = Separate, Classes = +/-
General | No | All | 0.68 | 0.843 | 0.752 | 0.604
General | No | Best 20 | 0.691 | 0.84 | 0.758 | 0.616
Subsample | No | All | 0.676 | 0.837 | 0.747 | 0.597
Subsample | No | Best 20 | 0.679 | 0.83 | 0.746 | 0.599
No | No | All | 0.677 | 0.819 | 0.741 | 0.596
No | No | Best 20 | 0.686 | 0.82 | 0.747 | 0.607
No | Yes | All | 0.716 | 0.729 | 0.722 | 0.72
No | Yes | Best 20 | 0.693 | 0.716 | 0.704 | 0.699
Weather = Yes, PAD = Separate, Classes = +/0/-
General | No | All | 0.578 | 0.629 | 0.56 | 0.567
General | No | Best 20 | 0.59 | 0.637 | 0.576 | 0.58
Subsample | No | All | 0.582 | 0.63 | 0.56 | 0.567
Subsample | No | Best 20 | 0.599 | 0.642 | 0.584 | 0.588
No | No | All | 0.581 | 0.631 | 0.57 | 0.576
No | No | Best 20 | 0.587 | 0.636 | 0.584 | 0.592
No | Yes | All | 0.544 | 0.542 | 0.542 | 0.656
No | Yes | Best 20 | 0.591 | 0.59 | 0.59 | 0.692
5.2.2 Individual Dynamics Models
For the individual dynamics classification models, the same degrees of freedom as for the
general dynamics classification models were used to build multiple model variants. They were
built for the same 8 participants as in the individual dynamics regression models. The
evaluation results can be found in Table 5.10. For models that use the joint PAD space, the
presented metrics were calculated for each participant over the entire PAD-tuple; for models
that use separated PAD dimensions, they were calculated for each dimension of the PAD space
and then averaged to obtain a result for the entire model. Note that the results for models
that use the joint PAD space in combination with the positive/neutral/negative class
definition were left out: this class definition leads to 27 possible classes, none of the
participants' data contained each of these classes, and some of the measures are therefore
undefined. The ROC AUC score indicates that the best result on average is obtained with the
combination of including the weather data, using the separated PAD space in which each axis
is divided into a positive, negative and neutral class, not using class weight balancing,
using subsampling and using feature selection (bold). This is quite a different model from
the best one found in general dynamics classification. It obtains an average ROC AUC score
of 0.677 and an F1 score of 0.564 (± 0.271). However, when the models that perform slightly
worse in terms of ROC AUC score but much better in terms of precision, recall and F1 scores
in general dynamics classification are examined again, the same phenomenon is found for the
individual models: all models using the separated PAD space with classes determined by the
k-means clusters (with k = 2) or by the positive/negative class definition, no class weight
balancing and subsampling (underlined) have slightly lower ROC AUC scores but much higher
precision, recall and F1 scores. This is exactly the same family of models that was found
in general dynamics classification.

Table 5.9: Confusion matrix for the best general dynamics classification model

     1       2       3       4       5       6       7       8
1    55.68%  5.68%   5.68%   7.92%   5.68%   14.8%   3.44%   1.12%
2    7.92%   54.56%  4.56%   2.24%   10.24%  3.44%   6.8%    10.24%
3    7.92%   7.92%   40.88%  5.68%   13.6%   3.44%   7.92%   12.48%
4    4.56%   7.92%   1.12%   65.92%  6.8%    2.24%   5.68%   5.68%
5    6.8%    5.68%   6.8%    3.44%   62.48%  10.24%  4.56%   0%
6    3.44%   1.12%   3.44%   2.24%   6.8%    77.28%  2.24%   3.44%
7    6.8%    9.12%   4.56%   7.92%   4.56%   6.8%    51.12%  9.12%
8    5.68%   6.8%    12.48%  4.56%   3.44%   2.24%   12.48%  52.24%
Table 5.10: Individual dynamics classification models results (CWB = class weight balancing,
SS = subsampling)

CWB        SS   Features  Precision  Recall  F1     ROC AUC

Weather = No, PAD = Joint, Classes = K-means (k=8)
General    No   All       0.172      0.212   0.181  0.513
General    No   Best 20   0.202      0.234   0.208  0.528
Subsample  No   All       0.186      0.218   0.190  0.516
Subsample  No   Best 20   0.209      0.234   0.210  0.528
No         No   All       0.172      0.224   0.184  0.518
No         No   Best 20   0.212      0.251   0.219  0.537
No         Yes  All       0.231      0.272   0.244  0.584
No         Yes  Best 20   0.262      0.285   0.265  0.592

Weather = No, PAD = Joint, Classes = +/-
General    No   All       0.277      0.407   0.309  0.522
General    No   Best 20   0.279      0.397   0.304  0.518
Subsample  No   All       0.256      0.390   0.293  0.511
Subsample  No   Best 20   0.268      0.395   0.306  0.517
No         No   All       0.276      0.405   0.312  0.521
No         No   Best 20   0.319      0.448   0.352  0.552
No         Yes  All       0.170      0.188   0.173  0.536
No         Yes  Best 20   0.221      0.219   0.212  0.554

Weather = No, PAD = Separate, Classes = K-means (k=2)
General    No   All       0.471      0.389   0.404  0.544
General    No   Best 20   0.585      0.463   0.493  0.598
Subsample  No   All       0.468      0.388   0.402  0.542
Subsample  No   Best 20   0.584      0.469   0.496  0.603
No         No   All       0.492      0.405   0.420  0.543
No         No   Best 20   0.561      0.461   0.488  0.593
No         Yes  All       0.658      0.633   0.644  0.648
No         Yes  Best 20   0.643      0.617   0.629  0.639

Weather = No, PAD = Separate, Classes = +/-
General    No   All       0.650      0.677   0.637  0.533
General    No   Best 20   0.701      0.711   0.682  0.580
Subsample  No   All       0.656      0.673   0.634  0.533
Subsample  No   Best 20   0.697      0.713   0.686  0.583
No         No   All       0.625      0.666   0.631  0.535
No         No   Best 20   0.686      0.706   0.680  0.583
No         Yes  All       0.653      0.664   0.657  0.653
No         Yes  Best 20   0.620      0.642   0.629  0.624

Weather = No, PAD = Separate, Classes = +/0/-
General    No   All       0.508      0.595   0.533  0.519
General    No   Best 20   0.557      0.625   0.570  0.549
Subsample  No   All       0.496      0.594   0.529  0.517
Subsample  No   Best 20   0.549      0.619   0.565  0.546
No         No   All       0.514      0.591   0.537  0.523
No         No   Best 20   0.559      0.623   0.575  0.556
No         Yes  All       0.441      0.443   0.435  0.582
No         Yes  Best 20   0.513      0.511   0.507  0.633

Weather = Yes, PAD = Joint, Classes = K-means (k=8)
General    No   All       0.179      0.234   0.194  0.513
General    No   Best 20   0.201      0.254   0.218  0.526
Subsample  No   All       0.168      0.226   0.186  0.506
Subsample  No   Best 20   0.215      0.264   0.228  0.531
No         No   All       0.162      0.228   0.182  0.505
No         No   Best 20   0.191      0.254   0.212  0.523
No         Yes  All       0.271      0.280   0.265  0.588
No         Yes  Best 20   0.286      0.308   0.291  0.604

Weather = Yes, PAD = Joint, Classes = +/-
General    No   All       0.313      0.404   0.337  0.556
General    No   Best 20   0.298      0.379   0.321  0.540
Subsample  No   All       0.323      0.416   0.347  0.562
Subsample  No   Best 20   0.300      0.382   0.325  0.542
No         No   All       0.314      0.407   0.336  0.559
No         No   Best 20   0.339      0.431   0.369  0.581
No         Yes  All       0.229      0.297   0.253  0.598
No         Yes  Best 20   0.242      0.313   0.269  0.607

Weather = Yes, PAD = Separate, Classes = K-means (k=2)
General    No   All       0.449      0.293   0.324  0.530
General    No   Best 20   0.499      0.349   0.385  0.569
Subsample  No   All       0.469      0.296   0.330  0.532
Subsample  No   Best 20   0.478      0.342   0.374  0.567
No         No   All       0.411      0.313   0.337  0.531
No         No   Best 20   0.483      0.351   0.386  0.560
No         Yes  All       0.671      0.635   0.649  0.660
No         Yes  Best 20   0.651      0.653   0.650  0.646

Weather = Yes, PAD = Separate, Classes = +/-
General    No   All       0.609      0.631   0.586  0.530
General    No   Best 20   0.632      0.649   0.619  0.573
Subsample  No   All       0.621      0.624   0.585  0.530
Subsample  No   Best 20   0.620      0.642   0.609  0.570
No         No   All       0.570      0.631   0.586  0.529
No         No   Best 20   0.652      0.660   0.631  0.567
No         Yes  All       0.674      0.679   0.674  0.671
No         Yes  Best 20   0.668      0.680   0.672  0.673

Weather = Yes, PAD = Separate, Classes = +/0/-
General    No   All       0.532      0.615   0.549  0.523
General    No   Best 20   0.578      0.627   0.579  0.545
Subsample  No   All       0.538      0.618   0.550  0.524
Subsample  No   Best 20   0.579      0.627   0.577  0.543
No         No   All       0.518      0.602   0.547  0.520
No         No   Best 20   0.568      0.632   0.584  0.559
No         Yes  All       0.461      0.467   0.459  0.600
No         Yes  Best 20   0.571      0.569   0.564  0.677
5.2.3 General Text Content Models
To build general text content classification models, the text content feature data of all
participants was used to train an SVM for classification4. These models have four degrees of
freedom. The first is again whether a model is built for each separate dimension of the PAD
model or whether one model is built to predict the entire PAD-tuple. The second is the type
of class system that is used. The third is the type of class weight balancing that is used.
The fourth is whether or not subsampling is performed. In total, 18 different SVM
classification models were built and evaluated. The evaluation results can be found in
Table 5.11. For models that use the joint PAD space, the presented metrics were calculated
for the entire PAD-tuple; for models that use separated PAD dimensions, they were calculated
for each dimension of the PAD space and then averaged to obtain a result for the entire
model. The ROC AUC score indicates that the best result is obtained with the combination of
using the joint PAD space divided into classes according to the k-means clusters (k = 8),
no class weight balancing and subsampling (bold). This model was built using 704 labeled
samples and achieves a ROC AUC score of 0.69 and an F1 score of 0.46 (± 0.059). The
corresponding confusion matrix is presented in Table 5.12. However, note again that all
models built on separate PAD dimensions, using classes determined by the k-means clusters
(with k = 2) or by the positive/negative class definition, no class weight balancing and
subsampling (underlined), have ROC AUC scores similar to this model but also much higher
precision, recall and F1 scores. Again one could argue that these models actually perform
better.
5.2.4 Individual Text Content Models
For the individual text content classification models, the same degrees of freedom as for
the general text content classification models were used to build multiple model variants.
They were built for the same 13 participants as in the individual text content regression
models. The evaluation results can be found in Table 5.13. For models that use the joint
PAD space, the presented metrics were calculated for each participant over the entire
PAD-tuple; for models that use separated PAD dimensions, they were calculated for each
dimension of the PAD space and then averaged to obtain a result for the entire model. Note
again that the results for models that use the joint PAD space in combination with the
positive/neutral/negative class definition were left out, for the same reasons as in the
individual dynamics classification models. The ROC AUC score indicates that the best result
on average is obtained with the combination of using the separated PAD space in which the
classes are defined according to the k-means clusters (with k = 2), no class weight
balancing and subsampling (bold). An average ROC AUC score of
4C-Support Vector Classification: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Table 5.11: General text content classification models results (CWB = class weight
balancing, SS = subsampling)

PAD       Classes        CWB      SS   Precision    Recall       F1           ROC AUC
Joint     K-means (k=8)  General  No   0.268019722  0.275276126  0.266816902  0.57673234
Joint     K-means (k=8)  No       No   0.307464182  0.289719626  0.221705095  0.561704624
Joint     K-means (k=8)  No       Yes  0.45753873   0.458806818  0.456671627  0.690746753
Joint     +/-            General  No   0.333963658  0.283772302  0.293695074  0.56930125
Joint     +/-            No       No   0.387398865  0.400169924  0.263543947  0.525877128
Joint     +/-            No       Yes  0.333055677  0.31402439   0.320677112  0.608013937
Joint     +/0/-          General  No   0.253132625  0.143585387  0.149888959  0.538395204
Joint     +/0/-          No       No   0.240808258  0.351741716  0.205233201  0.525977603
Joint     +/0/-          No       Yes  0.037037037  0.037037037  0.037037037  0.5
Separate  K-means (k=2)  General  No   0.624411948  0.601834804  0.612769593  0.614829803
Separate  K-means (k=2)  No       No   0.633501525  0.624903002  0.583944684  0.57153326
Separate  K-means (k=2)  No       Yes  0.68553396   0.662304757  0.672002906  0.677553096
Separate  +/-            General  No   0.710538281  0.70903735   0.709616053  0.615116742
Separate  +/-            No       No   0.667320598  0.875335615  0.756375021  0.570957361
Separate  +/-            No       Yes  0.664575639  0.692656261  0.677612159  0.671396258
Separate  +/0/-          General  No   0.574958879  0.553125876  0.561861179  0.610086922
Separate  +/0/-          No       No   0.552407486  0.596860107  0.535415923  0.568419015
Separate  +/0/-          No       Yes  0.528597285  0.514981702  0.516663227  0.636236276
Table 5.12: Confusion matrix for best general text content classification model
1 2 3 4 5 6 7 8
1 50% 17.04% 1.12% 9.12% 10.24% 4.56% 4.56% 3.44%
2 4.72% 51.12% 2.24% 4.56% 13.6% 6.8% 11.36% 5.68%
3 5.68% 11.36% 55.68% 2.24% 6.8% 2.24% 10.24% 5.68%
4 7.92% 15.92% 11.36% 34.08% 5.68% 1.12% 13.6% 10.24%
5 10.24% 21.6% 13.6% 2.24% 34.08% 5.68% 5.68% 6.8%
6 5.68% 12.48% 4.56% 4.56% 13.6% 45.44% 7.92% 5.68%
7 3.44% 18.16% 3.44% 9.12% 6.8% 4.56% 52.24% 2.24%
8 9.12% 22.72% 6.8% 10.24% 17.04% 3.44% 3.44% 27.28%
0.60 and an F1 score of 0.61 (± 0.21) are achieved. This is one of the model families that
also yielded good performance for the general text content classification models. Moreover,
the combination of using the separated PAD space in which each axis is divided into a
positive and a negative class, no class weight balancing and subsampling (underlined) yields
performance very similar to that of the best model. This is again one of the model families
that performed well for the general text content classification models, and it also has
satisfying precision, recall and F1 scores.
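The averaging step used throughout these evaluations for models built on separate PAD
dimensions can be made concrete with a small sketch. The per-dimension scores below are
hypothetical placeholders, not results from this work; only the averaging mechanism is the
point.

```python
# Average per-dimension scores into one score for the whole model,
# as done for models built on separate PAD dimensions.
# The scores below are hypothetical placeholders.
scores = {
    "pleasure":  {"f1": 0.62, "roc_auc": 0.60},
    "arousal":   {"f1": 0.58, "roc_auc": 0.59},
    "dominance": {"f1": 0.63, "roc_auc": 0.61},
}

model_score = {
    metric: sum(dim[metric] for dim in scores.values()) / len(scores)
    for metric in ("f1", "roc_auc")
}
print({metric: round(value, 3) for metric, value in model_score.items()})
```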
Table 5.13: Individual text content classification models results (CWB = class weight
balancing, SS = subsampling)

PAD       Classes        CWB      SS   Precision    Recall       F1           ROC AUC
Joint     K-means (k=8)  General  No   0.2048145    0.221690029  0.200216664  0.532035625
Joint     K-means (k=8)  No       No   0.144275293  0.277016936  0.177469156  0.525336029
Joint     K-means (k=8)  No       Yes  0.246539472  0.196759259  0.208456228  0.541005291
Joint     +/-            General  No   0.273186198  0.282140001  0.264090733  0.51690944
Joint     +/-            No       No   0.261486105  0.39437678   0.282478137  0.497085081
Joint     +/-            No       Yes  0.095089286  0.057291667  0.067899548  0.461309524
Separate  K-means (k=2)  General  No   0.456824801  0.457858561  0.450353145  0.539618275
Separate  K-means (k=2)  No       No   0.385719024  0.380189637  0.357320477  0.540389041
Separate  K-means (k=2)  No       Yes  0.602536765  0.628500278  0.606165789  0.598757394
Separate  +/-            General  No   0.622120277  0.669521918  0.641281107  0.54520671
Separate  +/-            No       No   0.578284116  0.712970634  0.63220738   0.540734206
Separate  +/-            No       Yes  0.586616834  0.618658291  0.596331473  0.594683574
Separate  +/0/-          General  No   0.57161971   0.595287325  0.572264152  0.552319046
Separate  +/0/-          No       No   0.546616105  0.630107456  0.561077139  0.542007658
Separate  +/0/-          No       Yes  0.354593679  0.343848047  0.341829526  0.507886036
Chapter 6
Discussion
This chapter presents additional discussion and result analysis in order to draw more
specific conclusions about which models perform well or poorly and which model aspects have
a large or small influence on this performance. Furthermore, some extra thought is given to
other possibilities for achieving better performance.
6.1 Overall Results
The best overall results are summarized in Table 6.1. The regression models that were built
on the dynamics features did not, in general, yield good performance. Both the general and
the individual models produce relatively high errors on average and do not give predictions
that are significantly better than always predicting the mean value. This does not mean that
regression models cannot be built for this application: the most plausible explanation for
the poor performance is the limited size and the bias of the dataset. A larger dataset
containing more uniformly distributed samples may well yield better regression models. The
models that were built using the text content features did not perform well either. This can
again be explained by the limited size of the dataset: the number of text content features
extracted is, on average, much larger than the number of available samples, resulting in a
very sparse feature space. Furthermore, the quality of the text content itself is very
limited (e.g. dialects, chat language, multiple languages), which also lowers performance.

Table 6.1: Summary of best overall results

                                         R2/        MAE/     MSE/     EVS/
Model type                   # samples   Precision  Recall   F1       ROC AUC
Regression
  General     Dynamics       630          0.177     17.66    453.38    0.177
  General     Text content   1177         0.036     19.69    510.28    0.042
  Individual  Dynamics       N/A          0.048     16.19    394.38    0.049
  Individual  Text content   N/A         -0.051     17.64    456.57   -0.034
Classification
  General     Dynamics       704          0.722     0.735    0.728     0.726
  General     Text content   704          0.686     0.662    0.672     0.678
  Individual  Dynamics       N/A          0.658     0.633    0.644     0.648
  Individual  Text content   N/A          0.603     0.629    0.606     0.599
The classification models produced much better results and a clear pattern: a family of
models that performs very well. For both the general and the individual models, the best
models are those built on separate PAD dimensions, either divided into classes using the
k-means clusters or according to the positive/negative class definition, without class
weight balancing and with subsampling. This is expected, as using the entire PAD-tuple
results in a multi-class problem, which is harder to solve than the three binary
classification problems resulting from using separate PAD dimensions. The fact that the
models using k-means clustering to generate classes perform very similarly to the models
using the positive/negative class definition can be explained by the value distributions of
the PAD axes, shown in Figure 6.1. Two clusters of values can clearly be seen for each PAD
axis, and these are very likely the same clusters that the k-means clustering algorithm
finds. Presumably, these two obvious clusters are caused by the participants' tendency to
always move the sliders while indicating their emotional state, so that the slider ends up
either left or right of the centre but never at the centre itself. These two clusters align
closely with the positive/negative class definition, which means that the two ways of
defining classes yield very similar classes, and therefore very similar results. Indeed,
Table 6.2 presents the clusters generated by the k-means algorithm (with k = 2) and shows
that the generated classes closely match those generated by the positive/negative class
definition. One of the main reasons for the better performance of the classification models
can probably be attributed to the subsampling technique, which solves the class skew
problem, one of the main limitations of the regression models. A similarly intelligent
technique could perhaps be applied to the continuous dataset to improve the regression
models as well.
Table 6.2: Clusters generated by the 2-means algorithm for separated PAD dimensions

PAD dimension  Cluster centroid  # samples
Pleasure       30.61238532       436
               71.90418354       741
Arousal        28.89059501       521
               72.98170732       656
Dominance      32.20266667       375
               73.7319202        802
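The near-equivalence of the 2-means clusters and the positive/negative split can be checked
directly. The sketch below uses synthetic bimodal slider values mimicking the distributions
in Figure 6.1 (the real input is the self-report data); the modes, sizes and threshold are
assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic 0-100 slider values with two modes, mimicking the bimodal
# distributions in Figure 6.1 (participants avoided the centre).
values = np.concatenate([
    rng.normal(30, 8, size=500),
    rng.normal(72, 8, size=700),
]).clip(0, 100).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
pos_neg = (values.ravel() >= 50).astype(int)  # positive/negative class definition

# Align cluster ids with the +/- labels before comparing.
clusters = kmeans.labels_
if kmeans.cluster_centers_.ravel()[0] > kmeans.cluster_centers_.ravel()[1]:
    clusters = 1 - clusters

agreement = (clusters == pos_neg).mean()
print(f"agreement between 2-means and +/- classes: {agreement:.3f}")
```

With two well-separated modes, the two labelings disagree only for the few values falling
between the +/- threshold and the midpoint of the cluster centroids.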
[Figure: three histograms of self-reported values (0-100 scale, binned per 10) per PAD axis]
(a) Distribution of pleasure axis values
(b) Distribution of arousal axis values
(c) Distribution of dominance axis values
Figure 6.1: Distributions of PAD axis values
The effects of applying class weight balancing in the classification models were negligible,
indicating that this technique does not work well here. Finally, the models built using text
content features do not perform very well compared to the models built using dynamics
features. An important reason is probably that the text content contains multiple languages
and, more importantly, dialects and so-called chat language, which results in many different
spellings of the same word. Furthermore, there is no real detection of emotionally charged
words, which could help improve the performance of the text content models.
6.2 Contextual Data
In this research, contextual data was used in the form of weather information. The features
concerning weather information (temperature, pressure, humidity and discomfort index) do
not seem to make much of a difference to model performance. For the regression models,
however, there is a pattern indicating that weather features might be useful. One of the
main problems is that including weather data causes many samples to be removed, which in
turn decreases model performance due to the smaller dataset. At first sight, one might
conclude that weather data does not provide discriminative information to the model. In
classification, however, the decrease in performance due to the removal of samples without
weather data is much smaller and almost non-existent, indicating that weather features might
provide useful information. Thus, due to the limited size of the dataset, the effects of
using contextual data are obfuscated. Further research using larger datasets might give more
insight into the value of contextual data for emotion recognition.
6.3 Feature Selection
Using feature selection does not have a significant impact on the results. This is
consistent with the feature importances provided by the random forest algorithm for the best
general dynamics regression and classification models, shown in Figure 6.2.
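A plausible sketch of the "Best 20" feature-selection variant used in the model tables is
given below: features are ranked by random-forest importances and the top 20 are kept. The
dataset is synthetic and the shapes are assumptions; only the selection mechanism is
illustrated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dynamics feature matrix (shapes are assumed).
X, y = make_classification(n_samples=400, n_features=60, n_informative=8,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank features by importance and keep the 20 best.
best_20 = np.argsort(forest.feature_importances_)[::-1][:20]
X_selected = X[:, best_20]
print(X_selected.shape)
```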
6.4 Subsampling
In the previous chapter it was noted that subsampling improves the classification results
significantly (by almost 20%), because it solves the class skew problem. The downside of
subsampling, however, is that a lot of data is discarded and simply not used by these
models. To show that the subsampling approach is nevertheless valid, a technique called
bootstrap subsampling can be used. Bootstrap subsampling builds multiple bootstrap models,
each using all data of the least represented class (as in normal subsampling) and a
different random subset of the data in the other classes. The average of the scores of the
bootstrap models is then the score for the entire model. This can provide a more reliable
evaluation, as more of the data is used. Table 6.3 presents the results of bootstrap
subsampling applied to the general dynamics classification model that combines not including
weather data, separate PAD dimensions with classes defined by the k-means clusters (with
k = 2), no class weight balancing, subsampling and no feature selection. The downside of
this technique is that it is very time-consuming; it was therefore not applied to all models
built with subsampling.
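The bootstrap subsampling procedure can be sketched as follows. The skewed toy dataset, the
classifier settings and the number of rounds are illustrative assumptions; the procedure
itself (all minority samples plus a fresh random majority subset per round, scores averaged)
is the one described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Skewed two-class toy dataset: 120 minority vs. 600 majority samples.
X = np.vstack([rng.normal(1.0, 1.0, size=(120, 10)),
               rng.normal(-1.0, 1.0, size=(600, 10))])
y = np.array([1] * 120 + [0] * 600)

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

scores = []
for round_ in range(10):
    # Each bootstrap model uses all minority samples plus a fresh random
    # subset of the majority class of the same size.
    subset = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, subset])
    clf = RandomForestClassifier(n_estimators=100, random_state=round_)
    scores.append(cross_val_score(clf, X[idx], y[idx], cv=3,
                                  scoring="roc_auc").mean())

print(f"bootstrap-subsampling ROC AUC: {np.mean(scores):.3f}")
```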
6.5 Prediction Performance for PAD-dimensions
From the general dynamics regression results for models that predict separate PAD
dimensions, it can be concluded that the dominance dimension can be predicted more
accurately than the pleasure and arousal dimensions, according to the R2 scores and MAE
values. It is important, however, to again take the value distribution of this axis into
account (see Figure 6.1). This distribution clearly shows a large bias for the dominance
dimension, which explains the lower MAE values. It does not, however, explain the fact that
the R2 scores are also higher for the dominance dimension than for the other dimensions.
This cannot be explained by the biased value distribution, as the R2 score is relative to
always predicting the mean: the mean of the dominance dimension is equally affected by the
bias, so the bias itself does not produce a higher R2 score. This means that the features
are genuinely more informative for the dominance dimension than for the other dimensions.

The same phenomenon is observed in the general dynamics classification results for models
that use separate PAD dimensions. The dominance dimension achieves significantly higher
precision, recall and F1 scores than the other dimensions, while its ROC AUC scores are in
line with those for the other dimensions.
6.6 Number of Samples
The model performance stays stable when the number of training samples is changed for the
general dynamics regression models. Table 6.4 contains the results, for various training set
sizes, of the general dynamics regression model that combines including weather data, the
joint PAD space and feature selection. The same holds when the individual model results of
different participants are compared. The general dynamics classification model performance
also stays stable when the number of training samples is changed. Table 6.5
contains the results for various training set sizes for the general dynamics classification
model that combines not including weather data, separate PAD dimensions with classes defined
by the k-means clusters (with k = 2), no class weight balancing, subsampling and no feature
selection. The same holds for the individual models. Thus, it can be concluded that the
number of samples does not have a significant influence on model performance; rather, it is
the sample value distributions that influence the results.

Table 6.3: Bootstrap subsampling results for general dynamics classification model

Round    Precision    Recall       F1           ROC AUC
1        0.717585912  0.726206609  0.72175615   0.720000712
2        0.707315101  0.723823526  0.71546539   0.712108198
3        0.718066776  0.739734021  0.728667081  0.724471576
4        0.718576649  0.723644485  0.720651117  0.719759138
5        0.71464299   0.726986909  0.720491328  0.71761579
6        0.729068549  0.741013612  0.734839796  0.732444988
7        0.72155289   0.735463468  0.728329377  0.72554445
8        0.710272011  0.719003418  0.714465444  0.712470473
9        0.724458701  0.72897929   0.7261802    0.725351842
10       0.729446734  0.742026128  0.735546113  0.733145494
Average  0.719098631  0.730688147  0.7246392    0.722291266
6.7 Model Hyperparameters
By optimizing the model hyperparameters, the results can be improved further. For the
dynamics regression models, the best results are obtained with the random forest of the
original 500 trees, but with the maximum number of features used per tree equal to the total
number of available features instead of its square root. For the dynamics classification
models, the best results are obtained with the random forest of the original 500 trees, but
with the maximum number of features used per tree equal to the binary logarithm of the total
number of available features instead of its square root. The tables containing the results
of the optimization processes can be found in Appendix E.
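The search over the maximum number of features per tree can be sketched like this. The data
is synthetic and the tree count is reduced for brevity; the three option values match
scikit-learn's random-forest `max_features` parameter (square root, binary logarithm, and
all features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the dynamics feature data.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# 'sqrt' is the square root of the feature count, 'log2' the binary
# logarithm, and None uses all available features per tree.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": ["sqrt", "log2", None]},
    cv=3, scoring="roc_auc",
).fit(X, y)

print(search.best_params_)
```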
Table 6.4: Results for general dynamics regression model with various training set sizes
Nr of samples R2 MAE MSE EVS
100 0.22524175 13.6874 296.6768386 0.225479194
200 0.200609351 13.97993333 301.8228383 0.200895355
300 0.164304591 15.50327333 374.3565912 0.164358175
400 0.207546542 15.77119833 377.2240982 0.207925006
500 0.20879954 16.859352 413.9270148 0.209297264
600 0.174548003 17.16563111 428.6194474 0.175037163
Table 6.5: Results for general dynamics classification model with various training set sizes
Nr of samples Precision Recall F1 ROC AUC
100 0.673879142 0.674569402 0.672268673 0.669756839
200 0.707781439 0.656543907 0.680428146 0.691684817
300 0.741407802 0.733825815 0.736972541 0.737031076
400 0.735368727 0.70652697 0.719417635 0.725463714
500 0.71523112 0.708420543 0.711343485 0.714561319
600 0.698434725 0.71025146 0.704286078 0.701755464
700 0.711076088 0.725112248 0.716872164 0.712789109
800 0.703133438 0.720240645 0.711293069 0.706909115
6.8 Clustered Models
Both general models, which do not distinguish between persons, and individual models, which
are built using data of one person only, were built. Besides these types of models,
clustered models could also be built. The concept of clustered models is, just like that of
individual models, based on the assumption that every person is unique. However, it is also
assumed that there may exist a finite number of clusters, or user profiles, comprising
people that show similar computer behaviour for each emotional state. A clustered model
first tries to define these profiles and to assign each person to a profile. Next, it builds
one model per profile using only data from the persons assigned to that profile. The
profiles can be detected based on similar personality characteristics using a standard
clustering algorithm such as k-means clustering. To do this, more participants are needed
and more extensive personality questionnaires should be administered to all participants.
6.9 Dataset
We have prepared the raw dataset used in this research for future use, which required some
precautions. During the data collection process, all keystroke, mouse movement and location
data of each participant were collected. These data have already been anonymized by
assigning a specific irreversible identifier to each participant. Corresponding personal
information, such as a participant's name, email address and place of birth, will be
discarded. Furthermore, as all keystrokes were collected, the data also contain email
addresses, passwords and other potentially sensitive information that should not be made
publicly available. Editing these data manually would be infeasible, so we opted to apply a
substitution code to the alphabetic data. This way, the actual textual content becomes
unreadable, while all of the dynamics features still make sense: corresponding keydown and
keyup pairs can still be matched, and the distinction between alphabetic, numerical and
special characters still exists. To maintain the hierarchical structure of the entire
dataset, we opted to make it available in JSON format. Note that the dataset contains raw
data and does not provide pre-calculated features.
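A substitution code of this kind can be sketched as follows. The fixed permutation below is
purely illustrative (a real anonymisation pass would use a secret, randomly generated
permutation); the sketch only shows how alphabetic characters are substituted while digits,
punctuation, whitespace and letter case are left intact.

```python
import string

# Illustrative fixed permutation of the alphabet (a real anonymisation
# pass would use a secret, randomly generated permutation).
SHUFFLED = "qwertyuiopasdfghjklzxcvbnm"
TABLE = str.maketrans(
    string.ascii_lowercase + string.ascii_uppercase,
    SHUFFLED + SHUFFLED.upper(),
)

def substitute(text: str) -> str:
    """Substitute alphabetic characters only; digits, punctuation and
    whitespace are left intact, so dynamics features remain meaningful."""
    return text.translate(TABLE)

print(substitute("Password123!"))  # alphabetic part becomes unreadable
```

Because the mapping is a bijection on the alphabet, character classes and keydown/keyup
correspondences are preserved exactly.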
[Figure: bar charts of random-forest feature importances]
(a) Regression
(b) Pleasure classification
(c) Arousal classification
(d) Dominance classification
Figure 6.2: Feature importances for best general dynamics models
Chapter 7
Conclusion
The techniques currently available for recognizing the emotions of computer users are often
expensive and intrusive, and therefore not well suited for use in home or office
environments. The real-world application possibilities of affective computing solutions are
thus limited by several factors.
This research covered a solution that analyzes users’ keystroke dynamics, textual content,
mouse movements and some contextual factors to recognize their emotional state. This solu-
tion overcomes many of the limitations of current techniques. It is a non-intrusive solution as
the user is not aware of the ongoing monitoring process. Furthermore, this solution only uses
inexpensive and widely used computer equipment. This makes the technique very interesting
to use for real-world application of affective computing solutions in present-day computer
systems.
Data collection software was developed and used to perform a field study. In this field study,
an experience sampling methodology was used to collect samples, in which participants had
to report their emotional state on the PAD dimensional emotion model. These reports were
taken as ground truth values and features were extracted from the raw monitoring data. These
ground truth values were then used to build emotion recognition models based on the
extracted features.
7.1 Lessons Learned
One of the main limitations of this research was the reliance on the participants'
self-reports to obtain the ground truth values. This has the disadvantage that it is
entirely possible for participants to provide incorrect emotional state information,
intentionally or not. However, there is no objective method for emotion identification that
lends itself to the experience sampling methodology. Furthermore, the use of this
methodology requires participants to have a good understanding of the emotion model that is
used to indicate their emotional
states. The PAD model has the disadvantage that the meaning of its dimensions is not always
clear without proper explanation. Of course, there are also advantages to the experience
sampling methodology: it allows for an easy remote data collection process over long periods
of time, and participants are not influenced by the data collection process, as they are in
laboratory settings.
Another problem that was observed in this research is class skew. This problem is again
caused by the experience sampling methodology: because it does not allow emotions to be
induced in participants, a dataset that is uniformly distributed over the PAD space cannot
be obtained without manipulation. Classification appeared to be easier than regression,
mainly due to this class skew and the size of the dataset. Classification allows for
subsampling, which solves the class skew problem and improves model performance
significantly. However, subsampling also has the disadvantage of removing a lot of useful
data. To allow all data to be used, bootstrap subsampling can be applied.
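The subsampling step described above can be illustrated with a minimal sketch; the function and variable names are our own and do not come from the study's implementation:

```python
import random
from collections import defaultdict

def subsample_balanced(samples, labels, seed=0):
    """Randomly drop samples from the majority classes so that every class
    is reduced to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Size of the rarest class determines the per-class sample budget.
    n_min = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items()
                for x in rng.sample(xs, n_min)]
    rng.shuffle(balanced)
    return balanced
```

Bootstrap subsampling repeats this step with different random seeds and aggregates the resulting models, so that over many draws all of the data ends up being used.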
In general, the dataset used in this research was not very big. More training samples
would, first of all, result in better coverage of the PAD space and, secondly, in less dramatic
effects of subsampling. Individual models were built as well, but this proved difficult due to
an insufficient number of samples per person and the manifestation of the class skew
problem at the individual level.
7.2 Contributions
This section presents a brief summary of the main contributions in this research.
First, the pleasure-arousal-dominance (PAD) emotion model was used, providing a more
discriminative way of defining emotions. This emotion model also allows regression
techniques to be applied to the emotion recognition task. Other studies used either discrete
emotion models (based on classes), one-dimensional models (such as PANAS) or
two-dimensional models (such as the PA model), all of which can be ambiguous for some
emotional states.
Second, free text data was used as the basis for the models in this research, in contrast to
many other studies that used fixed text. Although free text is a much more difficult
approach, its potential for real-world application is much bigger, as it allows users to work
without restraint while information about their emotional state is still being collected.
Third, models were built based on a combination of dynamics features (both keystroke and
mouse) and contextual features (i.e. weather data), and compared to investigate the added
value of contextual features. To allow weather data to be used, an easy workflow was
provided that only depends on the user's location services settings and an internet
connection. This research has shown that weather data may contain valuable information
for emotion recognition, but further examination is required to confirm this.
Fourth, both general models and individual models were built to investigate the possible
advantages of using more user-specific models to make accurate predictions. The results in
this research for individual models are promising but more data and especially more uniformly
distributed data is required to be able to draw more reliable conclusions on this matter.
Fifth, a methodology was presented to use regression models in more practical settings using
fuzzy logic. This methodology also allows for less accurate predictions to be used, as the
inherent fuzziness and uncertainty can be easily handled using fuzzy rules.
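As an illustration of such a fuzzy-logic post-processing step, a regression output can be mapped onto fuzzy sets and combined with simple rules. The value range, membership boundaries and rule below are illustrative assumptions, not the parameters of the actual methodology:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a to a peak at b and
    falling back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzify_pleasure(p):
    """Map a predicted pleasure value (assumed here to lie in [-100, 100])
    to membership degrees of three illustrative fuzzy sets."""
    return {
        "unpleasant": tri(p, -200, -100, 0),
        "neutral": tri(p, -100, 0, 100),
        "pleasant": tri(p, 0, 100, 200),
    }

# Example rule: IF pleasure is pleasant AND arousal is high THEN suggest
# upbeat content; min() acts as the fuzzy AND. The arousal membership 0.8
# is a made-up value for illustration.
firing_strength = min(fuzzify_pleasure(50)["pleasant"], 0.8)
```

Because each prediction activates several sets to a degree, an imprecise regression output still yields graded rule activations rather than a brittle hard decision.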
Sixth, good classification models were created for predicting two class levels on each
separate dimension of the PAD model based on dynamics features. The best model achieved
a classification accuracy of 0.73, a precision of 0.72, a recall of 0.74, an F1 score of 0.73 and
a ROC AUC score of 0.73.
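For reference, these scores follow directly from the confusion matrix; a minimal pure-Python sketch for binary labels (ROC AUC is omitted, since it requires ranked prediction scores rather than hard class labels):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```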
Finally, an easy-to-use dataset was constructed with a hierarchical structure and prepared for
public use. One of the main problems in this research area is that most studies use a different
dataset, different metrics, different features and/or different emotion models. This causes a
lot of difficulties when comparing findings and results. By making our dataset available for
use in other studies, the problem of differing datasets and emotion models is alleviated.
7.3 Potential for Application
This study presents classification models that are suitable for real-world application.
Regression models would be even more interesting and specific, but solutions suited for
practical use could not yet be created. The real-world application of these techniques also
raises ethical concerns regarding privacy, as the monitoring process goes unnoticed by the
user. The collected data may serve well to recognize a user's affective state, but it may also
cause a lot of harm if it falls into the wrong hands, as it may contain passwords and other
sensitive information. Notifying the user of the background monitoring process allows for
consent, but counteracts the goal of concealment.
7.4 Future Work
To further improve the models based on the presented techniques, bigger and more
uniformly distributed datasets need to be created. A dataset does not necessarily have to be
distributed uniformly from the start: when it contains enough data for each class,
subsampling can be applied to make it uniformly distributed. Alternative techniques to
subsampling can also be investigated. One such alternative is oversampling, which has the
advantage that all available data is used, but the disadvantage that it may lead to
overfitting on the specific dataset.
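Random oversampling, sketched below for illustration (the names are our own), duplicates minority-class samples with replacement until every class matches the majority class in size:

```python
import random
from collections import Counter

def oversample_balanced(samples, labels, seed=0):
    """Duplicate minority-class samples (drawn with replacement) until all
    classes are as frequent as the majority class. All original data is kept,
    but the duplicates can encourage overfitting."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n_max = max(counts.values())
    out = list(zip(samples, labels))
    for y, n in counts.items():
        pool = [x for x, yy in zip(samples, labels) if yy == y]
        out.extend((rng.choice(pool), y) for _ in range(n_max - n))
    rng.shuffle(out)
    return out
```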
It would also be interesting to examine the possibility of using the clustered models that
were proposed. This requires a lot more participants, such that clusters of participants can
be formed, models can be built for each cluster, and the models can be compared to observe
whether there is any improvement.
In this research, all samples were assumed to be mutually independent. However, when
observing samples at an individual level, it makes sense to assume that emotional states in
these samples depend on each other. For example, a person who is exuberant at time t is
more likely to be relaxed at time t+1 than to be bored. Therefore, it is interesting to explore
the use of time series on an individual level.
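Assuming discretized emotional-state labels, this temporal dependence could be captured with something as simple as per-individual first-order Markov transition estimates; a sketch:

```python
from collections import Counter, defaultdict

def transition_probabilities(state_sequence):
    """Estimate first-order Markov transition probabilities from one
    individual's chronological sequence of discrete emotional-state labels."""
    counts = defaultdict(Counter)
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for state, nxts in counts.items()
    }
```

Such a transition matrix could then serve as a prior over the next predicted state, rather than treating each sample as independent.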
Furthermore, other features can be used as a replacement for, or in addition to, the features
used in this study. Possible features include the sex of the user, whether the user is right- or
left-handed, the user's keyboard layout, the application context and typing proficiency.
Note that this information is already available in the current dataset. It may also be
interesting to investigate other techniques for outlier removal in order to reduce the noise
in the dataset.
As was mentioned before, one of the limitations of this study was the reliance on the
participants for establishing the ground truth. To overcome this, other techniques may be
used to identify the participants' emotional state, such as facial emotion recognition using a
camera.
Finally, the usage of textual content can be improved by detecting emotionally charged
words, for example using a hand-crafted dictionary of such words. Text normalization and
stemming techniques might also be used to optimize the text interpretation process and to
allow dialects to be handled. It could also be interesting to perform language detection and
then use the corresponding processing settings for that language. Note, however, that the
textual content in the provided dataset has been encrypted, so textual content analysis on
it is not possible. Another approach to text content analysis would be to collect an actual
set of texts (e.g. posts on social media) and apply machine learning techniques to that
dataset.
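A sketch of such a lexicon-based feature, using a deliberately crude suffix stripper in place of a real stemmer (e.g. the Porter stemmer) and a toy lexicon; everything here is illustrative:

```python
def crude_stem(word):
    """A very crude suffix stripper standing in for a real stemming
    algorithm; for illustration only."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def emotional_word_ratio(text, lexicon):
    """Fraction of tokens whose stem appears in a hand-crafted lexicon of
    emotionally charged word stems (the lexicon is supplied by the caller)."""
    tokens = [crude_stem(t.strip(".,!?").lower()) for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)
```

The resulting ratio could be added as one extra feature alongside the keystroke and mouse dynamics features.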
A number of techniques for recognizing a user’s emotional state based on typing behaviour,
mouse movements, contextual information and textual content were presented and explored.
The results of this research show that this is possible using hardware that is commonly
available today, which makes the solution particularly suitable for real-world application.
Bibliography
[1] Artificial neural network illustration. https://en.wikipedia.org/wiki/Artificial_
neural_network.
[2] Bayesian network example. http://www.ra.cs.uni-tuebingen.de/software/JCell/
tutorial/ch03s03.html.
[3] Decision tree example. https://alliance.seas.upenn.edu/~cis520/wiki/index.
php?n=Lectures.DecisionTrees.
[4] Em-algorithm example. http://www.bobnirma.com/tag/expectation-maximization.
[5] k-means clustering example. https://apandre.wordpress.com/visible-data/
cluster-analysis/.
[6] k-nn classification illustration. http://bdewilde.github.io/blog/blogger/2012/10/
26/classification-of-hand-written-digits-3/.
[7] Kernel trick illustration. http://www.eric-kim.net/eric-kim-net/posts/1/kernel_
trick.html.
[8] Lövheim cube of emotion. https://en.wikipedia.org/wiki/Emotion_classification.
[9] Machine learning. https://en.wikipedia.org/wiki/Machine_learning.
[10] Neuron illustration. https://en.wikibooks.org/wiki/Artificial_Neural_
Networks/Print_Version.
[11] Random forest classification illustration. http://file.scirp.org/Html/6-9101686_
31887.htm.
[12] Support vector machine classification illustration. https://www.dtreg.com/solution/
view/20.
[13] Areej Alhothali. Modeling user affect using interaction events. 2011.
[14] Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. Emotions from text: machine
learning for text-based emotion prediction. In Proceedings of the conference on Human
Language Technology and Empirical Methods in Natural Language Processing, pages 579–
586. Association for Computational Linguistics, 2005.
[15] Kaveh Bakhtiyari and Hafizah Husain. Fuzzy model in human emotions recognition.
arXiv preprint arXiv:1407.1474, 2014.
[16] Kaveh Bakhtiyari and Hafizah Husain. Fuzzy model of dominance emotions in affective
computing. Neural Computing and Applications, 25(6):1467–1477, 2014.
[17] Kaveh Bakhtiyari, Mona Taghavi, and Hafizah Husain. Implementation of emotional-
aware computer systems using typical input devices. In Intelligent Information and
Database Systems, pages 364–374. Springer, 2014.
[18] Margaret M Bradley and Peter J Lang. Affective norms for english words (anew): In-
struction manual and affective ratings. Technical report, Technical Report C-1, The
Center for Research in Psychophysiology, University of Florida, 1999.
[19] Joost Broekens. In defense of dominance: Pad usage in computational representations
of affect. International Journal of Synthetic Emotions (IJSE), 3(1):33–42, 2012.
[20] Gerald L Clore, Andrew Ortony, and Mark A Foss. The psychological foundations of the
affective lexicon. Journal of personality and social psychology, 53(4):751, 1987.
[21] Paul Ekman. An argument for basic emotions. Cognition & emotion, 6(3-4):169–200,
1992.
[22] Paul Ekman and Wallace V Friesen. Constants across cultures in the face and emotion.
Journal of personality and social psychology, 17(2):124, 1971.
[23] Clayton Epp, Michael Lippold, and Regan L Mandryk. Identifying emotional states
using keystroke dynamics. In Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, pages 715–724. ACM, 2011.
[24] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical resource
for opinion mining. In Proceedings of LREC, volume 6, pages 417–422. Citeseer, 2006.
[25] Michael Fairhurst, Da Costa-Abreu, et al. Using keystroke dynamics for gender identi-
fication in social network environment. In Imaging for Crime Detection and Prevention
2011 (ICDP 2011), 4th International Conference on, pages 1–6. IET, 2011.
[26] Joseph P Forgas. Mood and judgment: the affect infusion model (aim). Psychological
bulletin, 117(1):39, 1995.
[27] R Stockton Gaines, William Lisowski, S James Press, and Norman Shapiro. Authentica-
tion by keystroke timing: Some preliminary results. Technical report, DTIC Document,
1980.
[28] Hatice Gunes, Bjorn Schuller, Maja Pantic, and Roddy Cowie. Emotion representation,
analysis and synthesis in continuous space: A survey. In Automatic Face & Gesture
Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages
827–834. IEEE, 2011.
[29] Joel M Hektner, Jennifer A Schmidt, and Mihaly Csikszentmihalyi. Experience sampling
method: Measuring the quality of everyday life. Sage, 2007.
[30] Amaury Hernandez-Aguila, Mario Garcia-Valdez, and Alejandra Mancilla. Affective
states in software programming: Classification of individuals based on their keystroke
and mouse dynamics. Intelligent Learning Environments, page 27, 2014.
[31] Holger Hoffmann, Andreas Scheck, Timo Schuster, Steffen Walter, Kerstin Limbrecht,
Harald C Traue, and Henrik Kessler. Mapping discrete emotions into the dimensional
space: An empirical approach. In Systems, Man, and Cybernetics (SMC), 2012 IEEE
International Conference on, pages 3316–3320. IEEE, 2012.
[32] Edgar Howarth and Michael S Hoffman. A multidimensional approach to the relationship
between mood and weather. British Journal of Psychology, 75(1):15–23, 1984.
[33] Carroll E Izard. Four systems for emotion activation: cognitive and noncognitive pro-
cesses. Psychological review, 100(1):68, 1993.
[34] Philip Nicholas Johnson-Laird and Keith Oatley. The language of emotions: An analysis
of a semantic field. Cognition and emotion, 3(2):81–123, 1989.
[35] Christian Kaernbach. On dimensions in emotion psychology. In Automatic Face &
Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference
on, pages 792–796. IEEE, 2011.
[36] A Kołakowska, A Landowska, M Szwoch, W Szwoch, and MR Wróbel. Emotion recognition and its applications. In Human-Computer Systems Interaction: Backgrounds and Applications 3, pages 51–62. Springer, 2014.
[37] Agata Kołakowska. A review of emotion recognition methods based on keystroke dynamics and mouse movements. In Human System Interaction (HSI), 2013 The 6th International Conference on, pages 548–555. IEEE, 2013.
[38] Agata Kołakowska. Recognizing emotions on the basis of keystroke dynamics. In Human System Interactions (HSI), 2015 8th International Conference on, pages 291–297. IEEE, 2015.
[39] James D Laird. Self-attribution of emotion: the effects of expressive behavior on the
quality of emotional experience. Journal of personality and social psychology, 29(4):475,
1974.
[40] Peter J Lang, Mark K Greenwald, Margaret M Bradley, and Alfons O Hamm. Looking at
pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology, 30:261–
261, 1993.
[41] J LeDoux. Emotion circuits in the brain. 2003.
[42] Hosub Lee, Young Sang Choi, Sunjae Lee, and IP Park. Towards unobtrusive emo-
tion recognition for affective social communication. In Consumer Communications and
Networking Conference (CCNC), 2012 IEEE, pages 260–264. IEEE, 2012.
[43] Po-Ming Lee, Wei-Hsuan Tsui, and Tzu-Chien Hsiao. The influence of emotion on
keyboard typing: an experimental study using visual stimuli. Biomedical engineering
online, 13(1):81, 2014.
[44] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Can your smartphone
infer your mood. In PhoneSense workshop, pages 1–5, 2011.
[45] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Moodscope: Building
a mood sensor from smartphone usage patterns. In Proceeding of the 11th annual inter-
national conference on Mobile systems, applications, and services, pages 389–402. ACM,
2013.
[46] Hugo Lövheim. A new three-dimensional model for emotions and monoamine neurotransmitters. Medical hypotheses, 78(2):341–348, 2012.
[47] Maryanne Martin. On the induction of mood. Clinical Psychology Review, 10(6):669–697,
1990.
[48] Albert Mehrabian. Basic dimensions for a general psychological theory implications for
personality, social, environmental, and developmental studies. 1980.
[49] George Miller and Christiane Fellbaum. Wordnet: An electronic lexical database, 1998.
[50] Fabian Monrose and Aviel Rubin. Authentication via keystroke dynamics. In Proceedings
of the 4th ACM conference on Computer and communications security, pages 48–56.
ACM, 1997.
[51] Fabian Monrose and Aviel D Rubin. Keystroke dynamics as a biometric for authentica-
tion. Future Generation computer systems, 16(4):351–359, 2000.
[52] AFM Nazmul Haque Nahin, Jawad Mohammad Alam, Hasan Mahmud, and Kamrul
Hasan. Identifying emotion by keystroke dynamics and text pattern analysis. Behaviour
& Information Technology, 33(9):987–996, 2014.
[53] Rosalind W Picard. Affective Computing. MIT Press, 2000.
[54] Maja Pusara and Carla E Brodley. User re-authentication via mouse movements. In
Proceedings of the 2004 ACM workshop on Visualization and data mining for computer
security, pages 1–8. ACM, 2004.
[55] Ira J Roseman and Craig A Smith. Appraisal theory: Overview, assumptions, varieties,
controversies. 2001.
[56] James A Russell. A circumplex model of affect. Journal of personality and social psy-
chology, 39(6):1161, 1980.
[57] Sergio Salmeron-Majadas, Olga C Santos, and Jesus G Boticario. Exploring indicators
from keyboard and mouse interactions to predict the user affective state. In Educational
Data Mining 2014, 2014.
[58] Pragya Shukla and Rinky Solanki. Web based keystroke dynamics application for identifying emotional state. International journal of advanced research in computer science and communication engineering, 2(11):4489–4493, 2013.
[59] Elizabeth Stapel. Box-and-whisker plots: Interquartile ranges and outliers, 1998.
[60] Carlo Strapparava, Alessandro Valitutti, et al. Wordnet affect: an affective extension of
wordnet. In LREC, volume 4, pages 1083–1086, 2004.
[61] Issa Traore et al. Biometric recognition based on free-text keystroke dynamics. Cyber-
netics, IEEE Transactions on, 44(4):458–472, 2014.
[62] G Tsoulouhas, D Georgiou, and A Karakos. Detection of learner’s affective state based
on mouse movements. J. Comput, 3:9–18, 2011.
[63] Wei-Hsuan Tsui, Poming Lee, and Tzu-Chien Hsiao. The effect of emotion on keystroke:
an experimental study using facial feedback hypothesis. In Engineering in Medicine and
Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, pages
2870–2873. IEEE, 2013.
[64] Lisa M Vizer, Lina Zhou, and Andrew Sears. Automated stress detection using keystroke
and linguistic features: An exploratory study. International Journal of Human-Computer
Studies, 67(10):870–886, 2009.
[65] David Watson, Lee A Clark, and Auke Tellegen. Development and validation of brief
measures of positive and negative affect: the panas scales. Journal of personality and
social psychology, 54(6):1063, 1988.
[66] Rainer Westermann, Kordelia Spies, Gunter Stahl, and Friedrich W Hesse. Relative
effectiveness and validity of mood induction procedures: a meta-analysis. European
Journal of Social Psychology, 26(4):557–580, 1996.
[67] Lotfi A Zadeh. Soft computing and fuzzy logic. IEEE software, 11(6):48, 1994.
[68] Lina Zhou, Judee K Burgoon, Jay F Nunamaker, and Doug Twitchell. Automating
linguistics-based cues for detecting deception in text-based asynchronous computer-
mediated communications. Group decision and negotiation, 13(1):81–106, 2004.
[69] P Zimmermann, Patrick Gomez, Brigitta Danuser, and S Schar. Extending usability:
putting affect into the user-experience. Proceedings of NordiCHI’06, pages 27–32, 2006.
[70] Philippe Zimmermann, Sissel Guttormsen, Brigitta Danuser, and Patrick Gomez. Affec-
tive computing – a rationale for measuring mood with mouse and keyboard. International
journal of occupational safety and ergonomics, 9(4):539–551, 2003.
Appendix A
Concept Matrix of Related Work
In this appendix, a concept matrix is presented (Table A.1), containing an overview of all
related literature on emotion recognition in computer systems.
Table A.1: Overview of related literature on emotion recognition in computer systems

Reference [69]
  Emotion model: PA emotion model
  Emotion elicitation method: video clips
  Source of features: keystrokes, mouse movements
  Data labeling method: questionnaire (self-assessment manikin)
  Method: statistical analysis
  Results: conclusion: possible discrimination between the neutral category and four others

Reference [64]
  Emotion model: physical stress, cognitive stress
  Emotion elicitation method: tasks inducing physical and cognitive stress
  Source of features: keystrokes, language parameters
  Data labeling method: questionnaire (11-point Likert scale)
  Method: decision trees, SVM, k-NN, AdaBoost, ANN
  Results: 75% accuracy for cognitive stress (k-NN), 62.5% for physical stress (AdaBoost, SVM, ANN); conclusion: the number of mistakes during typing decreases under stress

Reference [23]
  Emotion model: anger, boredom, confidence, distraction, excitement, focused, frustration, happiness, hesitance, nervousness, overwhelmed, relaxation, sadness, stress, tired
  Emotion elicitation method: none
  Source of features: keystrokes
  Data labeling method: questionnaire (5-point Likert scale filled in from time to time)
  Method: C4.5 decision tree (binary classification for each emotion)
  Results: accuracy of 77.4%-87.8% for confidence, hesitance, nervousness, relaxation, sadness and tiredness

Reference [13]
  Emotion model: confusion, delight, boredom, frustration, neutral
  Emotion elicitation method: none
  Source of features: keystrokes
  Data labeling method: questionnaire (5-point Likert scale)
  Method: k-NN, ANN, Bayesian network, naive Bayes
  Results: accuracy of 82% for emotional valence classification; 53% for specific emotions

Reference [62]
  Emotion model: boredom
  Emotion elicitation method: none
  Source of features: mouse movements, type of learning object
  Data labeling method: asking participants whether they are bored
  Method: C4.5 decision tree
  Results: accuracy of more than 90%

Reference [44, 45]
  Emotion model: PA emotion model
  Emotion elicitation method: none
  Source of features: keystrokes, smartphone usage data
  Data labeling method: questionnaire
  Method: multi-linear regression
  Results: conclusion: an individual feature subset increases accuracy

Reference [42]
  Emotion model: happiness, surprise, anger, disgust, sadness, fear, neutral
  Emotion elicitation method: none
  Source of features: keystrokes, smartphone sensors, discomfort index, location, time, weather
  Data labeling method: messages reporting emotional state
  Method: Bayesian network continuously updated with new data
  Results: 67.52% on average; best accuracy for happiness, surprise and neutral

Reference [63]
  Emotion model: happy, unhappy
  Emotion elicitation method: facial feedback
  Source of features: keystrokes
  Data labeling method: none
  Method: statistical analysis
  Results: conclusion: emotional states can be derived from keyboard behaviour

Reference [52]
  Emotion model: joy, fear, anger, sadness, disgust, shame, guilt, tired, neutral
  Emotion elicitation method: none
  Source of features: keystrokes, language parameters
  Data labeling method: questionnaire filled in from time to time
  Method: simple logistics, SMO, multilayer perceptron, random tree, C4.5, BF tree
  Results: accuracies of 70%-88% for fixed text and 64%-82% for free text

Reference [30]
  Emotion model: boredom, frustration
  Emotion elicitation method: none
  Source of features: keystrokes, mouse movements
  Data labeling method: questionnaire
  Method: k-NN classifier for each emotional state
  Results: accuracies of 83% for boredom and 74% for frustration

Reference [43]
  Emotion model: PA emotion model
  Emotion elicitation method: pictures from IAPS
  Source of features: keystrokes
  Data labeling method: questionnaire (self-assessment manikin)
  Method: statistical analysis
  Results: conclusion: the effect of emotion is significant in keystroke duration, latency and accuracy rate; the size of the emotional effect is small compared to individual variability

Reference [15, 16, 17]
  Emotion model: joy, anticipation, anger, disgust, sadness, surprise, fear, acceptance
  Emotion elicitation method: none
  Source of features: keystrokes, mouse, touchscreen interactions
  Data labeling method: questionnaire (PANAS)
  Method: statistical analysis, SVM, ANN
  Results: accuracy increase of 5% using fuzzy model
97
Appendix B
Participant Registration Form
Figure B.1
Appendix C
Participant Consent Form
Figure C.1
Appendix D
Software Download Page
Figure D.1
Appendix E
Model Hyperparameters
E.1 Regression
Table E.1 contains the results of the hyperparameter optimization process for the general
dynamics regression model that includes weather data, uses the joint PAD space and
applies feature selection.
E.2 Classification
Table E.2 contains the results of the hyperparameter optimization process for the general
dynamics classification model that excludes weather data, uses separate PAD dimensions
with classes defined by k-means clustering (k = 2), uses subsampling, and applies neither
class weight balancing nor feature selection.
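The grid searches behind Tables E.1 and E.2 sweep the number of trees and the maximum number of features. The skeleton of such a search can be sketched as follows, with the scoring function left as a caller-supplied placeholder (the grid below mirrors the tables' values, but the scorer and all names are assumptions, not the study's code):

```python
from itertools import product

def grid_search(train_and_score, param_grid):
    """Evaluate every hyperparameter combination with a caller-supplied
    train_and_score(params) -> score function and return the best
    (params, score) pair."""
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

# The sweep of Tables E.1/E.2: number of trees and max features;
# train_and_score would wrap cross-validated random forest training.
grid = {"n_trees": [100, 200, 300, 400, 500, 600, 700, 800],
        "max_features": ["sqrt", "all", "log2"]}
```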
Table E.1: Hyperparameter optimization for general dynamics regression model using grid search
Nr of trees Max nr features R2 MAE MSE EVS
100 sqrt 0.178650307 17.6086127 452.4303151 0.178966012
200 sqrt 0.175740189 17.70640952 453.9273299 0.175937216
300 sqrt 0.17683532 17.67471323 453.2118329 0.176943436
400 sqrt 0.177004813 17.69918836 453.1496681 0.177221663
500 sqrt 0.176618405 17.66490053 453.380924 0.17674802
600 sqrt 0.176685988 17.67822328 453.3584712 0.176828395
700 sqrt 0.176587233 17.66616614 453.4823105 0.176753282
800 sqrt 0.177313162 17.65592698 453.1341193 0.177451509
100 all 0.190923224 17.59279471 445.4954829 0.191133033
200 all 0.189904986 17.53400529 445.840431 0.190069898
300 all 0.191422213 17.57060529 445.0814295 0.191581371
400 all 0.188117428 17.60969947 446.7493161 0.18827938
500 all 0.19199562 17.57228254 444.6736571 0.1921756
600 all 0.185808124 17.62428254 448.0776765 0.185988301
700 all 0.184156946 17.62998095 448.8321779 0.18431094
800 all 0.187177941 17.57182011 447.439742 0.187301759
100 log2 0.17977641 17.67698624 451.8136086 0.179974733
200 log2 0.181768339 17.58517037 450.7620542 0.18196621
300 log2 0.174884347 17.67004444 454.578436 0.175064606
400 log2 0.177525459 17.6417418 453.088688 0.177756924
500 log2 0.178866185 17.59496296 452.3918334 0.179027812
600 log2 0.175404284 17.65729101 454.1455744 0.175656324
700 log2 0.172792224 17.66793862 455.624078 0.172939608
800 log2 0.174905998 17.6841619 454.3912707 0.175093419
Table E.2: Hyperparameter optimization for general dynamics classification model using grid search
Nr of trees Max nr features Precision Recall F1 ROC AUC
100 sqrt 0.71686927 0.707111808 0.711644788 0.713466663
200 sqrt 0.704371482 0.720373617 0.712215979 0.708925363
300 sqrt 0.728418536 0.735605812 0.731968255 0.730637864
400 sqrt 0.70607313 0.724677925 0.715033069 0.711031661
500 sqrt 0.724117413 0.729862325 0.726893421 0.725434758
600 sqrt 0.714933715 0.708710208 0.711795339 0.713038544
700 sqrt 0.719238781 0.732778821 0.725786337 0.723117703
800 sqrt 0.71267875 0.705886159 0.709097055 0.71023027
100 auto 0.717886841 0.722932798 0.720260402 0.719083396
200 auto 0.718318603 0.731641942 0.724828283 0.722042637
300 auto 0.726802884 0.728300622 0.72745952 0.72699334
400 auto 0.718379264 0.728124523 0.723001153 0.720913997
500 auto 0.71146811 0.729440075 0.720263543 0.716470289
600 auto 0.702790135 0.715592413 0.709103894 0.706543384
700 auto 0.708831857 0.719343856 0.713965292 0.711662466
800 auto 0.714707523 0.73272966 0.723398413 0.719528025
100 log2 0.720422463 0.721104756 0.720521645 0.720205053
200 log2 0.721426877 0.724824651 0.722283768 0.722127182
300 log2 0.704636676 0.724712415 0.714433151 0.710142404
400 log2 0.714535575 0.7323551 0.722995706 0.718913545
500 log2 0.735719652 0.74211784 0.738469775 0.737029937
600 log2 0.716503125 0.722218536 0.719149921 0.717729891
700 log2 0.723037174 0.731058632 0.726792436 0.724931794
800 log2 0.720901299 0.744886874 0.732580893 0.727714217