Djairhó Geuens

Non-intrusive Emotion Recognition using Computer Peripheral Input Analysis

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Supervisors: Prof. dr. ir. Sofie Van Hoecke, Prof. dr. ir. Rik Van de Walle
Counsellor: Olivier Janssens

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Academic year 2015-2016
Permission to Use
The author(s) gives (give) permission to make this master's dissertation available for consultation and to copy parts of this master's dissertation for personal use.
In the case of any other use, the copyright terms have to be respected, in particular with
regard to the obligation to state expressly the source when quoting results from this master’s
dissertation.
Ghent, 1 June 2016
Preface
For a long time I have been fascinated by the promising possibilities of computer systems that are emotionally aware. In these rapidly changing times, traditional ways of interacting with computers are outdated and the need for more human-like interaction grows. I am therefore very glad to have had the opportunity to explore one aspect of this research area. I would like to convey my sincere appreciation to my counsellor Olivier Janssens for his great guidance throughout the creation of this work. I would also like to express my gratitude to the people who participated in my study for being willing to provide their personal data in order for this research to succeed. Finally, I would like to thank Juanita Van Dam for the support provided during my entire time as a student and for reading and revising this work.
Djairho Geuens
Ghent, 1 June 2016
Non-intrusive Emotion Recognition using Computer Peripheral Input Analysis
by
Djairho Geuens
Master’s dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering
Academic year 2015–2016
Supervisors: Prof. dr. ir. Sofie Van Hoecke, Prof. dr. ir. Rik Van de Walle
Counsellor: Olivier Janssens
Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Ghent University
Abstract
Emotional intelligence is a crucial part of human communication and thus to build a truly
intelligent computer system, emotion recognition is required. A computer system that is able
to recognize emotions can use this information to acquire a much broader context to make
decisions, adapt itself or interact with the user. Current solutions for building artificial emo-
tional intelligence are limited by the fact that they require expensive and intrusive hardware.
Different studies have been conducted to overcome these issues using keystroke dynamics,
text content analysis, mouse movement analysis and other contextual factors. However, these
studies mainly use fixed patterns of text which again limits the real-world applicability. This
research combines and compares these techniques, without the aforementioned limitations, to build a system that can recognize emotions from normal computer interaction without influencing the user. A field study was conducted to collect interaction data and
emotional state information. The interaction data was collected during normal user activ-
ity and the emotional state information was collected using self-reports. Machine learning
models were built for different emotional dimensions. From the cross-validation results, some well-performing models that achieve accuracies up to 73% are presented; these can be used in real-world applications and further enhanced in future studies. We also provide our dataset, in addition to theoretical concepts and other suggestions that should be explored in future work.
Index Terms
emotion, recognition, non-intrusive, keystroke, mouse, context, machine learning
Non-intrusive Emotion Recognition using Computer Peripheral Input Analysis

Djairho Geuens
Supervisors: Sofie Van Hoecke, Rik Van de Walle
Counsellor: Olivier Janssens
Ghent University
Abstract—A computer system that is able to recognize emotions can use this information to acquire a much broader context to make decisions, adapt itself or interact with the user. Current solutions for building artificial emotional intelligence are limited by the fact that they require expensive and intrusive hardware. These issues can be overcome using keystroke dynamics, text content analysis, mouse movement analysis and/or other factors. However, most studies in this research area focus on only one of these techniques and only use fixed patterns of text. A system that can recognize emotions from normal computer interaction was built by combining the previously mentioned techniques, and by using non-fixed text the aforementioned limitation was negated. A field study was conducted to collect interaction data and emotional state information. The interaction data was collected during normal user activity and the emotional state information was collected using self-reports. From this data, features were extracted and different models were built based on these features. The best results in this research present classification accuracies up to 73%. We also provide our dataset, in addition to theoretical concepts and other suggestions that should be explored in future work.
Index Terms—emotion, recognition, non-intrusive, keystroke, mouse, context, machine learning.
I. INTRODUCTION
NOWADAYS, we live in an era in which one cannot imagine the absence of computers. They provide us daily with support in all kinds of branches of society. Recently, a lot of progress has been made in the area of artificially intelligent systems and all sorts of applications of machine learning. While the advantages of these techniques in everyday systems are countless and undeniable, they are limited by their lack of understanding and incorporation of human emotional context in different processes.
To realize any form of emotional intelligence, different requirements have to be met. First of all, there needs to be a method that computer systems can use to observe affective states in users. During the last few years, research regarding such methods has increased. Unfortunately, the solutions that have been proposed are often limited by different factors. A solution should be non-intrusive for the user. This is important because when a user knows that he is being observed, it is possible that he – whether deliberately or not – will change his affective state. This is of course undesirable.
D. Geuens is a graduate student at the Faculty of Engineering and Architecture, Ghent University, Ghent, Belgium; e-mail: [email protected].
S. Van Hoecke, R. Van de Walle and O. Janssens are with the Department of Electronics and Information Systems at Ghent University.
Furthermore, a minimum of required devices is desired because it is unlikely that real-world applications are going to make use of different kinds of additional sensor hardware. Often, such specialized equipment can be very expensive because it is not present in home or office environments by default or because it is medical equipment. This again limits the real-world application possibilities of the solution.
Systems that can automatically infer affective states by analyzing the user's typing behaviour and mouse movements as well as text content and other contextual information were investigated. The logging of keystrokes and mouse movements can be done using a specific piece of software running in the background on the computer, making it almost undetectable by the average user. Thus, the affective state is minimally influenced by the data collection. Mouse and keyboard are also standard equipment in normal home and office environments and are very inexpensive. This gives many possibilities for the application of affective computing solutions using keystroke dynamics and mouse movements in a real-world environment on a large scale.
A field study was conducted using experience sampling. Participants were recorded during their daily activities. The intention was to record emotions in the moment instead of retrospectively (at a later time or another place). To enforce this technique, software was installed on the participants' computers. The software ran as a background process on the computer. The participants were free to use their computers as usual in order to avoid external intrusiveness and influence. Subsequently, different features were extracted from the collected raw data. To build a model that can accurately predict a user's emotional state, the use of Random Forest regression models, Random Forest classification models, SVM regression models and SVM classification models was examined. Two main types of models were built: general models and individual models. A general model entails one model that can predict the emotional state regardless of the user. Such a model would be of great value if it can make highly accurate predictions. An individual model builds one model for each user using only data from this user. This concept is based on the assumption that every person is unique and thus will behave uniquely in different emotional states.
II. RELATED WORK
A. Emotion Theory
Before discussing techniques to assess emotions, one should have a good understanding of what is actually meant when using the words "affect", "mood" and "emotion". In this research the definitions given by Forgas [1] are used. Affect is used as a general term to refer to the combination of moods and emotions. Moods have a rather low intensity, a long duration and little cognitive content. Emotions are much more intense, of a shorter duration and clearer in terms of cognitive content. A lot of psychological research has been performed, which has resulted in a number of different models to explain human emotional behaviour. Two different models that have been developed will be discussed: categorical and dimensional models.
The categorical or discrete model is based on how emotions are described through language. Ekman presented six basic emotions [2]: anger, surprise, happiness, disgust, sadness and fear. He performed research on facial expressions and how people in different cultural environments recognize them. He found that the six proposed basic emotions were commonly recognizable in most cultures. The six basic emotions were later expanded to 15 basic emotions [3]. However, many other sets of categories have been proposed as well [4].
Another very popular model is the dimensional model, presented by Russell [5]. This model defines emotional states as points in a continuous dimensional space. It has been suggested that continuous space models perform better in out-of-lab applications than discrete models [6]. The dimensional space used can be either uni-dimensional or multi-dimensional. The PANAS (Positive And Negative Affect Scales) model is a popular uni-dimensional model [7]. The PAD (pleasure-arousal-dominance) model is a three-dimensional model that was developed by Mehrabian and Russell [8] and will also be used in this research. The pleasure dimension indicates the valence measure of an emotional state. The arousal dimension indicates the level of affective activation of an emotional state. The dominance dimension is used to indicate the amount of power or control a person experiences in an emotional state. Often, the PA model is used, which is a simpler version of the PAD model that leaves out the dominance dimension. However, this simpler PA model has been criticized [9] because it might not be possible to fully differentiate between several emotions (e.g. anger and fear). Furthermore, it was found after analyzing data from Bradley & Lang's experiment [10] that the emotional data scattered in a V-shape and showed some clear holes in the PA space. Recently, Lövheim [11] proposed a new three-dimensional emotion model that uses the monoamine neurotransmitters serotonin, dopamine and noradrenaline as dimensions instead of pleasure, arousal and dominance.
B. Emotional Experimentation Environment
There are two main approaches to collect emotion data. One approach is to induce moods in the participants in a laboratory setting and then collect the desired data. This requires a mood induction procedure (MIP). A number of MIP categories exist [12], including imagination, social interaction and using film or story. While MIPs have the advantage of being able to control the moods of the participant, the experiments often take a considerably long time and do not always guarantee successful induction. Furthermore, individuals may react differently in a laboratory setting compared to a more naturalistic setting because the experimental situation may influence the individual [13].
A naturalistic setting during an experiment aims to observe participants in their natural environment to have minimal influence on the participant's behaviour. The experience sampling methodology (ESM) [14] is such an approach. In this technique a participant's experiences are recorded as they occur at certain moments in time. This has the advantage that it can capture daily life from moment to moment without recall issues, and that it is much easier for participants to indicate their experiences at the moment they are experiencing them. A main disadvantage of using a naturalistic setting is that all techniques provide the experimenter with subjective information. This could lead to biased information, and individuals may repress certain information or change their responses to fit the norms of their culture.
C. Keystroke Dynamics
Keystroke dynamics is the study of the characteristics that are present in a user's typing rhythm when using a keyboard. An individual's typing behaviour can change when the individual experiences different emotions. This means that information about a user's typing behaviour may allow us to infer the user's emotional state. Two main approaches for keystroke dynamics analysis exist: fixed text analysis and free text analysis. The former usually requires participants to type one or more fixed pieces of text multiple times during the data collection process. Using fixed text to build a model implies that this model can only be used at those moments that the user types one or more of the fixed pieces of text on which the model was trained. The latter allows any sequence of text as input. This enables the model to be used during continuous monitoring, which is very desirable for affect recognition because it means that emotional information is available in the computer system at any time. Furthermore, using this approach it is also possible to obtain the keystroke data unobtrusively, which is also highly desirable. Typical keystroke dynamics features are keypress duration and latencies between multiple keystrokes.
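To make these features concrete, the following sketch computes keypress durations and down-to-down latencies from timestamped keydown/keyup events; the event format and function name are illustrative assumptions, not the thesis' actual logging format.

```python
def keystroke_timing_features(events):
    """Compute basic keystroke timing features from a list of
    (timestamp_ms, key, kind) events, kind being 'down' or 'up'.
    Event format and names are illustrative assumptions."""
    pending = {}     # key -> last unmatched keydown timestamp
    durations = []   # down-to-up duration per keypress
    down_times = []  # keydown timestamps, for latencies
    for t, key, kind in events:
        if kind == "down":
            pending[key] = t
            down_times.append(t)
        elif kind == "up" and key in pending:
            durations.append(t - pending.pop(key))
    # down-to-down latencies between consecutive keydowns
    dd_latencies = [b - a for a, b in zip(down_times, down_times[1:])]
    return durations, dd_latencies

events = [(0, "h", "down"), (80, "h", "up"),
          (120, "i", "down"), (210, "i", "up")]
durs, lats = keystroke_timing_features(events)
# durs == [80, 90], lats == [120]
```

Up-to-up and up-to-down latencies follow the same pattern using the keyup timestamps.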
D. Text Content Analysis
The actual content of text contains a lot of valuable information concerning affect. A possible starting point for a linguistic analysis of text to extract emotional information is to use specific affective lexicons. Examples of such lexicons are the Affective Norms for English Words (ANEW) [15], SentiWordNet [16] and WordNet Affect [17]. Clore et al. [18] argued that words need to be distinguished based on whether they directly refer to emotional states or contain an indirect reference that depends on the context. A more abstract analysis using specific textual features is also possible. Vizer et al. [19] used a number of features defined by Zhou et al. [20]. It is also possible to calculate features based on annotated datasets and to use these in a machine learning algorithm.
E. Mouse Behaviour
Besides text content and keystroke dynamics, mouse behaviour may also contain useful information on an individual's emotional state. One can distinguish single clicks and double clicks, using the left, right or possibly middle button. Furthermore, one can observe mouse movements and mouse wheel movements. From the mouse movement data, distance, angle and speed features between pairs of data points can be extracted.
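As a minimal sketch (the point format is an assumption, not the thesis' logging format), the pairwise distance, angle and speed features can be derived from timestamped cursor positions like this:

```python
import math

def mouse_features(points):
    """Distance, angle (radians) and speed between consecutive
    (timestamp_ms, x, y) samples; format is an illustrative assumption."""
    feats = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        dist = math.hypot(dx, dy)
        angle = math.atan2(dy, dx)
        speed = dist / (t1 - t0) if t1 > t0 else 0.0  # pixels per ms
        feats.append((dist, angle, speed))
    return feats

pts = [(0, 0, 0), (10, 3, 4), (30, 3, 4)]
f = mouse_features(pts)
# first pair: distance 5.0 px, speed 0.5 px/ms
```

The average mouse speed used later in this work is then simply the mean of the speed component over all pairs.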
F. Emotion Recognition
There have been numerous studies investigating the possible applications of keyboard and mouse behaviour in authentication and security [21–26], but for this work the applications regarding emotion recognition are more interesting.
Zimmerman et al. [27] described a method to correlate keyboard and mouse interaction with affective states using a MIP and the PA emotion model. Later, Vizer et al. [19] proposed a new way of assessing cognitive and physical stress by analyzing keystroke dynamics using both content and timing features. They achieved correct classification rates of 62.5% for physical stress and 75% for cognitive stress. Epp et al. [28] focused more on actual emotions and investigated the possibility of identifying different emotional states using keystroke analysis. They focused on gathering keystroke data in a natural context using the experience sampling methodology rather than a laboratory environment. They achieved reliable accuracy rates for fixed text ranging from 77.4% to 87.8% for confidence, hesitance, nervousness, relaxation, sadness and tiredness. Tsoulouhas et al. [29] introduced a method to detect student boredom during attendance of an online lesson in a laboratory setting and achieved a correct classification rate for fixed text above 90%. Besides analyzing keystroke and mouse features, it is also possible to use other additional sensors that are present in today's smartphones. LiKamWa et al. [30] performed an experimental study, which they extended in [31], to classify different moods using this kind of data. Continuing in the area of smartphones, Lee et al. [32] investigated the possibilities of automatically recognizing emotions in social network service posts. They achieved an average classification accuracy of 67.5%. Later, Tsui et al. [33] validated the hypothesis about the existence of differences in typing patterns between emotional states using the facial feedback hypothesis. Nahin et al. [34] focused on textual content as well as keystroke dynamics. They used WordNet and the ISEAR dataset in combination with a vector space model. Hernandez et al. [35] presented a method to evaluate a user's boredom and frustration in an intelligent learning environment but focused on free text, in contrast to [29]. Recently, Lee et al. conducted another study [36] that examined the variance in keystroke typing patterns caused by emotions, using visual stimuli to induce emotional states. They concluded that the effect of emotion is significant but small compared to the individual variability. Along with the increasing popularity of fuzzy logic, Shukla et al. [37] and Bakhtiyari et al. [38–40] experimented with fuzzy models. Finally, a number of meta-studies were conducted by Kolakowska [41], [42] and a number of interesting possible applications were presented in [43].
The best aspects of these studies are combined in order to present a methodology for this research.
III. METHODOLOGY
A. Data Collection
The first step in this work is the collection of data. A field study was performed in which participants' keystroke, mouse, location and weather data were gathered together with subjective indications of emotional states in the PAD emotion model. The study used an experience sampling method. The participants were periodically requested to indicate their emotional state while other data was continuously collected in the background by custom-built software that was installed on the participants' computers.
The field study was conducted from November 15th, 2015 until May 1st, 2016, with 14 participants contributing data for, on average, 22 weeks. There were no restrictions on their activities during the study. Upon signup, each participant provided their name, e-mail address, gender, birthdate, place of birth, occupation, education, nationality, first language, most used language on the computer, dominant hand, typing skills, computer skills, percentage of total computer time spent on the concerned computer, keyboard layout, mouse type and computer type. All participants used the Windows operating system (Windows 7 or higher).
B. Data Processing
The second step, before being able to build models, was the processing of the data to extract features.
1) Keystroke Features: The features that were extracted from the keydown and keyup events consist of timing features and frequency features.
To be able to calculate keystroke timing features, corresponding keydown and keyup events needed to be matched. Some special cases needed to be considered as well (e.g. multiple keydown events corresponding with one keyup event). The average typing speed and the mean, weighted mean, maximum, minimum, standard deviation, variance, mode, median, skew and kurtosis of the down-to-down latency, up-to-up latency, up-to-down latency and down-to-up duration were extracted. This results in 41 features. An outlier removal process was also applied to the latencies and durations using the interquartile range of the values, in order to deal with long periods of inactivity.
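The interquartile-range filter mentioned above can be sketched as follows. The 1.5×IQR fence is the common convention; the thesis does not state its exact multiplier, so treat it as an assumption.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Drop latencies/durations outside [Q1 - k*IQR, Q3 + k*IQR].
    k=1.5 is the conventional fence, assumed here."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# a long pause (60000 ms) among normal latencies is removed
lats = [120, 130, 125, 140, 118, 60000]
clean = remove_outliers_iqr(lats)
```

This discards the extreme inactivity gap while keeping all typical inter-key latencies.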
The frequency features that were extracted were the numerical character frequency, alphabetical character frequency, delete frequency, backspace frequency, error frequency (backspace or delete), shift frequency, space frequency, arrow frequency, caps lock frequency, return frequency, punctuation frequency, average word length and long pause frequency.
2) Textual Content Features: Before extracting textual content features, the pieces of text contained in each sample needed to be reconstructed from the keydown events. Next, the text pieces were converted to a matrix of token counts and this matrix was then transformed to a normalized term frequency representation. The number of features that are produced by this technique depends on the collection of text pieces that is used as input.
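The thesis does not name its implementation; scikit-learn's `CountVectorizer` and `TfidfTransformer` are one plausible realization of the counts-then-normalized-term-frequency pipeline, assumed here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["feeling great today", "great work today", "so tired"]

# token counts per reconstructed text piece
counts = CountVectorizer().fit_transform(texts)

# normalized term frequency (no IDF weighting, L2-normalized rows)
tf = TfidfTransformer(use_idf=False, norm="l2").fit_transform(counts)

# one row per text piece; the column count depends on the vocabulary
print(tf.shape)
```

Note how the feature dimensionality is determined by the vocabulary of the input collection, which is why the number of textual features varies with the dataset.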
3) Mouse Movement Features and Contextual Features: One mouse movement feature was extracted: the average mouse speed. The contextual features that were extracted are the temperature, humidity, pressure and the discomfort index. However, because some of the participants disabled the location services on their computer, not all samples contain weather data, as this depends on the location information.
C. Model Building
In this work, different types of machine learning models were built. A first subdivision is made based on whether they are regression or classification models. A second subdivision is made based on whether they are built using data of all participants or of individual participants. Finally, a third subdivision is made based on whether the models are built using keystroke dynamics, mouse dynamics and contextual data, or using textual content data.
1) Regression: For the general regression models, three degrees of freedom were used. The first is whether contextual data, i.e. weather data, is used. This yields different results as some participants did not have their location services enabled, which caused their data not to contain weather information. As the models that were used in this research are not capable of dealing with missing data, all samples that do not contain the weather information needed to be removed. The second degree of freedom is whether a model was built for each separate dimension of the PAD model or whether one model was built to predict the entire PAD-tuple (joint). The third degree of freedom is whether feature selection was performed or not. To perform feature selection, a random forest model was built using all features and then the 20 most important features were extracted. Afterwards, the actual regression (or classification) model was built using only these features. In total, 8 different random forest models for general dynamics regression were built and evaluated. Individual dynamics regression models were built for 8 participants, as only those participants who provided at least 59 samples (such that the number of samples was at least equal to the number of possible features) were included. The same degrees of freedom were used to build multiple variants of the individual models.
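The feature selection step described above can be sketched with scikit-learn (an assumed implementation; the thesis does not name its library, and the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 41))  # e.g. the 41 keystroke timing features
y = X[:, 0] * 3 + X[:, 1] + rng.normal(scale=0.1, size=200)

# rank features by random forest importance, keep the top 20
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
top20 = np.argsort(rf.feature_importances_)[::-1][:20]
X_selected = X[:, top20]

# the actual regression/classification model is then trained on
# X_selected only
```

The 59-sample threshold mentioned above matches this setup: with up to 59 candidate features, at least as many samples as features are required before selection.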
To build general text content regression models, the text content feature data of all participants was used to train an SVM for regression. There were no degrees of freedom in these models: the SVM implementation that was used is not capable of predicting the entire PAD-tuple, hence different models for each dimension were built. Furthermore, all textual content features are equally important by definition (no term importances were defined, so each term is equally important), so there can be no degrees of freedom there either. Individual models were built for 13 participants.

TABLE I
PAD MAPPING ACCORDING TO OCTANTS IN PAD SPACE

PAD octant   Emotion
P-A-D-       Bored
P-A-D+       Disdainful
P-A+D-       Anxious
P-A+D+       Hostile
P+A-D-       Docile
P+A-D+       Relaxed
P+A+D-       Dependent
P+A+D+       Exuberant
Predicted PAD axis values needed to be interpreted as an emotional state, or better yet, a weighted combination of multiple emotional states. This requires mapping discrete emotional states onto the PAD space. Different mappings have been proposed and empirical approaches to building such mappings have been taken [44]. If we assume that a person experiences a weighted combination of multiple emotions, fuzzy logic can be used to determine the extent to which each emotion is present, given PAD axis values. This can be done by defining a set of fuzzy rules that describe the relationship between the PAD axis values and each emotion. For example, if the mapping presented in Table I is used, fuzzy rules can be defined of the form: IF pleasure IS positive AND arousal IS negative AND dominance IS positive THEN relaxed IS present. The fuzzy terms "negative", "positive" and "present" can then be defined by membership functions that take values between 0 and 1. The IS-operator calculates the function value for the variable on its left-hand side using the membership function for the fuzzy term on its right-hand side. The AND-operator can be implemented using a so-called t-norm, and for combining multiple rules concerning the same fuzzy terms and for defuzzification there are also multiple possible approaches [45]. Applying such fuzzy rules to the predicted PAD axis values results in a set of membership values for each emotion that can be interpreted as the extent to which each of these emotions is present in the current emotional state of a person. Using fuzzy logic has the advantage of reducing the importance of the accuracy of predicted PAD axis values, as less accurate values can still result in a correct conclusion concerning the dominant emotion, because fuzzy logic has the capacity to take into account the inherent fuzziness of the emotion information. Furthermore, it is also possible to draw conclusions about which emotions are likely to occur at the same time.
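A minimal sketch of such a rule, assuming simple linear membership functions on a 0–100 PAD scale and the product t-norm (both are illustrative choices, not fixed by the rule form above):

```python
def positive(v):
    """Membership in 'positive' on a 0-100 axis (linear, assumed)."""
    return max(0.0, min(1.0, (v - 50) / 50))

def negative(v):
    """Membership in 'negative' on a 0-100 axis (linear, assumed)."""
    return max(0.0, min(1.0, (50 - v) / 50))

def relaxed(p, a, d):
    """IF pleasure IS positive AND arousal IS negative AND
    dominance IS positive THEN relaxed IS present,
    with AND implemented as the product t-norm."""
    return positive(p) * negative(a) * positive(d)

# high pleasure, low arousal, high dominance -> strongly relaxed
print(relaxed(90, 10, 80))  # 0.384
```

Evaluating one such rule per octant of Table I yields the membership vector over all eight emotions.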
2) Classification: To be able to perform classification, the PAD emotion model first needs to be divided into a number of different classes, i.e. discretized. Three different approaches were used to divide the PAD emotion model into classes. The first approach uses the k-means clustering algorithm to assign all samples that belong to the same cluster to the same class, so that k classes are obtained. The value of k was chosen to be 8 to divide the entire PAD space into different classes. This choice is based on the fact that the PAD space is three-dimensional and thus contains 8 octants. If only one PAD dimension needed to be divided into classes, the value of k was chosen to be 2. The second approach splits each PAD dimension into two parts (a negative and a positive part, respectively having values between 0 and 50 and between 50 and 100). Using this approach, two classes per dimension or 2³ = 8 classes for the entire PAD space are obtained. The third approach is similar to the second one but splits each PAD dimension into three parts (a negative, a neutral and a positive part, respectively having values between 0 and 40, between 40 and 60 and between 60 and 100). This yields three classes per dimension or 3³ = 27 classes for the entire PAD space. The goal of the classification models is then to predict the correct class for each sample.
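The positive/negative discretization can be sketched as follows; the octant-to-integer bit encoding is an illustrative choice, not specified in the text.

```python
def pad_octant(p, a, d, threshold=50):
    """Map a (pleasure, arousal, dominance) tuple on 0-100 scales
    to one of 2**3 = 8 classes by splitting each axis at 50.
    The bit encoding (P=4, A=2, D=1) is an assumption for illustration."""
    return ((p >= threshold) << 2) | ((a >= threshold) << 1) | (d >= threshold)

# P+A-D+, i.e. "relaxed" in Table I's mapping
print(pad_octant(80, 20, 70))  # 5
```

The three-way split works the same way with two thresholds (40 and 60) and base-3 digits per dimension.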
For the general dynamics classification models, six degrees of freedom were used. The first is whether contextual data, i.e. weather data, is used, as described above. The second degree of freedom is again whether a model was built for each separate dimension of the PAD model or whether one model was built to predict the entire PAD-tuple. The third degree of freedom is whether the class system is based on the k-means clusters, on positive/negative scales or on positive/neutral/negative scales. The fourth degree of freedom is whether no class weight balancing, general class weight balancing or subsample class weight balancing is performed. General class weight balancing associates weights with classes that are inversely proportional to the class frequencies in the input data. Subsample class weight balancing essentially does the same thing but adjusts the class weights according to the class frequencies in the bootstrap sample for each tree grown. The fifth degree of freedom is whether or not subsampling is performed in order to reduce bias due to class skew. Subsampling removes random samples from each class that contains more samples than the least frequent class, until every class contains an equal number of samples. The sixth degree of freedom is whether feature selection was performed or not. The feature selection process is the same as for the regression models. In total, 96 different random forest models for general dynamics classification were built and evaluated. For the individual dynamics classification models the same degrees of freedom as for the general dynamics classification models were used to build multiple variants of the models. They were built for the same 8 participants as in the individual dynamics regression models.
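The subsampling step can be sketched like so (a minimal version; the helper name and use of hashable class labels are assumptions for illustration):

```python
import random
from collections import defaultdict

def subsample_balance(samples, labels, seed=0):
    """Randomly drop samples from over-represented classes until all
    classes have as many samples as the least frequent one."""
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    n_min = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        for s in rng.sample(group, n_min):
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

X = list(range(10))
y = [0] * 7 + [1] * 3  # class 0 is over-represented
Xb, yb = subsample_balance(X, y)
# both classes now contribute 3 samples each
```

The trade-off named in the text applies here as well: subsampling reduces class skew at the cost of discarding data.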
To build general text content classification models, the text content feature data of all participants is used to train an SVM for classification. There are four degrees of freedom in these models. The first degree of freedom is again whether a model was built for each separate dimension of the PAD model or whether one model was built to predict the entire PAD-tuple. The second degree of freedom is the type of class system that is used. The third degree of freedom is the type of class weight balancing that is used. The fourth degree of freedom is whether or not subsampling is performed. In total, 18 different SVM models for general text classification were built and evaluated. For the individual text content classification models the same degrees of freedom as for the general text content classification models were used to build multiple variants of the models. They were built for the same 13 participants as in the individual text content regression models.
IV. RESULTS
The regression models were not satisfying: they did not perform much better than always predicting the mean, and their results will therefore not be presented here.
The highest ROC AUC score (area under the receiver operating characteristic curve) for general dynamics classification models is obtained with the combination of not including weather data, using the joint PAD space divided into classes using k-means clustering with k = 8, not using class weight balancing, using subsampling and using feature selection. This model was built using 704 labeled samples and achieves a ROC AUC score of 0.75. The corresponding confusion matrix is presented in Table II. However, all models built on separate PAD dimensions using classes determined by the k-means clusters (with k = 2) or by the positive/negative class definition, no class weight balancing and subsampling have similar ROC AUC scores to this model but also have much higher precision, recall and F1 scores (≈ 0.73). Hence, the latter models actually perform better. It was observed that the same types of models also perform best for individual dynamics classification, general text content classification and individual text content classification when considering precision, recall, F1 and ROC AUC scores. The confusion matrix of the best general text content classification model is presented in Table III.
V. DISCUSSION
A. Overall Results
The poor results for the regression models do not mean that they cannot be used for this application. The most plausible explanation for the poor performance is the limited size and bias of the dataset; a larger dataset containing more uniformly distributed samples may well yield better regression models. The regression models built using the text content features did not perform well either, which can again be explained by the limited size of the dataset: the number of text content features extracted is on average much higher than the number of available samples, resulting in a very sparse feature space.
The classification models presented much better results and a clear pattern of a family of models that perform very well. The fact that the models using k-means clustering to generate classes perform very similarly to the models using the classes generated according to the positive/negative class definition can be explained by the value distributions for each PAD axis. These distributions contain two clusters of values for each PAD axis, which are very likely the same clusters that the k-means clustering algorithm finds. Indeed, inspecting the classes generated by the k-means algorithm (with k = 2) confirms this presumption. Presumably, the presence of these two obvious clusters is caused by the tendency of the participants to always move the sliders while indicating their emotional state, so that each slider ends up either left or right of the center but never in the center. These two clusters align very closely with the positive/negative class definition, which means that the two ways of defining classes yield very similar classes, and therefore very similar results. One of the main reasons for the better performance of the classification models can probably be attributed to the subsampling technique, which solves the class skew problem; this was one of the main limitations for the regression models. A similarly intelligent technique could perhaps be applied to the continuous dataset to improve the regression models as well.

TABLE II
CONFUSION MATRIX FOR GENERAL DYNAMICS CLASSIFICATION MODEL WITH BEST ROC AUC SCORE

          1       2       3       4       5       6       7       8
  1   55.68%   5.68%   5.68%   7.92%   5.68%  14.80%   3.44%   1.12%
  2    7.92%  54.56%   4.56%   2.24%  10.24%   3.44%   6.80%  10.24%
  3    7.92%   7.92%  40.88%   5.68%  13.60%   3.44%   7.92%  12.48%
  4    4.56%   7.92%   1.12%  65.92%   6.80%   2.24%   5.68%   5.68%
  5    6.80%   5.68%   6.80%   3.44%  62.48%  10.24%   4.56%   0.00%
  6    3.44%   1.12%   3.44%   2.24%   6.80%  77.28%   2.24%   3.44%
  7    6.80%   9.12%   4.56%   7.92%   4.56%   6.80%  51.12%   9.12%
  8    5.68%   6.80%  12.48%   4.56%   3.44%   2.24%  12.48%  52.24%

TABLE III
CONFUSION MATRIX FOR BEST GENERAL TEXT CONTENT CLASSIFICATION MODEL

          1       2       3       4       5       6       7       8
  1   50.00%  17.04%   1.12%   9.12%  10.24%   4.56%   4.56%   3.44%
  2    4.72%  51.12%   2.24%   4.56%  13.60%   6.80%  11.36%   5.68%
  3    5.68%  11.36%  55.68%   2.24%   6.80%   2.24%  10.24%   5.68%
  4    7.92%  15.92%  11.36%  34.08%   5.68%   1.12%  13.60%  10.24%
  5   10.24%  21.60%  13.60%   2.24%  34.08%   5.68%   5.68%   6.80%
  6    5.68%  12.48%   4.56%   4.56%  13.60%  45.44%   7.92%   5.68%
  7    3.44%  18.16%   3.44%   9.12%   6.80%   4.56%  52.24%   2.24%
  8    9.12%  22.72%   6.80%  10.24%  17.04%   3.44%   3.44%  27.28%
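The observation that 2-means clustering on a slider axis reproduces the positive/negative split can be illustrated with a small sketch (synthetic bimodal slider values, not the study's dataset; scikit-learn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Bimodal slider values: participants push the slider left or right of center (0).
values = np.concatenate([rng.normal(-0.6, 0.15, 200),
                         rng.normal(0.6, 0.15, 200)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values.reshape(-1, 1))
kmeans_labels = km.labels_
sign_labels = (values > 0).astype(int)

# The two labelings should agree up to an arbitrary cluster numbering.
agreement = max(np.mean(kmeans_labels == sign_labels),
                np.mean(kmeans_labels != sign_labels))
print(round(float(agreement), 2))
```

With two well-separated modes, the agreement is (near-)perfect, mirroring the behaviour described above.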
B. Contextual Data and Feature Selection
Features concerning weather information (temperature, pressure, humidity and discomfort index) do not appear to make much difference to model performance. For regression models, however, there is a pattern indicating that weather features can be useful. One of the main problems is that including weather data causes many samples to be removed, which in turn decreases model performance due to the limited dataset. At first sight, one might conclude that weather data provides no discriminative information to the model. In classification tasks, however, this effect is much smaller and almost non-existent, indicating that weather features might provide useful data. Thus, due to the limited size of the dataset, the effects of using contextual data are obfuscated.
Finally, using feature selection does not have a significant impact on the results. This is consistent with the feature importances for the best general dynamics regression and classification models as provided by the Random Forest algorithm.
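Inspecting Random Forest feature importances, as done for the models above, can be sketched as follows (illustrative only, with synthetic data; scikit-learn assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: a few informative features among many weak ones.
X, y = make_classification(n_samples=300, n_features=15, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances sum to 1; ranking them shows which features a selector would keep.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:5], rf.feature_importances_.sum())
```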
VI. CONCLUSION
First, the pleasure-arousal-dominance (PAD) emotion model was used, providing a more discriminative way of defining emotions; this emotion model also allows regression techniques to be applied to the emotion recognition task. Second, as opposed to many other studies that use fixed text, free text data was used in this research; the potential for real-world application is much greater when free text is used. Third, models were built on a combination of dynamics features (both keystroke and mouse) and contextual features (i.e. weather data) and compared to investigate the added value of contextual features. Fourth, both general and individual models were built to investigate the possible advantages of using more user-specific models to make accurate predictions. Fifth, a methodology was presented to use regression models in more practical settings using fuzzy logic. Sixth, well-performing classification models were created that predict two class levels on each separate dimension of the PAD model based on dynamics features. Finally, an easy-to-use dataset with a hierarchical structure was constructed and prepared for public use. One of the main problems in this research area is that most studies use different datasets, metrics, features and/or emotion models, which makes comparing findings and results difficult. By allowing our dataset to be used in other studies, the problem of differing datasets and emotion models is addressed.
One of the limitations of this study was the reliance on the participants for establishing the ground truth. To overcome this, other techniques may be used to identify the participant's emotional state. The usage of textual content could also be improved by detecting emotionally charged words, text normalization, stemming techniques and language detection. Finally, clustered models could be built. This is based on the assumption that, although every person is unique, there may exist a finite number of clusters, or user profiles, comprising people who show similar computer behaviour for each emotional state. One model could then be built for each profile.
This study presents classification models that are suitable for real-world application, but it also raises ethical concerns regarding privacy, as the monitoring process used goes unnoticed by the user.
REFERENCES
[1] J. P. Forgas, “Mood and judgment: the affect infusion model (aim).”Psychological bulletin, vol. 117, no. 1, p. 39, 1995.
[2] P. Ekman and W. V. Friesen, “Constants across cultures in the face andemotion.” Journal of personality and social psychology, vol. 17, no. 2,p. 124, 1971.
[3] P. Ekman, “An argument for basic emotions,” Cognition & emotion,vol. 6, no. 3-4, pp. 169–200, 1992.
[4] P. N. Johnson-Laird and K. Oatley, “The language of emotions: Ananalysis of a semantic field,” Cognition and emotion, vol. 3, no. 2, pp.81–123, 1989.
[5] J. A. Russell, “A circumplex model of affect.” Journal of personalityand social psychology, vol. 39, no. 6, p. 1161, 1980.
[6] H. Gunes, B. Schuller, M. Pantic, and R. Cowie, “Emotion representa-tion, analysis and synthesis in continuous space: A survey,” in AutomaticFace & Gesture Recognition and Workshops (FG 2011), 2011 IEEEInternational Conference on. IEEE, 2011, pp. 827–834.
[7] D. Watson, L. A. Clark, and A. Tellegen, “Development and validation ofbrief measures of positive and negative affect: the panas scales.” Journalof personality and social psychology, vol. 54, no. 6, p. 1063, 1988.
[8] A. Mehrabian, “Basic dimensions for a general psychological theoryimplications for personality, social, environmental, and developmentalstudies,” 1980.
[9] C. Kaernbach, “On dimensions in emotion psychology,” in AutomaticFace & Gesture Recognition and Workshops (FG 2011), 2011 IEEEInternational Conference on. IEEE, 2011, pp. 792–796.
[10] P. J. Lang, M. K. Greenwald, M. M. Bradley, and A. O. Hamm, “Lookingat pictures: Affective, facial, visceral, and behavioral reactions,” Psy-chophysiology, vol. 30, pp. 261–261, 1993.
[11] H. Lovheim, “A new three-dimensional model for emotions andmonoamine neurotransmitters,” Medical hypotheses, vol. 78, no. 2, pp.341–348, 2012.
[12] R. Westermann, K. Spies, G. Stahl, and F. W. Hesse, “Relative effec-tiveness and validity of mood induction procedures: a meta-analysis,”European Journal of Social Psychology, vol. 26, no. 4, pp. 557–580,1996.
[13] M. Martin, “On the induction of mood,” Clinical Psychology Review,vol. 10, no. 6, pp. 669–697, 1990.
[14] J. M. Hektner, J. A. Schmidt, and M. Csikszentmihalyi, Experiencesampling method: Measuring the quality of everyday life. Sage, 2007.
[15] M. M. Bradley and P. J. Lang, “Affective norms for english words(anew): Instruction manual and affective ratings,” Technical Report C-1, The Center for Research in Psychophysiology, University of Florida,Tech. Rep., 1999.
[16] A. Esuli and F. Sebastiani, “Sentiwordnet: A publicly available lexicalresource for opinion mining,” in Proceedings of LREC, vol. 6. Citeseer,2006, pp. 417–422.
[17] C. Strapparava, A. Valitutti et al., “Wordnet affect: an affective extensionof wordnet.” in LREC, vol. 4, 2004, pp. 1083–1086.
[18] G. L. Clore, A. Ortony, and M. A. Foss, “The psychological foundationsof the affective lexicon.” Journal of personality and social psychology,vol. 53, no. 4, p. 751, 1987.
[19] L. M. Vizer, L. Zhou, and A. Sears, “Automated stress detection usingkeystroke and linguistic features: An exploratory study,” InternationalJournal of Human-Computer Studies, vol. 67, no. 10, pp. 870–886, 2009.
[20] L. Zhou, J. K. Burgoon, J. F. Nunamaker, and D. Twitchell, “Automatinglinguistics-based cues for detecting deception in text-based asynchronouscomputer-mediated communications,” Group decision and negotiation,vol. 13, no. 1, pp. 81–106, 2004.
[21] R. S. Gaines, W. Lisowski, S. J. Press, and N. Shapiro, “Authenticationby keystroke timing: Some preliminary results,” DTIC Document, Tech.Rep., 1980.
[22] F. Monrose and A. Rubin, “Authentication via keystroke dynamics,” inProceedings of the 4th ACM conference on Computer and communica-tions security. ACM, 1997, pp. 48–56.
[23] F. Monrose and A. D. Rubin, “Keystroke dynamics as a biometric forauthentication,” Future Generation computer systems, vol. 16, no. 4, pp.351–359, 2000.
[24] M. Pusara and C. E. Brodley, “User re-authentication via mouse move-ments,” in Proceedings of the 2004 ACM workshop on Visualization anddata mining for computer security. ACM, 2004, pp. 1–8.
[25] M. Fairhurst, D. Costa-Abreu et al., “Using keystroke dynamics forgender identification in social network environment,” in Imaging forCrime Detection and Prevention 2011 (ICDP 2011), 4th InternationalConference on. IET, 2011, pp. 1–6.
[26] I. Traore et al., “Biometric recognition based on free-text keystrokedynamics,” Cybernetics, IEEE Transactions on, vol. 44, no. 4, pp. 458–472, 2014.
[27] P. Zimmermann, S. Guttormsen, B. Danuser, and P. Gomez, “Affectivecomputing – a rationale for measuring mood with mouse and keyboard,”International journal of occupational safety and ergonomics, vol. 9,no. 4, pp. 539–551, 2003.
[28] C. Epp, M. Lippold, and R. L. Mandryk, “Identifying emotional statesusing keystroke dynamics,” in Proceedings of the SIGCHI Conferenceon Human Factors in Computing Systems. ACM, 2011, pp. 715–724.
[29] G. Tsoulouhas, D. Georgiou, and A. Karakos, “Detection of learner’saffective state based on mouse movements,” J. Comput, vol. 3, pp. 9–18,2011.
[30] R. LiKamWa, Y. Liu, N. D. Lane, and L. Zhong, “Can your smartphoneinfer your mood,” in PhoneSense workshop, 2011, pp. 1–5.
[31] ——, “Moodscope: Building a mood sensor from smartphone usagepatterns,” in Proceeding of the 11th annual international conference onMobile systems, applications, and services. ACM, 2013, pp. 389–402.
[32] H. Lee, Y. S. Choi, S. Lee, and I. Park, “Towards unobtrusive emotionrecognition for affective social communication,” in Consumer Communi-cations and Networking Conference (CCNC), 2012 IEEE. IEEE, 2012,pp. 260–264.
[33] W.-H. Tsui, P. Lee, and T.-C. Hsiao, “The effect of emotion onkeystroke: an experimental study using facial feedback hypothesis,”in Engineering in Medicine and Biology Society (EMBC), 2013 35thAnnual International Conference of the IEEE. IEEE, 2013, pp. 2870–2873.
[34] A. N. H. Nahin, J. M. Alam, H. Mahmud, and K. Hasan, “Identifyingemotion by keystroke dynamics and text pattern analysis,” Behaviour &Information Technology, vol. 33, no. 9, pp. 987–996, 2014.
[35] A. Hernandez-Aguila, M. Garcia-Valdez, and A. Mancilla, “Affectivestates in software programming: Classification of individuals based ontheir keystroke and mouse dynamics,” Intelligent Learning Environ-ments, p. 27, 2014.
[36] P.-M. Lee, W.-H. Tsui, and T.-C. Hsiao, “The influence of emotion onkeyboard typing: an experimental study using visual stimuli,” Biomedicalengineering online, vol. 13, no. 1, p. 81, 2014.
[37] P. Shukla and R. Solanki, “Web based keystroke dynamics application for identifying emotional state,” International Journal of Advanced Research in Computer Science and Communication Engineering, vol. 2, no. 11, pp. 4489–4493, 2013.
[38] K. Bakhtiyari and H. Husain, “Fuzzy model in human emotions recog-nition,” arXiv preprint arXiv:1407.1474, 2014.
[39] ——, “Fuzzy model of dominance emotions in affective computing,”Neural Computing and Applications, vol. 25, no. 6, pp. 1467–1477,2014.
[40] K. Bakhtiyari, M. Taghavi, and H. Husain, “Implementation ofemotional-aware computer systems using typical input devices,” inIntelligent Information and Database Systems. Springer, 2014, pp.364–374.
[41] A. Kolakowska, “A review of emotion recognition methods based onkeystroke dynamics and mouse movements,” in Human System Interac-tion (HSI), 2013 The 6th International Conference on. IEEE, 2013,pp. 548–555.
[42] ——, “Recognizing emotions on the basis of keystroke dynamics,” inHuman System Interactions (HSI), 2015 8th International Conferenceon. IEEE, 2015, pp. 291–297.
[43] A. Kołakowska, A. Landowska, M. Szwoch, W. Szwoch, and M. Wrobel,“Emotion recognition and its applications,” in Human-Computer SystemsInteraction: Backgrounds and Applications 3. Springer, 2014, pp. 51–62.
[44] H. Hoffmann, A. Scheck, T. Schuster, S. Walter, K. Limbrecht, H. C.Traue, and H. Kessler, “Mapping discrete emotions into the dimensionalspace: An empirical approach,” in Systems, Man, and Cybernetics(SMC), 2012 IEEE International Conference on. IEEE, 2012, pp. 3316–3320.
[45] L. A. Zadeh, “Soft computing and fuzzy logic,” IEEE software, vol. 11,no. 6, p. 48, 1994.
Contents
1 Introduction 1
1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Solution Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Parts of the Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Study Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Feature Extraction and Selection . . . . . . . . . . . . . . . . . . . . . 4
1.3.4 Model Building and Validation . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Work 6
2.1 Emotions Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Affect, Mood and Emotions . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Emotion Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Emotional Experimentation Environment . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Laboratory Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Naturalistic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.4 k-Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.5 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.6 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.7 Artificial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.8 k-Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.9 Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Keystroke Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Fixed Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Free Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.3 Keystroke Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Text Content Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Mouse Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Authentication and Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Emotion Recognition in Computer Systems . . . . . . . . . . . . . . . . . . . 25
3 Data Collection 32
3.1 Field Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.3 Privacy Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.4 Meantime Study Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.5 Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Participant Demographics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Installation, Automatic Updates and Heartbeat . . . . . . . . . . . . . 37
3.3.2 Capturing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.4 Server and Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Data Processing 48
4.1 Data Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Keystroke Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.2 Textual Content Features . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.3 Mouse Movement Features . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.4 Contextual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5 Model Building 56
5.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.1 General Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.2 Individual Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . 58
5.1.3 General Text Content Models . . . . . . . . . . . . . . . . . . . . . . . 58
5.1.4 Individual Text Content Models . . . . . . . . . . . . . . . . . . . . . 59
5.1.5 Fuzzy Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1 General Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2.2 Individual Dynamics Models . . . . . . . . . . . . . . . . . . . . . . . 67
5.2.3 General Text Content Models . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.4 Individual Text Content Models . . . . . . . . . . . . . . . . . . . . . 70
6 Discussion 74
6.1 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2 Contextual Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.4 Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Prediction Performance for PAD-dimensions . . . . . . . . . . . . . . . . . . . 78
6.6 Number of Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.7 Model Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.8 Clustered Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.9 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7 Conclusion 84
7.1 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3 Potential for Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A Concept Matrix of Related Work 94
B Participant Registration Form 98
C Participant Consent Form 99
D Software Download Page 100
E Model Hyperparameters 101
E.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
E.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Abbreviations
k-NN k-Nearest Neighbors. 12, 16, 24, 28, 95–97
ANEW Affective Norms for English Words. 22
ANN Artificial Neural Networks. 12, 95–97
ANOVA Analysis Of Variance. 28
API Application Programming Interface. 43
BN Bayesian Network. 12
CART Classification And Regression Tree. 13
CSV Comma-separated Values. 48
CWB Class Weight Balancing. 65, 68, 71, 73
D2D Down-to-down. 50, 51
D2U Down-to-up. 50, 51
DCS-LA Dynamic Classifier Selection using Local Accuracy. 24
DT Decision Tree. 12
EER Equal Error Rate. 24
EM Expectation Maximization. 19, 20
ESM Experience Sampling Methodology. 11
EVS Explained Variance Score. 56–58, 60, 74, 79, 102
FAR False Acceptance Rate. 24
FRR False Rejection Rate. 24
GMM Gaussian Mixture Model. 12
IAPS International Affective Picture System. 28, 97
IP Internet Protocol. 34
IQR Interquartile Range. 51
ISEAR International Survey on Emotion Antecedents and Reactions. 22, 27
JSON JavaScript Object Notation. 46, 81
MAE Mean Absolute Error. 56–60, 74, 78, 79, 102
MIP Mood Induction Procedure. 11
MSE Mean Squared Error. 56–58, 60, 74, 79, 102
NB Naive Bayes. 12
NSIS Nullsoft Scriptable Install System. 47
PA pleasure-arousal. 8, 9, 25, 85, 94, 96, 97
PAD pleasure-arousal-dominance. xiv, 4, 8–10, 33, 36, 43, 44, 56–59, 61–65, 67, 68, 70–73,
75, 76, 78, 84–86, 101
PANAS Positive And Negative Affect Scales. 8, 28, 85, 97
PHP Hypertext Preprocessor. 34, 45
RF Random Forest. 12
ROC AUC Receiver Operating Characteristic Area Under Curve. 63, 64, 67, 68, 70, 74,
78–80, 86, 103
SQL Structured Query Language. 46, 48
SS Subsampling. 65, 68, 71, 73
SVM Support Vector Machine. 4, 12, 16, 30, 56, 58, 70, 95, 97
TF-IDF Term Frequency - Inverse Document Frequency. 23, 53
TLS/SSL Transport Layer Security/Secure Sockets Layer. 34
U2D Up-to-down. 50–52
U2U Up-to-up. 50, 51
XML Extensible Markup Language. 41
Chapter 1
Introduction
1.1 Problem Definition
Nowadays, we live in an era in which one cannot imagine the absence of computers. They
support us daily in all kinds of branches of society. Recently, a lot of progress
has been made in the area of artificially intelligent systems and all sorts of applications of
machine learning. While the advantages of these techniques in everyday systems are countless
and undeniable, they are limited by their lack of understanding and incorporation of human
emotional context in different processes.
If a computer system would possess some form of emotional intelligence, it would have a
much broader view on contextual aspects during decision making. This context can be used
to dynamically adapt applications in order to enhance productivity, effectiveness and user-
friendliness. For example, an operating system could use emotional state information of its
user to better assist during interaction. Operating system developers are working very hard
to incorporate an artificially intelligent assistant that can be used by users to facilitate their
day-to-day activities. However, as these assistants require a certain amount of time to get to
know a user, there has to be some way to indicate whether or not the assistant is displaying
unwanted behaviour so that the assistant can improve itself over time using this information.
The operating system could also assist during text communication (such as e-mails and instant
messaging) to avoid ambiguity and misunderstandings between correspondents. It could use
both the content and the emotional tone of the message to enhance the structure and content
of the message for emotion expression. Monitoring of the emotional state could also have its
application in health care. For example a system could detect whether a user is close to a
burn-out or is in a depressed state for a long period of time and take appropriate actions.
Furthermore, mission-critical systems could detect when a user is in a fatigued, stressed or
distracted state and avoid catastrophic mistakes.
These are all situations in which a system takes actions as a response to certain emotions.
However, a computer system could also try to infer the cause of certain emotions through
different situation variables and respond to the emotion by adapting these variables. For
example, while playing a video game, when the system detects that the player is bored or
frustrated it can alter the course of the game to make it more challenging or relaxing (much
like feedback loops in control systems). This indicates that it is not only possible for an
emotionally intelligent system to observe emotional states but influence them as well. This
is an important thought as it means that such a system considerably approaches human
emotional capabilities. Hence, it cannot be ruled out that it is possible for
a computer system to learn and simulate emotions of its own in response to observation of
human emotions. This is studied in affective computing [53]. It is thus clear that emotion
awareness is a crucial component of any system that can justifiably be called truly artificially
intelligent. Thinking of such systems and machines, one can ask oneself a rather philosophical
question: What separates humans from such truly artificially intelligent machines?
To realize any form of emotional intelligence discussed here, different requirements have to
be met. First of all, there needs to be a method that computer systems can use to observe
affective¹ states in users. During the last few years, research regarding such methods has
increased. Unfortunately, the solutions that have been found are often limited by different
factors. A solution is desired to be non-intrusive for the user. This means that the solution
should not invade the personal space of the user or become too noticeably involved in the
person’s life without being invited. The non-intrusiveness of the solution is important because
when a user knows that he is being observed, it is possible that he – whether deliberately or
not – will change his affective state. This is of course undesirable. Furthermore, a minimal
number of required devices is desirable, because real-world applications are unlikely to make
use of various kinds of extra sensor hardware. Often, such specialized equipment can be very
expensive because it is not present in home or office environments by default or because it is
medical equipment. This again limits the real-world application possibilities of the solution.
Besides these requirements, some other aspects of a solution need to be taken into account
to judge its applicability in a real-world environment. For example, it is typically desired for
a solution to improve itself over time. When two humans initially get to know each other,
it will be harder for one person to judge the affective state of the other person compared
to when these two persons have known each other for a longer period of time. The same
principle can be expected in a solution for emotion recognition systems. When a system
initially observes a user, it will only have basic emotion recognition capabilities. But as the
observation advances in time, the system can gradually adapt itself to the user and be able
to increase its recognition accuracy.
¹ The term affection can be defined as the whole of emotions and moods that a person experiences.
1.2 Solution Proposal
Taking the above requirements into account, a system that can automatically infer affec-
tive state by analyzing the user’s typing behaviour and mouse movements as well as some
contextual information can be a very promising solution.
Analysis of keystroke dynamics has been proposed as a solution for improved security in
authentication systems. Here, the goal is to identify a user based on his typing behaviour
so that a compromised password does not imply a compromised system anymore. However,
Monrose and Rubin [51] observed that a user’s typing rhythm changed from time to time.
This was attributed to a changing affective state, implying that the affective state can be
derived from keystroke dynamics.
The use of mouse movement analysis is encouraged by Zimmermann [69], who observed that
affect impacts motor-behaviour of computer users. It has also been used as a way to improve
authentication systems just like keystroke dynamics [54].
Automatic emotion recognition using keystroke dynamics and mouse movements meets several
of the requirements set above, such as non-intrusiveness and limited use of extra
sensor hardware. The logging of keystrokes and mouse movements can be done using a specific
piece of software running in the background on the computer making it almost undetectable
by the average user. Thus, the affective state is minimally influenced by the data collection.
Mouse and keyboard are also standard equipment in normal home and office environments and
are very inexpensive. This gives many possibilities for the application of affective computing
solutions using keystroke dynamics and mouse movements in a real-world environment on a
large scale.
1.3 Parts of the Solution
The development of the proposed recognition system can be split into different parts. First,
a lot of data from different users needs to be collected and labeled with accurate emotional
information. Next, relevant features need to be extracted from the data. All of these features
need to be evaluated and the best features need to be selected to reduce the dimensionality of
the feature space and avoid sparsity of the dataset. Using the selected set of features, models
have to be built and validated to obtain a powerful model that can predict a user’s emotional
state with high accuracy.
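Under the assumption of a scikit-learn-style toolchain (the text does not prescribe one, and the data below is a placeholder), the feature selection and model building stages can be sketched as:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))       # extracted features (placeholder)
y = rng.integers(0, 2, size=200)     # emotion labels (placeholder)

# Feature selection reduces dimensionality before the classifier sees the data.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", SVC()),
])
model.fit(X, y)
print(model.named_steps["select"].get_support().sum())
```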
1.3.1 Study Environment
To obtain a dataset containing keystroke information and mouse movements during different
emotional states, one needs to make sure that participants experience these emotions. Different
techniques exist to induce particular emotional states. When such a state is induced, the
participant can be asked to perform some actions on a computer that require using the mouse
and keyboard. The mouse and keyboard can then be monitored and information can be
collected. This technique is used in many studies, but it is very intrusive. Since the focus
of this research is on developing a system that can be used in a real-world environment, such
intrusiveness needs to be avoided.
Another technique, which is used in this research, is called experience sampling [29]. Using
experience sampling, participants are recorded during their daily activities. The intention is
to record emotions while in the moment instead of retrospectively (at a later time or another
place). To apply this technique, software was installed on the participants’ computers. The
software ran as a background process on the computer and was not noticeable except for a
small tray icon. The participants were free to use their computers as usual and once in a
while the software would request them to fill out a small emotional state questionnaire. This
questionnaire required the participants to indicate their emotional state using the pleasure-
arousal-dominance (PAD) emotion model.
1.3.2 Data Collection
As mentioned before, data collection was done using a piece of custom developed software that
ran as a background process when the computer was in use. The software recorded keystrokes,
mouse movements and location information, regardless of the program that was being used
at that moment. Thus, the data collection process was barely noticeable for the participants.
Once the software determined that enough data was collected, the user was asked to fill out
an emotional questionnaire. All collected data and the answers to the questionnaire were then
submitted to a server. A number of settings and features had to be taken into account in the
software to make sure that the answers to the questionnaire had a good correspondence with
the collected data and to make sure that no data was lost or submitted multiple times.
1.3.3 Feature Extraction and Selection
After the field study was completed, all data was collected from the server and needed to be
processed. Different features were extracted from the raw keystroke, raw mouse and raw
location data. Once a complete set of features was obtained, a selection was made to
reduce the dimensionality and facilitate the machine learning process.
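As a concrete illustration of this step, two classic keystroke features, dwell time (key press to release) and flight time (release to next press), can be computed from timestamped key events. The event format and the values below are hypothetical and only sketch the idea, not the actual log format of this work:

```python
# Sketch of keystroke feature extraction: dwell time (press to release)
# and flight time (release to next press). Event tuples and timestamps
# are invented for illustration.
def keystroke_features(events):
    """events: list of (key, down_ms, up_ms) tuples, in typing order."""
    dwell = [up - down for _, down, up in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    mean = lambda xs: sum(xs) / len(xs)
    return {"mean_dwell": mean(dwell), "mean_flight": mean(flight)}

events = [("h", 0, 95), ("i", 180, 260)]
print(keystroke_features(events))  # {'mean_dwell': 87.5, 'mean_flight': 85.0}
```

Many more statistics (variances, digraph latencies, error-key rates) can be derived in the same way from the raw event stream.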
1.3.4 Model Building and Validation
To build a model that can predict a user’s emotional state with high accuracy, the use of Ran-
dom Forest regression models, Random Forest classification models, SVM regression models
and SVM classification models was examined. Two main types of models were built: general
models and individual models. A general model entails one model that can predict the emo-
tional state regardless of the user. Such a model would be of great value if it can make highly
accurate predictions. An individual model will build one model for each user using only data
from this user. This concept is based on the assumption that every person is unique and thus
will behave uniquely in different emotional states. To evaluate the models that were built,
10-fold cross-validation was used on the dataset.
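The fold construction behind 10-fold cross-validation can be sketched as follows; this is a minimal interleaved splitter for illustration, not the evaluation code actually used in this work:

```python
# Minimal sketch of 10-fold cross-validation index splitting: every sample
# serves as test data exactly once across the ten folds.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(20, k=10))
# every sample is used for testing exactly once across the 10 folds
assert sorted(i for _, test in splits for i in test) == list(range(20))
```

Each model is then trained on the nine training folds, evaluated on the held-out fold, and the ten scores are averaged.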
1.4 Thesis Outline
Chapter 2 presents a summary of related literature that forms the basis for this research.
Topics that will be discussed are research in emotion assessment, emotion recognition tech-
nology and different environments for emotional experimentation. Next, research in keystroke
dynamics and mouse movements is discussed for both emotion recognition and authentication
purposes.
Chapter 3 presents the data collection process in detail. Both the application of the experience
sampling methodology and the software that was developed are discussed in depth. Different
settings and features that have been used in the software are explained.
Chapter 4 discusses the entire flow from raw data processing to feature extraction and feature
selection.
Chapter 5 describes the different models that were built. This includes the different machine
learning techniques that were used and the measures that have been taken to avoid overfitting
and bias. Finally, all results of the analysis are presented.
Chapter 6 evaluates the outcomes of the results from Chapter 5 and draws different conclusions
from these results.
Chapter 7 finalizes this research by giving a summary, assessing the possibilities for real-
world application of this research, discussing the lessons that were learned and identifying
the contributions that were made as well as the possibilities for future work.
Chapter 2
Related Work
In this chapter, the literature that forms the basis for this research is discussed. First, some
common terms in the field of emotion assessment are defined and different techniques that are
used in the research on affect and emotions are explained. Next, some remarks on the use
of keystroke dynamics and mouse movements for authentication and security purposes are
presented. The chapter finishes with a discussion of previous research on keystroke dynamics
and mouse movements and how it is put into the context of emotion recognition research.
2.1 Emotions Theory
2.1.1 Affect, Mood and Emotions
Before discussing techniques to assess emotions, one should have a good understanding of what
is actually meant by the words "affect", "mood" and "emotion". Although defining these terms
seems easy, there is little general agreement on their actual definitions. In this research the
definition given by Forgas [26] will be followed. Affect is used as a general term to refer to
the combination of moods and emotions. Moods have a rather low intensity, a long duration
and little cognitive content. Most of the time the subtlety of moods causes persons not to
realize they are experiencing them until it is brought to their attention.
As opposed to moods, emotions are much more intense, of a shorter duration and more clear
concerning cognitive content. Emotions can be traced back to a particular cause much easier
and the individual is aware of the presence of an emotion.
2.1.2 Emotion Modelling
To be able to analyze the emotional state of an individual, emotions need to be made mea-
surable. A lot of psychological research has been performed which resulted in a number of
different models to explain human emotional behaviour. Five different models that have been
developed will be discussed: categorical, appraisal, dimensional, circuit and componential
models.
Categorical Model
The categorical or discrete model is based on how emotions are described through language.
When explaining emotions, we typically attach specific labels to different emotional experi-
ences. Typical examples are: ”I feel stressed” and ”I feel happy”.
Following this approach, Ekman presented six basic emotions [22]: anger, surprise, happiness,
disgust, sadness and fear. He performed research on emotional expressions of the face and how people in
different cultural environments recognize facial expressions. He found that the six proposed
basic emotions were commonly recognizable in most cultures. The six basic emotions were
later expanded to 15 basic emotions [21]: amusement, anger, contempt, contentment, disgust,
embarrassment, excitement, fear, guilt, pride in achievement, relief, sadness, satisfaction,
sensory pleasure and shame.
Determining which emotions are so-called "basic" emotions is difficult. Emotions can be
primary, but they can also be combinations of primary emotions. Defining the characteristics of
such primary emotions is thus of critical importance in the categorical model. There is no real
consensus in research literature when it comes to these characteristics and as a consequence,
a number of different sets of categories have been proposed as alternatives for the six (or 15)
basic emotions of Ekman [34].
The fact that emotions are described through language immediately poses an important
problem that needs to be taken into account. As language differs in different parts of the world
this means that emotions will also be described using different languages. Furthermore, it is
possible that one language contains words to describe an emotional experience that are absent
in another language and vice versa. This can be due to the fact that a word can describe
a more specific feeling. An emotion that is described by one word in a language could thus
have multiple more specific words in another language. The categorization of feelings and
emotions is thus strongly influenced by language and culture.
The discrete model clearly has a number of limitations but nonetheless it is still widely used
in psychology and affective computing research. The reason for the popularity of this model
is its simplicity. Classifying emotions using a set of labels is much easier than using one of
the models that will be discussed next. Almost all existing datasets contain language-defined
class labels. Although the discrete model is very popular in affective computing research, it
is difficult to compare different research results because very often, a different set of emotion
classes is used.
Appraisal Model
The appraisal model states that emotions are caused by the dynamic evaluation or appraisal
of events, situations and the environment in general [55]. The relationship of an individual
with its environment is assessed against a number of criteria. Emotions are then felt based
on these assessments. This means that emotions are modelled using the underlying cognitive
processes that precede them. For example, a hostile environment will cause a person to
evaluate the situation as being dangerous and therefore the person will experience an anxious
feeling based on this assessment. The appraisal model accounts for several phenomena that
cause problems for other models. For example, it can account for individual variances in
emotional reaction to the same situation. While the appraisal model can indeed account
for several phenomena, there also exist some issues. For example, it may be possible that
appraisal not only causes emotions but that emotions also cause appraisal. Other research [33]
indicates that emotions may also be caused by processes other than appraisals implying that
appraisals are not necessary causes of emotions.
Dimensional Model
Another very popular model is the dimensional model (also called the circumplex model),
presented by Russell [56]. This model defines emotional states using points in a continuous
dimensional space. It was suggested that continuous space models perform better in out-
of-lab application than discrete models [28]. The used dimensional space can be either uni-
dimensional or multi-dimensional. The PANAS (Positive And Negative Affect Scales) model
is a popular uni-dimensional model [65]. One clear disadvantage of the PANAS model is that
it represents a mixture of emotions, moods and affect.
The PAD (pleasure-arousal-dominance) model is a three-dimensional model that was devel-
oped by Mehrabian and Russell [48] and will also be used in this research. The pleasure
dimension indicates the valence measure of an emotional state. The arousal dimension indi-
cates the level of affective activation of an emotional state. The dominance dimension is used
to indicate the amount of power or control a person experiences in an emotional state. Often,
the PA model is used which is a simpler version of the PAD model leaving out the dominance
dimension. However, this simpler PA model has been criticized [35] because it might not
be possible to fully differentiate between several emotions. Furthermore, it was found after
analyzing data from Bradley & Lang’s experiment [40] that emotional data scattered like a
V-shape and showed some clear holes in the PA space.
Figure 2.1: PAD emotion model
Figure 2.2: PA emotion model
Recently, Lovheim [46] proposed a new three-dimensional emotion model that uses the
monoamine neurotransmitters serotonin, dopamine and noradrenaline as dimensions instead of
pleasure, arousal and dominance. Serotonin is closely related to obsession and compulsion,
noradrenaline is related to alertness and concentration and dopamine is related to motivation.
This way, these neurotransmitters are closely related to human emotion and measuring them
and plotting them into this model could yield an almost direct representation of human emo-
tions.
Figure 2.3: Lovheim emotion model [8]
While the dominance dimension in the PAD model may not correspond to an actual
physiological system, it does provide the possibility to differentiate among emotions that have
similar pleasure and arousal values (e.g. anger and fear) [19].
Circuit Model
LeDoux [41] proposed an alternative emotion model that states that emotions are caused by
different neural circuits in the brain. These circuits are determined by evolution. Neuropsy-
chologists have found several so-called survival circuits that cause primitive emotions such as
fear. This way, the circuit model can be related to the categorical model when the activated
circuits are used as labels for the corresponding emotions. The circuit model is less known
and less used than the other models outlined above. It is still limited to explain only primitive
emotions. At the same time however, this model has promising properties because unlike the
previously discussed models, this model is based on objective observations. Activation of
neural circuits can be monitored using biomedical imaging techniques.
2.2 Emotional Experimentation Environment
Different approaches can be chosen to collect the emotion data that is needed. An important
choice is the environment in which the experiment is organized. Two main approaches are
either a laboratory setting or a more naturalistic setting. In this section, both approaches
and their advantages are discussed in detail.
2.2.1 Laboratory Setting
In a laboratory setting, moods are induced in the participants and then the desired data is
collected and studied. To induce moods in a participant, one needs a mood induction proce-
dure (MIP). This is an experimental technique to establish a particular mood in a subject.
Westermann et al. [66] listed nine categories of MIPs: imagination, Velten, film/story, music,
feedback, social interaction, gift, facial expression and combined MIPs. The film/story
technique and the facial expression technique are discussed next in more detail.
The film/story MIP presents some narrative or descriptive material to the participants to
stimulate their imagination. The participants may identify themselves with certain protago-
nists. The material can be either an elaborate story, a short scene from a film or a description
of scenarios. The material is explicitly selected according to the desired mood to be induced.
Furthermore, this MIP is employed either with or without explicit instruction. When using
explicit instructions, the participant is asked to imagine how it feels to be involved in the
presented situation. This type of MIP is one of the most effective techniques for inducing
both positive and negative moods into participants.
The facial expression MIP is based on the facial feedback hypothesis, proposed by Laird [39].
The expression of the participant’s face is manipulated in order to induce a certain mood.
The participants are instructed on how to contract and relax different facial muscles in order
to produce a frown, a smile, etc. Very often, extra material like a pen is used to enforce the
muscle positioning. The facial expression MIP was found to have a success rate of 50%.
Using MIPs has the advantage of being able to control the moods of the participants. However,
experiments take a considerably long time depending on the technique that is used and it is
not always guaranteed that the induction of the desired mood has succeeded. Furthermore,
individuals may react differently in a laboratory setting compared to a more naturalistic
setting because the experimental situation may influence the individual. Participants may
guess what type of mood the experiment is designed to elicit and adjust their reaction towards
that mood to please the experimenter [47]. This may cause the results obtained by these
experiments not to be useful in real-world environments.
2.2.2 Naturalistic Setting
A naturalistic experiment aims to observe participants in their natural environment so as to
have minimal influence on their behaviour. A self-report recall survey is one of the possible
approaches that can be taken. This requires the participants to record their experiences after
they have occurred. However, participants very often suffer from recall issues, which results
in possibly inaccurate recordings.
Another approach is the experience sampling methodology (ESM) [29]. In this technique a
participant’s experiences are recorded as they occur at certain moments in time. This is often
done using some kind of notification that requests the participant to provide responses to
questionnaires at these moments. The experience sampling methodology has the advantage
that it can capture daily life from moment to moment without the problem of recall issues.
It is much easier for participants to indicate their experiences at the moment that they are
experiencing them. However, a high sample frequency causes the participant to be requested
to provide responses very often and could be burdensome and lead to selective non-compliance.
A main disadvantage of using a naturalistic setting is that all techniques provide the
experimenter with subjective information. This could lead to biased information: individuals
may repress certain information or change their responses to fit the norms of their culture.
Despite these disadvantages, the experience sampling methodology will be used in this research
because the obtained results are much more useful for real-world application; a laboratory
setting is therefore not suitable for this research.
2.3 Machine Learning
From the gathered data, a set of features is selected. In the next step, the emotions are
automatically identified by applying machine learning to these features. Machine learning is
a branch of artificial intelligence that evolved from pattern recognition and computer learning
theory [9]. It studies the construction of algorithms that offer the ability for computers to
learn without explicitly being programmed. It takes data as input and learns from this data
to make predictions.
A clear distinction is made between three main categories: supervised, unsupervised and
reinforcement learning algorithms. Algorithms in the first category are also commonly called
classification or regression algorithms for discrete or continuous data respectively. They take
labeled data as input, which means that the input data exists of pairs of input vectors and
corresponding output vectors. The goal is to build a model that explains the correspondence
between these input and output vectors. Based on this model, the algorithm can then predict
outputs for new unseen input data. Based on previous research in affect recognition (see
Appendix A), the supervised learning algorithms discussed here are: decision tree (DT),
Bayesian network (BN), naive Bayes (NB), k-nearest neighbors (k-NN), support vector machine
(SVM), random forest (RF) and artificial neural networks (ANN).
The second category contains clustering algorithms and dimensionality reduction algorithms.
They take unlabeled data as input and their goal is to find an underlying structure
in this data. Unsupervised learning is mainly used in pattern recognition and computer
vision. The following well-known unsupervised learning algorithms used for clustering will be
discussed here too: k-means clustering and Gaussian mixture model (GMM).
The last category, reinforcement learning, works a little differently. According to the theory
of operant conditioning in psychology, a human can learn by being rewarded for taking
certain desirable actions in specific environments and being punished for taking undesirable
actions. This concept is also used in reinforcement learning: the agent tries to find a sequence
of actions that leads to the greatest accumulated reward. This technique is not discussed here
as it is not applicable to the domain.
2.3.1 Decision Tree
The decision tree is one of the oldest machine learning techniques and is relatively simple to
understand. It tries to find appropriate split values for the different features under observation,
such that an optimal set of splits is determined to explain the output vectors. The advantage
of decision trees is that the solution is easy to understand, as it is represented by a tree in
which each node represents a feature split and each edge represents a decision based on this
split (see Figure 2.4). Thus, each data point ends up in one of the leaf nodes of the decision
tree, and output vectors for new data points are predicted by passing these new data points
through the tree to one of the leaf nodes. The disadvantage of decision trees is that they are
relatively slow to train and risk overfitting (when too many splits are made). A solution for
the overfitting problem is pruning the tree. Specific decision tree algorithms are Classification
And Regression Tree (CART), ID3 and C4.5.
Figure 2.4: Decision tree: (a) example of a decision tree; (b) graphical representation of the solution generated by the decision tree [3]
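The split search described above can be sketched for a single feature using the Gini impurity criterion employed by CART; the feature values and labels below are invented for illustration:

```python
# Sketch: choosing a single-feature split threshold by minimising the
# weighted Gini impurity of the child nodes, as a CART-style tree does
# at each node. Values and labels are illustrative, not thesis data.
def gini(labels):
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Return the threshold that minimises weighted child impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical mean dwell times (ms) with a made-up binary label:
xs = [80, 85, 90, 140, 150, 160]
ys = ["low", "low", "low", "high", "high", "high"]
print(best_split(xs, ys))  # -> 90: this threshold separates the classes perfectly
```

In a full tree this search is repeated recursively on each child node until a stopping (or pruning) criterion is met.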
2.3.2 Bayesian Network
A Bayesian network is a probabilistic graphical model that uses a directed acyclic graph in
which the nodes represent random variables and edges represent conditional dependencies.
Each node has a probability function that takes a set of values for its parent variables as input
and gives the probability of the variable represented by the node as output. For discrete parent
variables, this function can be stored as a table (see Figure 2.5).
Figure 2.5: Bayesian network [2]
The main advantage of Bayesian networks is that they provide a way to make use of the
conditional independencies of the network to save a lot of storage space and calculation time.
For example, in Figure 2.5, the variable S does not depend on R and W. Thus, instead of
storing the full table P(S | C, R, W), which would contain 2^4 = 16 values, we can just store
P(S | C), which contains only 2^2 = 4 values.
In the simplest case, the Bayesian network is created based on the knowledge about relation-
ships between different variables. Otherwise, techniques exist to learn the network structure
based on data. Also, the parameter values need to be available to know how exactly variables
influence each other. When not available, techniques exist to learn these parameter values.
Bayesian networks can then be used to infer information about unobserved variables based
on observed variables.
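The storage saving from the conditional independencies mentioned above can be made concrete for Figure 2.5: counting full-table entries over binary variables, the factor P(S | C) needs only four entries (the probability values themselves are invented for illustration):

```python
# Sketch of the storage saving from conditional independence in Fig. 2.5:
# with binary variables, a full table over S and its three would-be parents
# C, R, W holds 2**4 = 16 entries, while P(S | C) holds only 2**2 = 4.
# The probabilities below are made up for illustration.
p_s_given_c = {                       # keys: (S, C)
    (True, True): 0.1, (False, True): 0.9,
    (True, False): 0.5, (False, False): 0.5,
}
full_entries = 2 ** 4                 # S plus three parents C, R, W
reduced_entries = len(p_s_given_c)
print(full_entries, reduced_entries)  # 16 4
```

For networks with many variables, exploiting such independencies is what makes storage and inference tractable at all.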
2.3.3 Naive Bayes
A classifier gets a vector of feature values as input and yields the corresponding output based
on the training data. To work in a more probabilistic context, a classifier can yield the
probabilities for each possible output instead of just the most likely output. In this case,
the goal is to calculate P(Ck | x1, x2, ..., xn) for each class Ck, where k indexes the possible
outputs and n is the number of features. Using Bayes' theorem, this can be rewritten as
P(Ck, x1, x2, ..., xn) / P(x1, x2, ..., xn). The denominator does not depend on Ck and thus,
since the feature values are given, is constant. Using the chain rule for conditional probability,
the numerator can be rewritten as P(Ck) P(x1 | Ck) P(x2 | Ck, x1) ... P(xn | Ck, x1, x2, ..., xn−1).
It is clear that for large n or many possible feature values, this calculation becomes very
difficult.
Naive Bayes makes the naive assumption that all features are conditionally independent of
each other given the output variable. This means that the naive Bayes technique takes every
feature into account as having an effect on the output variable but not on any other feature.
This can be illustrated with a simple example. Consider a medical cancer study with two
features: smoker and pneumonia. Both of these features can be indicators for the presence of
the output variable cancer. We know that someone who smokes is more likely to have
pneumonia. If we want to know the probability of a person having cancer, we should calculate
P(C | S, P) and thus, more specifically, P(C) P(S | C) P(P | C, S). However, naive Bayes
makes the assumption that smoking does not influence pneumonia and thus rewrites this as
P(C) P(S | C) P(P | C).
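With invented numbers, this factorisation can be evaluated directly (all probabilities below are hypothetical and chosen only to illustrate the computation):

```python
# Numeric illustration of the naive Bayes factorisation from the cancer
# example above; every probability here is invented for illustration.
p_c = 0.01                  # P(cancer)
p_s_given_c = 0.60          # P(smoker | cancer)
p_p_given_c = 0.20          # P(pneumonia | cancer)

# The full chain rule would need P(pneumonia | cancer, smoker); naive
# Bayes drops the smoker dependency and uses P(pneumonia | cancer):
joint_naive = p_c * p_s_given_c * p_p_given_c
print(round(joint_naive, 5))  # 0.0012
```

Dividing such joint terms by their sum over all classes yields the class posteriors, since the denominator is the same constant for each class.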
The naive assumption of conditionally independent features makes the calculations simpler
and thus the algorithms much quicker. Despite the apparently oversimplified model, naive
Bayes works quite well in many complex situations. However, it is still outperformed by some
other techniques.
2.3.4 k-Nearest Neighbors
The k-Nearest Neighbors algorithm saves all training data, i.e. all feature vectors and their
corresponding outputs. When a new instance is presented for which a prediction should
be made, the algorithm calculates the distances from this new instance to all training data
points and selects the k nearest points. Then, for classification, a majority voting is calculated
over these k nearest points to obtain a class prediction for the new instance. For regression,
the average is calculated over these k nearest points.
Figure 2.6: k-Nearest Neighbors: for k=3, the prediction for the new instance is class B; for k=6,
the prediction is class A [6]
The algorithm has no parameters except for the value of k. However, choosing the value of k
is important, as illustrated in Figure 2.6. A k-value that is too high is not sensitive enough to
detail, while a value that is too low is overly sensitive to the local neighbourhood of the new
instance. Furthermore, instead of just taking the k nearest training data points, one can also
use a weight function that gives data points closer to the new instance more influence than
points that are far away. Combining this with a higher value for k can compensate for the
disadvantages discussed earlier.
During training, the algorithm only saves the training data points but does not perform
any calculation. All calculations are performed during classification. Therefore, the k -NN
algorithm is called an instance-based or lazy learning algorithm.
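The distance computation and majority vote can be sketched in a few lines; the training points are invented so as to reproduce the behaviour of Figure 2.6 (class B at k = 3, class A at k = 6):

```python
# Pure-Python sketch of k-NN majority voting. The points are invented to
# reproduce the behaviour shown in Figure 2.6 (k=3 gives B, k=6 gives A).
from collections import Counter

def knn_predict(train, query, k):
    """train: list of ((x, y), label); query: an (x, y) point."""
    # squared Euclidean distance (same ordering as Euclidean distance)
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 0), "B"), ((0, 1), "B"),
         ((1, 1), "A"), ((1.5, 0), "A"), ((0, 1.5), "A"), ((2, 0), "A")]
print(knn_predict(train, (0, 0), k=3))  # B  (2 of the 3 nearest are B)
print(knn_predict(train, (0, 0), k=6))  # A  (4 of the 6 nearest are A)
```

The weighted variant mentioned above would replace the plain vote count with a sum of weights that decrease with distance.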
2.3.5 Support Vector Machine
The Support Vector Machine algorithm analyzes multi-dimensional training vectors that con-
tain feature values and tries to construct a set of hyperplanes that can be used for classification
of the vectors. A good set of hyperplanes is achieved when the hyperplanes separate the data
of different classes (each class on its own side of the hyperplanes) with a maximal margin, so
that the generalization error is minimal. This principle is illustrated in Figure 2.7.
Figure 2.7: Support Vector Machine [12]
The general SVM concept as explained until now is very interesting but cannot handle data
that is not linearly separable. To cope with this problem, there exists a technique called the
kernel trick. This technique transforms the data to a higher dimensional space where the
data might be linearly separable (see Figure 2.8a). Applying the normal SVM concept in this
space then yields a set of hyperplanes that can be transformed back to the original space (see
Figure 2.8b). Thus, using the kernel trick it is possible to learn a nonlinear SVM while still
using the linear formulation of SVM.
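The intuition behind the kernel trick can be shown with an explicit feature map on one-dimensional data; the points and the separating threshold below are invented for illustration (a real SVM never computes the map explicitly, which is precisely the point of the trick):

```python
# Sketch of the idea behind the kernel trick: 1-D data that is not
# linearly separable becomes separable after mapping x -> (x, x**2),
# the explicit feature map of a simple polynomial kernel.
# Points and the separating threshold are invented for illustration.
inner = [-0.5, 0.0, 0.5]         # class 0: clustered around the origin
outer = [-2.0, -1.5, 1.5, 2.0]   # class 1: on both sides of class 0

phi = lambda x: (x, x * x)       # explicit feature map

# No single threshold on x separates the classes, but in the (x, x**2)
# plane the horizontal line x**2 = 1 does:
assert all(phi(x)[1] < 1 for x in inner)
assert all(phi(x)[1] > 1 for x in outer)
```

Kernels such as the RBF kernel correspond to far higher-dimensional (even infinite-dimensional) maps, yet only inner products between mapped points are ever needed.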
2.3.6 Random Forest
Random Forest can be considered an ensemble algorithm. It uses many decision trees for
training and combines their outputs. More specifically, it divides the training data into
many subsets. To construct each subset, a number of samples are randomly selected with
replacement. This technique is called tree bagging and is also used by Random Decision Trees.
Furthermore, during the learning process of each tree, a random subset of features is selected
at each candidate split. The outputs of all trees are then combined using a majority vote
and presented as the output of the Random Forest algorithm. This algorithm is illustrated in
Figure 2.9. It usually has a higher accuracy and robustness than its individual constituent
classifiers, can be highly parallelized and can be used for very large datasets and a large
number of features. Random Forests correct the tendency of decision trees to overfit and can
be used to estimate relative feature importances.
Figure 2.9: Random Forest [11]
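The two ingredients named above, bootstrap sampling and majority voting, can be sketched with trivial one-threshold "stumps" standing in for full decision trees; the data and the stump training rule are invented to keep the example short:

```python
# Sketch of two Random Forest ingredients: bootstrap sampling (bagging)
# and majority voting. Full trees are replaced by trivial one-threshold
# stumps, and the data and training rule are invented for illustration.
import random

def bootstrap(data, rng):
    return [rng.choice(data) for _ in data]   # sample with replacement

def stump_predict(threshold, x):
    return "high" if x > threshold else "low"

data = [(80, "low"), (90, "low"), (100, "low"),
        (140, "high"), (150, "high"), (160, "high")]

rng = random.Random(0)
# "Train" each stump on its own bootstrap sample: threshold = sample mean.
thresholds = [sum(x for x, _ in bootstrap(data, rng)) / len(data)
              for _ in range(25)]

def forest_predict(x):
    votes = [stump_predict(t, x) for t in thresholds]
    return max(set(votes), key=votes.count)   # majority vote

print(forest_predict(85), forest_predict(155))  # low high
```

A real Random Forest additionally draws a random feature subset at every split, which decorrelates the trees and is what makes the averaging effective.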
2.3.7 Artificial Neural Network
Artificial neural networks are a family of algorithms that try to estimate or approximate
functions that are usually unknown and often depend on a large number of inputs. They
are inspired by biological neural networks in the brain and try to mimic their functionality.
An artificial neural network consists of interconnected neurons that communicate through
these connections. A network has a number of input neurons and one or more output neurons.
In between, there can be so-called hidden nodes; a layer of hidden nodes is called a hidden
layer. This structure is illustrated in Figure 2.10a. Each neuron thus has a number of inputs
and an output. The inputs of each neuron can be weighted based on knowledge and
experience. The output of each neuron is calculated by a function using the
inputs of the neuron. Different types of neuron functions exist, but one that is used very often
is the sigmoid function. A neuron is illustrated in Figure 2.10b.
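A single sigmoid neuron as just described computes a weighted sum of its inputs plus a bias and passes it through the sigmoid; the weights and inputs below are illustrative:

```python
# A single sigmoid neuron: weighted inputs summed with a bias, then
# passed through the sigmoid activation. Weights, inputs and bias are
# illustrative values, not learned ones.
import math

def neuron(inputs, weights, bias):
    s = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-s))   # sigmoid activation

out = neuron([1.0, 0.5], [0.8, -0.4], bias=0.1)  # s = 0.8 - 0.2 + 0.1 = 0.7
print(round(out, 3))  # 0.668
```

A full network chains many such neurons layer by layer, and training consists of adjusting the weights and biases to reduce the prediction error.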
An artificial neural network is very interesting because of its flexibility. It can deal with
high-dimensional data of both continuous and discrete nature and is capable of non-linear
classification. It can be used for both classification and regression and has a relatively low
algorithmic complexity. A disadvantage is that there is no real guideline for determining the
number of neurons and layers that should be used. Using too many neurons and layers has a
negative effect on the performance, while using too few causes a decrease in accuracy. The
classifier that is generated by an artificial neural network is also very hard for humans to
interpret.
Some of the disadvantages of classic artificial neural networks can be countered by using deep
learning techniques. Deep learning algorithms use artificial neural network architectures with
many hidden layers. Deep learning algorithms assume that data is generated by many different
underlying factors on different levels and that these factors are organized into multiple levels of
abstraction. They offer the possibility to replace manually crafted features by an unsupervised
way of feature extraction that discovers the best network structure for the given data.
2.3.8 k-Means Clustering
The k-means clustering algorithm is an unsupervised learning algorithm that tries to partition
unlabeled data vectors into k different clusters, in which each cluster has a mean vector. The
algorithm is initialized with k initial mean vectors, selected from the dataset, forming k
clusters. This initial set of mean vectors is often chosen randomly. Next, each remaining data
vector in the dataset is assigned to the cluster of the nearest mean vector. Then, for each
cluster, the new mean vector is calculated and the assignment step is repeated. These two steps
are repeated for a finite number of times or until the cluster means no longer change
and the algorithm converges. The algorithm is illustrated in Figure 2.11. It is clear that
the quality of the solution of this algorithm depends a lot on the choice of the initial set
of mean vectors. Therefore, the algorithm is often run multiple times, each run having a
different initial set of mean vectors; at the end, a majority vote over the different solutions is
performed for each data vector. Another property that influences the quality of the obtained
solutions is the choice of k. This value should match the data on which clustering is
performed. Furthermore, k-means clustering works well when the different clusters are well
separated from each other; otherwise it will probably not be able to distinguish them. It is
also not robust against noise and outliers and fails for non-linearly separable datasets.
Chapter 2. Related Work
Figure 2.11: k-Means Clustering [5]
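The two alternating steps described above can be sketched in a few lines. This is a minimal 2-D illustration; for determinism it takes the first k points as initial means, whereas in practice the initialization is typically random:

```python
def kmeans(points, k, iters=100):
    """Partition 2-D points into k clusters; returns (means, assignments)."""
    # initial means selected from the dataset (often chosen at random;
    # here the first k points, to keep the sketch deterministic)
    means = list(points[:k])
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: each vector joins the cluster of the nearest mean
        assign = [min(range(k),
                      key=lambda c: (p[0] - means[c][0]) ** 2
                                  + (p[1] - means[c][1]) ** 2)
                  for p in points]
        # update step: recompute each cluster's mean vector
        new_means = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_means.append((sum(p[0] for p in members) / len(members),
                                  sum(p[1] for p in members) / len(members)))
            else:                      # empty cluster: keep the old mean
                new_means.append(means[c])
        if new_means == means:         # means no longer change: converged
            break
        means = new_means
    return means, assign
```

Running several restarts with different initial means and taking a majority vote, as described above, would wrap this function in an outer loop.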
2.3.9 Gaussian Mixture Model
The Gaussian Mixture Model is a probabilistic model that assumes that all data points are
generated by a weighted mixture of a finite number of underlying Gaussian distributions.
The parameters of these distributions are unknown and thus need to be learned from training
data. It can be viewed as a soft version of the k-means clustering algorithm. The algorithm
that is used to train Gaussian Mixture Models is called the EM-algorithm (Expectation-
Maximization). This algorithm works in about the same way as the k-means clustering
algorithm. It starts with a set of initial parameters and creates a function for the expectation
of the log-likelihood given these parameters. Then it adapts the parameters to maximize the
expected log-likelihood found in the expectation step. These two steps are then repeated for
a finite number of times or until the algorithm converges to a certain solution. This algorithm
is illustrated in Figure 2.12. The same problem of choosing the right initial parameters can
be found here. Very often, the k-means clustering algorithm is used several times first as
described above and its result is used for the initial parameters of the EM-algorithm.
Figure 2.12: EM-algorithm [4]
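The alternating expectation and maximization steps can be illustrated with a toy one-dimensional, two-component mixture. This is a sketch only: the initialization here is a crude sorted split of the data rather than the k-means-based initialization described above:

```python
import math

def em_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM (a toy sketch)."""
    # crude initialization: split the sorted data in half
    xs = sorted(xs)
    half = len(xs) // 2
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point,
        # i.e. the expectation given the current parameters
        resp = []
        for x in xs:
            dens = [w[c] / math.sqrt(2 * math.pi * var[c])
                    * math.exp(-(x - mu[c]) ** 2 / (2 * var[c]))
                    for c in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances to maximize
        # the expected log-likelihood
        for c in range(2):
            n_c = sum(r[c] for r in resp)
            w[c] = n_c / len(xs)
            mu[c] = sum(r[c] * x for r, x in zip(resp, xs)) / n_c
            var[c] = max(sum(r[c] * (x - mu[c]) ** 2
                             for r, x in zip(resp, xs)) / n_c, 1e-6)
    return w, mu, var
```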
2.4 Keystroke Dynamics
Keystroke dynamics is the study of the characteristics that are present in a user’s typing
rhythm when using a keyboard. It typically focuses on timing characteristics of typing to
identify patterns in the data. Most often this includes the analysis of duration of a keypress
or group of keys and the latency between consecutive keys. This classic set of features can
be expanded with other features that give information not only about "how" the user is
typing but also about "what" the user is typing. The way humans type is influenced
by many factors. Apart from individual differences, typing behaviour is also influenced by
the context in which the user is typing. This can be used as a feature because the typing
rhythm may highly depend on whether the user is typing in an instant messaging application
or a word processing application for example. Furthermore, an individual’s typing behaviour
can change when the individual experiences different emotions. This means that information
about a user’s typing behaviour may allow the inference of the user’s emotional state. This
is exactly the objective of this research. In this section some terminology is presented that
is used in keystroke dynamics literature and different approaches that can be followed in this
area are examined. Furthermore, some of the most common features that are used in related
literature are discussed in more detail.
2.4.1 Fixed Text Analysis
A first approach, called fixed or static text analysis, usually requires participants to type one
or more fixed pieces of text multiple times during the data collection process. Using fixed
text to build a model implies that this model can only be used at those moments that the
user types one or more of the fixed pieces of text on which the model was trained. Fixed
text studies typically require participants to enter the fixed text in some predefined text box
while the keystrokes are monitored. This approach is typically useful in authentication and
security applications as it can be used on passwords.
2.4.2 Free Text Analysis
Another approach, called free or dynamic text analysis, does not require the user to type
the same piece of text each time during data collection but allows any sequence of text as
input. Using a dynamic approach enables the model to be used during continuous monitoring,
which is very desirable for affect recognition because it means that emotional information is
available in the computer system at any time. Free text studies typically involve participants
entering any text they like in a predefined text box while the keystrokes are monitored.
However, this is not necessarily the case. It is also possible to use software that monitors all
keystrokes in any application. This approach has the benefit that the participant does not
have to think about what he will type first before actually typing. Another advantage is that
this way, the application context can be monitored too. Furthermore, using this approach
it is also possible to obtain the keystroke data unobtrusively, which is also highly desirable
for affect recognition. These benefits form the motivation for using free text analysis in this
research instead of fixed text analysis.
2.4.3 Keystroke Features
The most commonly used features in related literature are timing features. These include
features that are calculated on both individual keys as well as multiple keys. A first typical
feature is the key duration. This is the time elapsed from the moment that the key was
pressed by the user (keydown event) until the time that the key was released (keyup event).
A keyup event of a certain key does not necessarily follow the corresponding keydown event
directly. For example, when typing an uppercase character the first event is a keydown event
for the Shift key, followed by the keydown event of the character key, then followed by the
keyup events of both keys in any order. Another case in which this is possible is when a
user is typing very fast. In the case of typing two characters very fast, it is possible that the
second character key is pressed before the first character key is released. These cases must
be taken into consideration when analyzing the collected keystroke data.
The key duration feature has also been used on multiple consecutive keys or graphs. A digraph
contains two consecutive keystrokes, whereas a trigraph contains three; this continues for any
number of keystrokes, which creates n-graphs.
A second typical feature is the digraph latency. This is the time elapsed from the keyup
event of a key to the keydown event of the next key. Taking into account the considerations
mentioned above, the latency can be negative when the keyup event of a key comes after the
keydown event of the consecutive key.
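Given these caveats, the duration and latency features can be extracted from a raw event stream. The `(timestamp, key, kind)` tuple format below is a hypothetical one, not that of any particular logging library:

```python
def timing_features(events):
    """Extract per-key durations and digraph latencies from a stream of
    (timestamp_ms, key, kind) tuples, where kind is 'down' or 'up'.
    Handles interleaved events (e.g. Shift held across another key,
    or overlapping keys during fast typing)."""
    pending = {}   # key -> index into `presses` of its still-open keydown
    presses = []   # [down_time, up_time or None, key], in keydown order
    for t, key, kind in events:
        if kind == 'down':
            pending[key] = len(presses)
            presses.append([t, None, key])
        elif key in pending:           # match the keyup to its open keydown
            presses[pending.pop(key)][1] = t
    durations = [(p[2], p[1] - p[0]) for p in presses if p[1] is not None]
    # digraph latency: keyup of press i to keydown of press i+1; negative
    # when the next key goes down before the previous one is released
    latencies = [presses[i + 1][0] - presses[i][1]
                 for i in range(len(presses) - 1)
                 if presses[i][1] is not None]
    return durations, latencies
```

For the uppercase-character example above (Shift down, key down, key up, Shift up), the Shift-to-key latency comes out negative, exactly the case the text warns about.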
When deciding which features to use, it is important to know that larger n-graphs are less
likely to occur in free text and thus cause sparsity when used as features. Using only timing
features has the benefit of preserving privacy, as the actual content of the text is not used.
However, the text content very likely contains valuable information and will therefore be used
in this research.
2.5 Text Content Analysis
The actual content of text contains a lot of valuable information concerning affect. The
starting point of a linguistic analysis of text to extract emotional information from it is the
use of specific affective lexicons. An interesting lexicon is the Affective Norms for English
Words (ANEW) [18]. This lexicon presents a set of verbal materials that have been rated
in terms of pleasure, arousal and dominance. SentiWordnet [24] is another lexical resource
that keeps information on the polarity of subjective terms. For each term three scores are
calculated: objectivity O, positivity P and negativity N . The scores are related to each other
as follows: O = 1 − (P + N). WordNet Affect [60] is a third lexicon that was motivated
by the need for a lexical resource containing explicit fine-grained emotional annotations.
Clore et al. [20] argued that words need to be distinguished based on the fact whether they
directly refer to emotional states or contain an indirect reference that depends on the context.
WordNet Affect is an extension of the WordNet database [49].
A more abstract analysis using specific textual features is also possible. Vizer et al. [64]
used a number of features defined by Zhou et al. [68]. However, some of these features are
fairly straightforward to use (e.g. lexical diversity, content diversity, average word length,
average sentence length) while others still require some extra (possibly manual) semantical
information on word types. For example, a possible feature would be the rate of self-reference
words (e.g. me and I) but this assumes that the system knows that these words are self-
references, which is not trivial as some mapping will be needed. Other possible features are
special punctuation and uppercase word rate [14].
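The straightforward features among these (lexical diversity, average word length, average sentence length, uppercase word rate) can be computed with simple tokenization; this sketch makes no attempt at the semantic features such as the self-reference rate:

```python
import re

def text_features(text):
    """Compute a few of the straightforward textual features mentioned
    above; the tokenization here is deliberately simplistic."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    return {
        # lexical diversity: distinct words relative to total words
        'lexical_diversity': len({w.lower() for w in words}) / len(words),
        'avg_word_length': sum(len(w) for w in words) / len(words),
        'avg_sentence_length': len(words) / len(sentences),
        # rate of fully uppercase words (a possible emphasis signal)
        'uppercase_word_rate': sum(w.isupper() for w in words) / len(words),
    }
```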
It is also possible to make use of the ISEAR (International Survey on Emotion Antecedents
and Reactions) dataset, which contains reports of situations accompanied by the corresponding
emotions that were experienced in these situations. The possible emotions are joy, fear, anger,
sadness, disgust, shame and guilt. The ISEAR project was directed by Klaus R. Scherer and
Harald Wallbott. This dataset was used to train a classifier that can then be used on unseen
pieces of text. The training process of this classifier was done using a vector space model.
Such a model assigns a weight to every term in a document. To calculate these weights,
several schemes exist. A commonly used scheme is TF-IDF weighting (Term Frequency -
Inverse Document Frequency). To calculate this weight for a term in a particular document,
the term frequency for that document is calculated and then the inverse document frequency
is calculated by taking the logarithm of the total number of documents in the dataset divided
by the number of documents containing the term for which the weight is being calculated.
After normalization, the formula for the weight becomes

w_i = (tf_i · log(N/d_i)) / √( Σ_{j=1}^{n} (tf_j · log(N/d_j))² )

where w_i is the weight of term i, tf_i is the term frequency of term i, N is the total number
of documents in the dataset, d_i is the number of documents containing term i and n is the
total number of unique terms in the dataset. All term weights can then be calculated and all
documents can be converted to vectors in the vector space model. This way, each emotion
has several corresponding vectors and a classifier can be trained on this data. This classifier
can use several similarity measures. Nahin et al. [52] achieved better accuracy using Jaccard
similarity than using cosine similarity.
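The weighting scheme and the two similarity measures just mentioned can be sketched as follows (a minimal illustration, not the exact pipeline of [52]):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one normalized TF-IDF weight
    vector (term -> w_i) per document, following the formula above."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # d_i per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(N / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vectors.append({t: v / norm for t, v in raw.items()})
    return vectors

def cosine(u, v):
    # vectors are already unit-norm, so the dot product suffices
    return sum(u[t] * v[t] for t in u if t in v)

def jaccard(u, v):
    # set-based Jaccard similarity over the terms present in each vector
    return len(set(u) & set(v)) / len(set(u) | set(v)) if u or v else 0.0
```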
2.6 Mouse Behaviour
Besides text content and keystroke dynamics, mouse behaviour may also contain useful
information on an individual’s emotional state. Modelling mouse behaviour requires capturing
mouse events. One can distinguish single and double clicks, performed with the left, right or
middle button. Furthermore, one can observe mouse movements and mouse wheel
movements. From the mouse movement data, the distance, angle and speed features between
pairs of data points can be extracted. Usually, it does not make much sense to calculate these
features for each pair of consecutive data points. Instead, a frequency parameter is used that
defines the separation between observed pairs of data points. Also the time that the mouse
is inactive can be used as a feature. From these features, one can calculate means, standard
deviations, moment values and frequencies over windows of N data points.
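A possible sketch of these mouse features, with `step` playing the role of the frequency parameter; the `(timestamp, x, y)` sample format is hypothetical:

```python
import math

def mouse_features(points, step=5):
    """points: list of (t_ms, x, y) samples. Computes distance, angle and
    speed between pairs of points separated by `step` samples."""
    feats = []
    for i in range(0, len(points) - step, step):
        t0, x0, y0 = points[i]
        t1, x1, y1 = points[i + step]
        dx, dy = x1 - x0, y1 - y0
        dist = math.hypot(dx, dy)
        angle = math.atan2(dy, dx)
        speed = dist / (t1 - t0) if t1 > t0 else 0.0   # pixels per ms
        feats.append((dist, angle, speed))
    return feats

def summarize(values):
    """Mean and standard deviation over a window of feature values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)
```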
2.7 Authentication and Security
The interest in keystroke dynamics originated after a study by Gaines et al. [27] who observed
that individuals seem to have unique typing behaviours. It was suggested to use this infor-
mation in authentication systems to provide an extra security measure. In these applications,
the goal is to identify a user by analyzing his typing pattern. This principle is very analogous
to the usage of handwritten letters and signatures. This goal can be achieved by asking a
user to type a fixed sequence of keys (possibly multiple times) and analyzing the same timing
features discussed above. This fixed sequence of keys could be a password for example. When
a user wants to authenticate himself, he then needs to supply the correct password but also
supply it in the correct way. However, it is also possible to use a free text approach. This
is especially interesting because a system could be used that continuously monitors the user
during an authenticated session and detects when a possible intruder starts using the system.
One of the first studies in the area of authentication using keystroke dynamics was performed
by Monrose and Rubin [50]. They collected keystroke data from several participants that
were asked to type some fixed sentences as well as some free text. They tried different
distance metrics using the k-nearest neighbor algorithm and achieved a correct identification
rate of approximately 90% using a weighted probabilistic classifier. They observed that free
text classification did not perform as well as fixed text classification and attributed this to
variations in operational conditions in which the user may be absorbed (e.g. emotionally
charged situations) [51].
Fairhurst et al. [25] studied the effect of multiclassifier systems on the accuracy of identity
prediction based on keystroke dynamics. They used a k -NN classifier, a decision tree algo-
rithm and a naive Bayes classifier. They combined these classifiers using DCS-LA (Dynamic
Classifier Selection using Local Accuracy), majority voting and summation. These multiclas-
sifier systems yield much higher accuracies than the single classifiers do. Furthermore, they
achieved accuracies of more than 95% for gender prediction. This is an important observation
for this research as this indicates that it is possible to distinguish between men and women,
implying that this classification will make emotion classification easier too, assuming that
emotional behaviour is also related to gender.
Some interesting work in the area of identification using free text keystroke dynamics has
been done by Traore et al. [61]. They propose a new approach that combines monograph and
digraph analysis and uses a neural network to predict missing digraphs based on the relation
between the monitored keystrokes. They achieved a false acceptance rate (FAR) of only
0.0152%, a false rejection rate (FRR) of 4.82% and an equal error rate (EER) of 2.46% in
a heterogeneous environment. The results for a homogeneous environment are even better
but are less reliable due to a low number of samples that were used.
An important observation in keystroke dynamics research for authentication purposes is that it
is possible that an individual’s typing rhythm changes throughout the course of his life. There
have also been many reports of non-negligible variance in an individual’s typing behaviour.
This can be attributed to both physiological and psychological factors. This last factor is
especially of great importance in this research.
Also mouse behaviour can be used in authentication systems. Pusara et al. [54] presented an
approach to user re-authentication based on only mouse behaviour data. This is an example
of continuous analysis and the detection of anomalies, leading to a re-authentication request.
They used the C5.0 decision tree algorithm using the mouse features discussed above and
have achieved a FAR of 0.43% and a FRR of 1.75%.
2.8 Emotion Recognition in Computer Systems
In this section, related work in the specific domain of emotion recognition in computer systems
will be discussed. A summary of the methodologies, techniques and results of some of this
previous work will be given. This information is also put in a concept matrix in Appendix A.
Some of the first work that was done in this area can be attributed to Zimmerman et al. [70].
They described a method to correlate human keyboard and mouse interaction with affec-
tive states. They used movie clips to induce different affective states in the PA emotion
model. Participants had to shop on an e-commerce website for office supplies while several
physiological parameters were measured such as respiration, pulse, skin conductance level
and corrugator activity. During the experiment all mouse and keyboard actions were regis-
tered too. Later [69], they were able to show, using this methodology, that affect impacts
motor-behaviour of computer users.
Vizer et al. [64] proposed a new way of assessing human cognitive and physical stress by
analyzing keystroke dynamics using both content and timing features. This is based on the
observation of variability and drift in an individual’s typing pattern which has been attributed
to situational factors as well as stress and fatigue and thus to changes in cognitive and physical
function. They collected free text keystroke samples over multiple sessions for each participant
consisting of baseline and control samples under no stress, samples under cognitive stress
and samples under physical stress. Participants had to describe their stress level after each
session using an 11-point Likert scale. Cognitive stress conditions were induced using mental
multiplication and three-back number recall. Physical stress was induced by walking on a
treadmill and performing biceps curls. They used different machine learning algorithms and
achieved correct classification rates of 62.5% for the physical stress condition and 75% for the
cognitive stress condition. They stated that their results are comparable to other affective
computing techniques for stress detection.
Epp et al. [23] investigated the possibility to identify different emotional states using keystroke
analysis. They focused on gathering keystroke data in a natural context using the experience
sampling methodology rather than a laboratory environment. Each participant installed a
program running in the background to collect free text. Based on the participant’s activity, the
program prompted the participant to fill out a questionnaire that contained 15 5-point Likert
scale questions, each one regarding one of the following emotional states: anger, boredom,
confidence, distraction, excitement, focused, frustration, happiness, hesitance, nervousness,
overwhelmed, relaxation, sadness, stress and tired. After filling out each questionnaire, the
participant had to enter a piece of fixed text. A large number of features were extracted from
this data: duration times for single keys, digraphs and trigraphs; latency times for digraphs
and trigraphs; times between pressing two successive keys in graphs; number of events in
graphs; number of characters, numbers, punctuation marks and uppercase characters. They
also used outlier removal, feature selection and undersampling to avoid class skew. The C4.5
decision tree algorithm was used to build a binary classifier for each of the 15 mentioned
emotional states. The classifiers for free text did not perform well so only the results for fixed
text were included. They achieved reliable accuracy rates ranging from 77.4% to 87.8% for
the confidence, hesitance, nervousness, relaxation, sadness and tiredness states.
Alhothali [13] used a dialogue-based tutoring system to spontaneously induce affective states
that relate to learning: delighted, neutral, confused, bored and frustrated. The experiment
was performed in a laboratory setting and collected timing, typing and response features
during the participant’s interaction. The participants had to determine their emotion after
each response using 5 statements on a 5-point Likert scale. Two external judges also observed
the participants and provided their responses for the participants. The classification was
divided in two stages. First, classification of the emotional valence was performed. This
classification method yielded an accuracy of 82.82%, 72.02% and 77.2% for the user-labeled,
judge1 and judge2 datasets respectively using an artificial neural network. The classification
for the specific emotions only reached an accuracy of 53.59%, 45.6% and 53.89%
for the user-labeled, judge1 and judge2 datasets respectively, also using an artificial neural
network.
Tsoulouhas et al. [62] introduced a method to detect student boredom during the attendance
in an online lesson that was followed in a laboratory setting. They extracted features from the
movements of the user’s mouse such as inactivity timings, speeds and direction of movements.
These features were then used as input to the C4.5 decision tree algorithm. They also used
the content features of the different learning objects. Each time after a certain period of
inactivity the user is prompted with a pop up dialog which asks the user to indicate whether
he is bored with a single "Yes" or "No". The dialog appears only when the mouse is moved
again so that the duration of the pause is not influenced by the system. They achieved a
correct classification rate above 90%.
Besides keystroke and mouse features, it is also possible to use other additional
sensors that are present in today’s smartphones. LiKamWa et al. [44] performed an
experimental study, which they extended in [45], to classify different moods using this kind of
data. As noted above, moods are less intense and last longer than emotions. They use the
circumplex emotion model to quantify moods and use frequency features from application
usage, phone calls, SMSes, emails, web browsing history and location. All data is collected in
a natural environment. A very high accuracy rate is achieved for individual models and they
observed that when using some of the chosen features individually, a similar or slightly worse
accuracy is achieved compared to when all features are used. This indicates that only a subset
of features is responsible for the high accuracy. However, this specific subset is dependent on
the specific user that is observed. Just like in the research of Epp et al. [23], the acquired
dataset suffers from class skew.
Continuing in the area of smartphones, Lee et al. [42] investigated the possibilities to auto-
matically recognize emotion in social network service posts so that correct emoticons could
be automatically added to it. Participants were asked to write short messages in a natural
environment reporting their emotional state. Seven emotions were predefined: happiness,
surprise, anger, disgust, sadness, fear and neutral. Initially 14 features were used. Some of
them were keystroke features such as typing speed and frequencies of pressing backspace, en-
ter and special symbols. Other features described the content of the messages that were sent.
Furthermore, additional features were used such as touch count, long touch count, device
shake count, illuminance, discomfort index, location, time and weather. Of these 14 features
10 were selected that correlated the most with the emotion information to build an inference
model. A Bayesian network was used as classifier. After the emotion of a text message was
recognized, the user decided whether the emotion was accurate or not and provided feedback.
The model was then updated immediately. An average classification accuracy of 67.52% was
achieved but this strongly depended on the type of emotion. The best recognition rates were
achieved for happiness, surprise and neutral. However, they have found a general correlation
between the recognition accuracy and the amount of observation cases for each emotion. This
indicates that more data is necessary for emotional states for which the accuracies are lower.
Tsui et al. [63] validated the hypothesis about the existence of the difference in typing patterns
between different emotional states using the facial feedback hypothesis. They performed an
experiment in a laboratory setting in which they asked participants to type a fixed number
sequence under different emotional states induced by the facial feedback. One state induced
positive emotion and another one induced negative emotion. During the experiment, the
keystroke data was recorded. They used features such as duration and latency. The results
supported the initial hypothesis about the relation between typing patterns and emotional
states.
Nahin et al. [52] tried to detect emotions by analyzing a combination of keystroke dynamics
and textual contents typed by a user. They distinguished seven emotional classes and used
both fixed text analysis and free text analysis. Two additional classes were used (neutral and
tired) in case the user is not in any of the seven classes. They extracted 19 keystroke features
and 7 of them were actually selected. For textual content analysis the ISEAR dataset and
WordNet was used for text pattern analysis. The text is converted to a vector using the
vector space model. Different algorithms were used to train models for each emotional class
and relatively high accuracies were achieved for fixed text analysis. The results for free text
were less satisfying but still better than chance.
Hernandez et al. [30] present a method to evaluate a user’s boredom and frustration in an
intelligent learning environment. In contrast to [62] they mainly focus on free text analysis
and perform experiments in a natural environment. They extract commonly used keystroke
features based on previous research and also take mouse dynamics into account as extra
features. Furthermore, a genetic feature selection algorithm is used to obtain a good feature
subset and the k -NN algorithm is used for classification. This improves the classifier results
and obtains a correct classification rate of about 83% and 74% for boredom and frustration
respectively. These rates are similar to fixed text analysis classification rates obtained in
other research.
Another study, conducted by Lee et al. [43], examines the source of variance in keystroke
typing patterns caused by emotions using visual stimuli to induce emotional states. They
used the International Affective Picture System (IAPS) in a laboratory setting and asked the
participants to type a fixed number sequence. The keystroke data was recorded during the
typing task and afterwards a self-assessment manikin had to be submitted. The results of
this manikin were translated into three levels of the ANOVA factors valence and arousal. The
results of the experiment indicate that the effect of emotion is significant in the keystroke
duration, latency and accuracy rate of the keyboard typing. However, the size of the emotional
effect is small compared to the individual variability.
Shukla et al. [58] did some research that is comparable to previous experiments but proposed
the usage of fuzzy logic. Bakhtiyari et al. [15, 16, 17] describe a fuzzy model for multi-level
human emotion recognition through keyboard keystrokes, mouse and touch-screen interac-
tions. The reason for using a fuzzy model is based on the assumption that emotions have a
fuzzy basis and that human beings may have different emotions at the same time in different
levels. They classified emotions into five different levels and used an experience sampling
methodology to collect their data. Participants were asked to indicate their levels of different
emotions every 4 hours. One important observation was that the emotions indicated by men
were stronger on average than the emotions indicated by women. They used the support
vector machine algorithm to classify emotions and they were able to increase the detection
accuracy up to 5% compared to non-fuzzy models.
Salmeron et al. [57] compared different emotion labeling methods and combinations of these
methods to see how accuracies can be improved. They compared taking a self-assessment
manikin provided by the user, a manikin provided by expert psychologists, a categorical ap-
proach using PANAS and a combination of the manikin provided by the user and the manikin
provided by the psychologists. Furthermore, they compare different data sources: keystroke
analysis, mouse analysis, physiological analysis and sentiment analysis. Data collection was
performed in a laboratory setting using free text. Their results indicate that sentiment anal-
ysis yields the highest accuracy. The algorithm and labeling methods that yield the best
results depend on the data source that is used.
Kolakowska recently reviewed a lot of the research that has been done so far in the area of
emotion recognition in computer systems [37, 38]. The most important conclusions for future
work are that the data collection stage takes a very long time and that this should be taken
into account during the experiment design. Furthermore, it seems that individual models
will perform much better than general models. It is often very hard to compare different
studies because they use different datasets, emotion models, feature sets and algorithms.
The used datasets are usually not made publicly available which prevents the improvement
of existing research results. Moreover, recognition accuracy seems to vary among different
emotional states. It is crucial to obtain enough data to obtain reliable results. Especially
in the case of individual models, data of different users cannot be combined and thus each
user needs to provide a large amount of data. A number of interesting possible applications are
presented [36], including possibilities for optimizing software usability, improving development
processes, education, websites and video games. For each of these possible applications,
various research methods are proposed, as well as the possible challenges that need to be
overcome.
(a) Transform the data to a higher dimensional space so that it becomes linearly separable
(b) Apply the SVM technique and transform the solution back to the original space
Figure 2.8: Kernel trick [7]
(a) General structure [1]
(b) Neuron [10]
Figure 2.10: Artificial neural network
Chapter 3
Data Collection
This chapter discusses the first step of the research. Before being able to analyze data and
start building models for emotion recognition, data needs to be obtained. A field study was
performed using custom built software. In the first part of this chapter the details of this
field study are presented. Then the actual software that was built is discussed.
The field study was designed to gather participants’ keystroke, mouse, location and weather
data together with subjective indications of emotional states in the pleasure-arousal-dominance
model. The study is conducted in a naturalistic setting while participants perform their daily
tasks using an experience sampling method. This approach is more representative of a
real-world context and allows mostly uninfluenced data to be captured. The participants are
periodically requested to indicate their emotional state while other data is continuously collected
in the background. The data collection process and scheduling of emotional state requests is
done by a piece of software that is installed on the participant’s computer.
This approach has some disadvantages compared to a more controlled approach. It is
almost impossible to control emotional states, as this requires eliciting emotional responses,
which is hard to achieve in an uncontrolled environment. As a result, there are
no real guarantees of clean and correctly labeled data. Therefore, to be able to create models
with sufficient predictive power, more data needs to be gathered when using an uncontrolled
approach. Again, this is difficult to enforce in an uncontrolled environment as participants
are not obligated to spend certain amounts of time on the computer and provide a sufficient
amount of data. This requires the study to run over a relatively long period of time, preferably
using participants that spend a lot of time on the computer.
Controlled lab studies have the disadvantage that they can be very time-consuming, which
means that to collect a sufficient amount of data, participants need to attend multiple
sessions. Thus, it will be more difficult to find participants who are willing to partake in the
study. Additionally, this can be very costly as it may require more administration and
compensation. Furthermore, often only a small set of variables is used, which may be a
limiting factor for exploratory research aimed at determining which variables are of interest.
An experience sampling approach was chosen over a controlled laboratory approach after
weighing the relative advantages and disadvantages of each.
In this research, the pleasure-arousal-dominance (PAD) model was chosen for emotion
indication, as it is a general emotion model with few downsides. It allows determining which
emotional states are most suited for detection using this approach, while still letting
participants express a wide range of emotional states. Furthermore, this model lends itself
well to mapping onto other models, such as a categorical model. Participants were periodically
asked to indicate their emotional state using three scales (one for each axis of the PAD model)
ranging from 0 to 100, where 0 stands for the most negative value on the axis and 100 for
the most positive. Note that these numbers were not shown on the scales, so a participant
would not take any numerical value into account. This enforces a more intuitive way of
emotion indication.
Participants were not obligated to respond when asked to indicate their emotional state.
When a request was ignored, the collected data was transmitted anyway, without a
corresponding emotional state label. Unlabeled data can still be used to create a profile for
each participant that captures their average behaviour.
The data gathered in this study contained many raw elements that need further processing.
For example, in theory the implemented emotion model allows for 101^3 (over one million)
different emotional states. It is possible to perform regression, or to perform discretization
by reducing the many possible values to a smaller number of classes. For example, values at
the outer ends of each scale could be grouped with more moderate values on the same scale
if it is observed that participants are less inclined to choose extreme values. Furthermore,
keystroke and mouse data are collected with corresponding timestamps, but the time
differences between the keystrokes and mouse movements might be more interesting. Many
such processing steps on the raw data will be discussed in Chapter 4.
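As a minimal sketch of such a discretization, each 0-100 axis value could be reduced to a small number of classes. The bin edges below (33/67) are illustrative assumptions, not the ones used in this work:

```python
def discretize_axis(value, low=33, high=67):
    """Reduce a 0-100 PAD axis value to one of three classes.

    The bin edges (33/67) are illustrative; grouping extreme values
    with more moderate ones would simply shift these edges.
    """
    if not 0 <= value <= 100:
        raise ValueError("PAD axis values lie in [0, 100]")
    if value < low:
        return "negative"
    if value < high:
        return "neutral"
    return "positive"

# A full PAD label becomes a 3-class triple instead of one of 101**3 states.
label = tuple(discretize_axis(v) for v in (12, 50, 95))
# -> ("negative", "neutral", "positive")
```

Regression would instead keep the raw 0-100 values as continuous targets; the choice between the two is revisited in Chapter 4.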
3.1 Field Study
The field study was conducted from November 15th, 2015 until May 1st, 2016, with 14
participants contributing data for, on average, 22 weeks. All participants were recruited at
our personal request; no incentives were offered.
3.1.1 Set-up
Upon signing up for the study, each participant had to provide some registration informa-
tion (including some profiling information). This was done through a website that was pro-
grammed in PHP1, which facilitated automated processing and the remote administration of
the study. The registration information included the participant’s first and last name, e-mail
address, gender, birthdate, place of birth, occupation, education, nationality, first language,
most used language on the computer, dominant hand, typing skills (low, intermediate or
high), computer skills (low, intermediate or high), percentage of total computer time spent
on the concerned computer, keyboard layout, mouse type and computer type. After submit-
ting this information, the participant was directed to a consent form that needed to be agreed
upon by the participant. The agreement was enforced by an explicit check box that needed
to be checked before being able to continue the signup. The IP address of the participant’s
computer and the timestamp when the user was registered was collected and a unique user
id was generated using the MD52 hashing scheme. All participant information was sent over
a secured TLS/SSL3 connection and kept securely in a MySQL4 database on the server that
hosted the website. After having agreed with the consent form, the participant was directed
to a download page where a simple installer file could be downloaded and clear installation
instructions could be found. All participants were guided through this process as much as
needed. The three pages described above can be found in Appendix B, C and D.
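The id generation step can be sketched with Python's hashlib. The exact inputs to the hash are not described above, so hashing the e-mail address together with the registration timestamp is purely an illustrative assumption:

```python
import hashlib

def make_user_id(email: str, registered_at: str) -> str:
    """Derive an anonymous 32-character participant id with MD5.

    Hashing the e-mail plus registration timestamp is an assumption for
    illustration; the study only specifies that MD5 was used.
    """
    digest = hashlib.md5(f"{email}|{registered_at}".encode("utf-8"))
    return digest.hexdigest()

uid = make_user_id("participant@example.com", "2015-11-15T10:00:00")
```

Note that MD5 is used here only as a compact identifier scheme, not for cryptographic protection; the anonymity guarantee comes from not storing the link between id and name after the study.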
Furthermore, each participant was added manually to a dedicated university mailing list which
was used to be able to quickly send out communication to all participants at once.
3.1.2 Restrictions
The participants were not restricted in their computer usage during the entire course of the
study. There were also no restrictions on the language that could be typed, as most
participants often need to use both Dutch and English. The operating systems that
participants could use were, however, restricted to different versions of Windows, because
the participants were required to install platform-dependent data collection software on their
computers. This restricted the study to participants who use the Windows operating system.
The versions of Windows that were used are Windows 7, Windows 8, Windows 8.1 and
Windows 10.
Both laptop and desktop users could partake in the study and external devices could also be
used. This implies that keystrokes may originate from either a laptop keyboard or an external
1 PHP: Hypertext Preprocessor https://www.php.net/
2 MD5: https://en.wikipedia.org/wiki/MD5
3 TLS: Transport Layer Security https://en.wikipedia.org/wiki/Transport_Layer_Security
4 MySQL: https://www.mysql.com/
keyboard. The same holds for the pointing device movements. These may originate from a
trackpad or an external mouse. However, upon signing up for the study, participants had to
indicate which sources they would be using as well as the keyboard layout that they were
using (AZERTY/QWERTY/...).
3.1.3 Privacy Issues
The sensitivity of the collected data initially made most participants apprehensive about
partaking in the study; most were not eager to be monitored continuously. Because the data
collection software is essentially a very advanced key logger, participants often associate it
with malware and privacy invasion. To make sure participants were comfortable with
running the software on their computers, a detailed explanation of the data collection process
and the data storage principles was given individually to each participant, and they had the
opportunity to ask questions. A number of guidelines from an ethics committee were also
taken into account. Accordingly, all data was collected anonymously using a code that
cannot be linked back to the name of the participant after the study. Every participant also
needed to agree with an informed consent form, as mentioned before. Furthermore, certain
guarantees were made to assure safe transmission of the data to the server.
However, not all persons who were asked to participate felt comfortable even after these
extra safety measures and explanations, and they decided not to partake in the study. A
laboratory study in which participants are aware that they are being monitored would
probably not pose such problems, as the participant is able to control his behaviour and can
avoid passing sensitive information. However, the 'hidden' aspect of advanced key logger
software is exactly what makes keystroke dynamics and mouse movements so interesting for
emotion recognition, as the user will not really think about the fact that he is being
monitored. Therefore, the participant's emotional state will only be minimally influenced by
the knowledge of being monitored.
3.1.4 Meantime Study Evaluation
On December 26th, 2015 all participants received an email (through the mailing list in which
they had been included upon signup) asking them to fill out a short study evaluation form
that investigated whether participants experienced problems using the software and what the
causes were. The participants were asked how often they filled out an emotional questionnaire
and, if this was less than once a day, what the reason was for not doing so more often. The
participants were also asked to indicate how comfortable they were with the study, how often
they used their computers and for which periods of time. At the end of the form, participants
could also add extra comments.
This evaluation revealed some software issues and also indicated what caused some
participants to provide small amounts of data. The two main causes were software issues
(which will be discussed later) and laziness or inappropriate timing of the notification.
Almost all participants felt comfortable with providing the information, assuming it
remained confidential and was only used for academic research purposes; some participants
were more sceptical. Most participants also had no trouble indicating their emotional state
using the software, though some mentioned that they found the scales unclear or did not
understand the PAD emotion model very well. Most participants spent on average 3 hours a
day on their computer, divided over different sessions of more than an hour each on average.
3.1.5 Completion
On May 1st, 2016 all participants received an email (through the mailing list) that notified
them that the data collection process was finished and that they could remove the software
from their computer. Manual check-ups were also performed to make sure that they had
successfully removed the software. The participants were also notified that they could request
the data that was collected about them, if desired.
3.2 Participant Demographics
As mentioned before, some additional information was obtained from the participant pop-
ulation to be able to create a profile for each participant and possibly use this information
for improving the predictive models. This additional information was obtained during the
registration phase when the participant signed up for the study. A relevant list of profiling
questions can be found in Table 3.1. For question 8 the participants could choose between
left or right. For question 9 the participants could choose either low (typing at a rate less
than 180 keystrokes per minute), intermediate (typing at a rate higher than 180 keystrokes
per minute and less than 300 keystrokes per minute) or high (typing at a rate higher than 300
keystrokes per minute). For question 10 the participants could choose either low (elementary
use), intermediate (experience with a large set of applications and a basic understanding of the
operating system in use) or high (very experienced in programming and a good understanding
of computer logic).
Originally, 22 participants partook in the study, but only 14 of them were sufficiently active
(>10 labeled samples). The data of the less active participants was removed, and only the
responses of the sufficiently active participants to the questions of Table 3.1 are presented
below.
Table 3.1: Demographic questions during registration

1 Gender
2 Birthdate
3 Occupation
4 Education
5 Nationality
6 First language
7 Most used language on this computer
8 Dominant hand
9 Typing skills
10 Computer skills
11 Percentage of total computer time that is spent on this computer
12 Keyboard layout
13 Mouse type
14 Computer type

Of the 14 active participants, 12 were male and 2 were female. The age distribution can be
found in Figure 3.1. All participants were students and the specific education distribution can
be found in Figure 3.2. All participants were Belgian and had Dutch as their first language.
All participants mostly used Dutch on their computers except for 1 participant who used
English as much as Dutch. 12 participants were right handed while 2 participants were left
handed. The distributions for typing skills, computer skills and percentage of total computer
time that is spent on the computer used for the study can be found in Figure 3.3. All
participants used the AZERTY keyboard layout. The distribution of the mouse type can be
found in Figure 3.4. Note that it is possible that a participant uses multiple pointing devices.
One participant used a desktop computer while the other 13 participants used laptops.
3.3 Software
The data collection software was a Windows application called Behaviour Analyst, developed
in C# using the .NET 4.5 framework. It was tested on Windows 10 but should work in any
Windows environment that supports .NET 4.5.
3.3.1 Installation, Automatic Updates and Heartbeat
The software was distributed through a website where participants could download an in-
staller package and view an easy and detailed installation manual (see Appendix D). The
installer package was developed using the Nullsoft Scriptable Install System5 and presented
5 NSIS: http://nsis.sourceforge.net/
Figure 3.1: Age distribution of participants

Figure 3.2: Education distribution of participants

Figure 3.3: Distributions of skills and time spent on the computer of participants:
(a) typing skills and computer skills; (b) time spent on the computer

Figure 3.4: Mouse type distribution of participants
the participant with a short license agreement before installing the software on the system,
adding it to the startup programs in the operating system so that it would also run after
a reboot and adding configuration settings to the Windows registry. Screenshots of the in-
stallation process are shown in Figure 3.5. After the installation finished, the software was
automatically started in the background. The software is only visible as a small icon in the
system tray (see Figure 3.6). The participant had to double-click this icon to make a control
panel pop up, as shown in Figure 3.7. This control panel contained some configuration
settings, including the participant's id. Initially, this field was empty and had to be filled in
manually by each participant. The id to fill in was automatically generated by the download
page and included in the installation manual. When the participant had filled in the correct
id and clicked the Apply-button, the control panel disappeared and the installation was
finished.
(a) Introduction page (b) License agreement
(c) Installation destination (d) Finish and startup
Figure 3.5: Software installation
Figure 3.6: Icon in the system tray
Figure 3.7: Control panel
The software checked for available updates each time it was started and every 30 minutes
thereafter. This was done by downloading an XML file from an update server that contained
information on the most recent version of the software. If the software detected that it was
not up-to-date, it would immediately download the installer package and run it in silent
mode. The installer package made sure that the software was shut down properly before
updating the necessary files and restarting the software. The participants did not notice
anything of the whole updating process.
While performing a check for available updates, a heartbeat was also automatically sent to
a heartbeat server. This heartbeat contained the participant’s id and the software version
being used.
3.3.2 Capturing Data
Three types of data were captured: keystrokes, mouse movements and location data. Weather
information was derived from the location data.
Keystrokes
The data collection software used a low-level Windows function, accessed through unmanaged
code in C#, allowing it to process each keystroke before it is forwarded to the intended
application. All capturing of keystrokes was handled by a monitor class. This class also
makes use of a time window to make sure that the participant is not bothered with emotional
questionnaires too often, or when keystroke data is no longer relevant. It does this by
keeping separate log files for each window and keeping track of the current window. The
window is always named after the Unix timestamp6 (in milliseconds) of the first keystroke
record it contains, and the corresponding log file is called k[first window timestamp].log.
This window can expire when there have been no keystrokes for a long time. This is called
the expiration time in the software and can be set in the control panel. In this research, a
value of 1 hour (or 3600000 ms) was used. When a keystroke is captured, the monitor first
checks whether the
time that has passed since the last keystroke is smaller than the expiration time. If this is not
the case, the active window is terminated and the data contained by it is sent to the server,
using a secured connection, without asking the participant to indicate his emotional state,
and a new window is started. Captured data without emotion information is called unlabeled
data. When the active window is not expired, the monitor checks whether the keystroke is
either a keydown or keyup event. Next, a line containing information about the keystroke
(timestamp, keydown/keyup, key value, virtual key code and the title of the active window)
is added to the log file of the window.
Then, the monitor will check whether the active window contains a certain amount of keystrokes
that are keydown events. This is called the keystroke threshold and can also be set in the
control panel. In this research, a value of 1000 keydowns was used. The monitor will also
check whether the time that has passed since the start of the window is larger than the so-
called time threshold. The time threshold makes sure that participants are not bothered too
often. It can be set in the control panel and in this research a value of 1 hour was used.
Only when both of these conditions are met, the participant is asked to fill out an emotional
questionnaire. This is done using a native Windows notification (see Figure 3.8). When this
notification is ignored, the data contained by the window is again sent to the server, using a
secure connection, as unlabeled data and a new window is started.
Figure 3.8: Questionnaire request notification
6 Unix time: https://en.wikipedia.org/wiki/Unix_time
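The windowing rules above can be sketched as follows. This is a simplified model (the real implementation is a C# monitor class); the constants mirror the settings used in this research, and the returned action names are illustrative:

```python
EXPIRATION_MS = 3_600_000      # 1 hour without keystrokes expires the window
KEYSTROKE_THRESHOLD = 1000     # keydown events needed before asking
TIME_THRESHOLD_MS = 3_600_000  # minimum window age before asking

def handle_keystroke(window, now_ms, is_keydown):
    """Return the action the monitor would take for one captured keystroke.

    `window` is a dict with 'start_ms', 'last_ms' and 'keydowns'.
    """
    # Expired window: ship its data unlabeled, start fresh.
    if now_ms - window["last_ms"] >= EXPIRATION_MS:
        return "send_unlabeled_and_start_new_window"
    window["last_ms"] = now_ms
    if is_keydown:
        window["keydowns"] += 1
    # Only ask when BOTH the keystroke and the time threshold are met.
    if (window["keydowns"] >= KEYSTROKE_THRESHOLD
            and now_ms - window["start_ms"] >= TIME_THRESHOLD_MS):
        return "show_questionnaire_notification"
    return "log_keystroke"
```

Requiring both conditions is what keeps questionnaire requests infrequent even during long, intensive typing sessions.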
Mouse movements
Mouse movements are also captured by the same monitor class using a low-level Windows
function accessed by unmanaged code in C#. However, not all movements are logged. The
monitor uses a so-called mouse frequency which is expressed in Hz (1/s) and can be set in
the control panel. This frequency determines how often mouse movements should be logged.
Each time the monitor captures a mouse movement, it checks whether the time (in ms) that
has passed since the last mouse movement in the active window is greater than or equal to
1000 / (mouse frequency) ms. If this is the case, the coordinates of the mouse are logged together with a
timestamp. In this research, the mouse frequency was set to 1 Hz. The log files for the
mouse movements are stored separately from the keystroke log files. However, they are also
split according to the keystroke windows, have a similar naming structure (m[first window
timestamp].log) and are also sent to the server when the window is terminated (for both the
unlabeled and labeled cases).
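The sampling rule amounts to a simple throttle, sketched below (function and parameter names are illustrative; the default mirrors the 1 Hz setting used in this research):

```python
def should_log(last_logged_ms, now_ms, mouse_frequency_hz=1.0):
    """Log a mouse position only if at least 1000/frequency ms have passed."""
    min_interval_ms = 1000.0 / mouse_frequency_hz
    return now_ms - last_logged_ms >= min_interval_ms
```

At 1 Hz this caps the log at one coordinate pair per second, regardless of how many raw movement events Windows delivers.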
Location and Weather
For capturing the participant’s location information, the software uses the GeoCoordinate-
Watcher class of the .NET framework. The location of the participant is kept up-to-date
asynchronously and each time a location update is performed, the weather information for
the new location is also requested. Weather information is obtained through the OpenWeath-
erMap7 API. This API takes coordinates in the form of (latitude, longitude)-pairs and returns
weather information such as temperature, humidity, air pressure and more.
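A request against the current-weather endpoint might be built as follows. The endpoint path and the lat/lon/appid parameter names reflect the public OpenWeatherMap API; the API key and the helper itself are placeholders:

```python
from urllib.parse import urlencode

def weather_url(latitude, longitude, api_key):
    """Build an OpenWeatherMap current-weather request for a coordinate pair."""
    params = urlencode({"lat": latitude, "lon": longitude, "appid": api_key})
    return f"https://api.openweathermap.org/data/2.5/weather?{params}"

url = weather_url(51.05, 3.72, "YOUR_API_KEY")  # roughly Ghent
```

The JSON body returned by this endpoint (temperature, humidity, pressure, and so on) was stored as-is, as described in Section 3.3.4.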
3.3.3 Questionnaire
When the participant receives a notification that requests to indicate his emotional state and
clicks this notification, the participant is presented with a small questionnaire form. This form
is shown in Figure 3.9. It contains three sliders, one for each dimension of the PAD emotion
model. These can be used by the participant to indicate his emotional state. The participant
can also view a window containing extra explanation on how to indicate emotions in the PAD
model (the Tips window, see Figure 3.10), when clicking on the button with the question mark
on it. This Tips window is also shown automatically when the participant clicks a notification
for the first time before he is presented with the questionnaire form. When the participant has
indicated his emotional state, the Submit-button can be pressed and the window containing
the keystroke and mouse data is packed into a data package together with the location and
weather information at that moment. Furthermore, the data package contains a timestamp
of the moment when the Submit-button is clicked as well as the time that has passed between
clicking the notification and clicking the Submit-button (called delay, to indicate how much
7 OpenWeatherMap: http://www.openweathermap.org
time the participant needed to indicate his emotional state). This data package is sent to the
server using a secured connection.
Figure 3.9: Emotion questionnaire form
Figure 3.10: Tips window
When a participant clicks the notification, but then decides to close the questionnaire form
before submitting any emotional state information, the data package is still created but will
not contain timing information. It will then be sent to the server as unlabeled data, again
using a secured connection. An unlabeled data package has the values -1 for each PAD emotion
dimension and 0 as questionnaire delay value. When the participant is not connected to the
internet or when the software for some reason fails to transmit data to the server, the data that
needs to be sent is saved in a buffer file. This happens for all cases of data transmission (both
unlabeled and labeled data). Each time the software is started, it will first check whether
there is still data in the buffer file and will try to resend it if this is the case. This is why it
is interesting to keep the timestamp of the moment when the Submit-button is clicked. In
case the data is sent at a later time than when the data was submitted, the time of the data
package arriving at the server will be different from the time when it was submitted.
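The buffer-and-retry behaviour can be sketched as follows (the file format and function names are illustrative, not those of the actual C# implementation):

```python
import json
import os

BUFFER_FILE = "buffer.jsonl"  # illustrative name: one JSON package per line

def send_or_buffer(package, transmit):
    """Try to transmit a data package; on failure, append it to the buffer."""
    try:
        transmit(package)
    except OSError:  # e.g. no internet connection
        with open(BUFFER_FILE, "a", encoding="utf-8") as f:
            f.write(json.dumps(package) + "\n")

def flush_buffer(transmit):
    """On startup: resend buffered packages, re-buffering any that fail again."""
    if not os.path.exists(BUFFER_FILE):
        return
    with open(BUFFER_FILE, encoding="utf-8") as f:
        pending = [json.loads(line) for line in f if line.strip()]
    os.remove(BUFFER_FILE)
    for package in pending:
        send_or_buffer(package, transmit)
```

Because each package carries its own submission timestamp, a delayed retransmission does not corrupt the timing information, exactly as the paragraph above explains.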
All log files are also deleted each time the software is started so no unnecessary space is used.
3.3.4 Server and Database
The entire server program was programmed using PHP. All data is stored in a MySQL
database on the server. A complete database schema can be found in Figure 3.11. For
all server requests that require database operations, these operations were contained in a
database transaction to make sure that requests were either completely executed (commit)
or failed without affecting the database otherwise (rollback). There are two main processing
parts: the data processor and the heartbeat processor.
Figure 3.11: Database schema
The heartbeat processor accepts POST requests containing a user and version field. Such
a request is called a heartbeat. On reception of each heartbeat, the server updated the
timestamp of the last heartbeat received of the corresponding participant, as well as the
software version the participant was using. This way, it could easily be seen if software
problems occurred at the user’s side. This will be discussed in more detail below. The
data processor processes all data packages (either labeled or unlabeled) that are sent by
the software. It accepts POST requests containing user, pleasure, arousal, dominance,
time, delay, latitude, longitude, keystrokeData, mouseData and weatherData fields. The
keystrokeData and mouseData fields contain JSON8-encoded data representing collections of
keystroke and mouse movement objects respectively. The weatherData field contains the
JSON-encoded data that is received from the OpenWeatherMap API. First, the request fields
are parsed and a questionnaire record is inserted into the database. The JSON-encoded data
from the weatherData field is directly inserted into the corresponding record field without
further parsing. All data that is inserted into the database fields is escaped to prevent SQL
injection9. Next, the id of the newly inserted record is used as the reference (ref) value for
inserting keystroke and mouse records into the database for each object that is contained in
the corresponding JSON collection.
The server responds to each request by printing a JSON-encoded array that contains a field
with an error code and a corresponding error description. If no errors occurred while process-
ing the request, the error code is 0 and the description will be empty. This server response
is also used by the software to provide feedback to the participants when server errors oc-
cur and to determine whether the data package needs to be stored in the buffer for later
retransmission.
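On the client side, handling such a response could look like the sketch below. The field names "error" and "description" are assumptions matching the description above, and the helper is illustrative:

```python
import json

def handle_server_response(body: str) -> bool:
    """Return True if the package was accepted, False if it must be buffered.

    Assumes the server replies with a JSON object such as
    {"error": 0, "description": ""} on success.
    """
    response = json.loads(body)
    if response["error"] == 0:
        return True
    # Non-zero code: surface the description and signal retransmission.
    print(f"Server error {response['error']}: {response['description']}")
    return False
```

A False return value here is what would trigger writing the package to the local buffer file for later retransmission.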
3.3.5 Problems
During the entire course of the study, some problems occurred with the software. These
were mainly small bugs that were quickly fixed. Most of these small bugs only occurred
in very specific cases which resulted in most participants not experiencing any trouble. A
disadvantage in the beginning of the study was the fact that for each software update, all
participants had to be contacted individually for a manual update. This was the main reason
for adding the automatic updates to the software. Using this feature, bug fixes could be
deployed very quickly and efficiently without having to bother the participants.
Another problem arose at the beginning of the study when the software was rolled
out: a simple installation process did not yet exist. Originally, a zip10-file,
containing the necessary software files and a command prompt file that opened the Windows
startup folder, needed to be downloaded. Each participant then had to manually extract the
files to a desired destination, create a shortcut to the software in the Windows startup folder
and run the software. This was difficult for many less experienced participants and required
a lot of assistance from the study administrator. Also, the Windows registry was not used
to store configuration settings. Instead, these were stored in a configuration file which made
8 JavaScript Object Notation http://www.json.org/
9 SQL injection http://www.w3schools.com/sql/sql_injection.asp
10 Zip file format https://en.wikipedia.org/wiki/Zip_(file_format)
it harder to deploy specific updates regarding configuration settings. Both of these problems
resulted in the creation of a simpler installation process using an NSIS installer and by using
the Windows registry for storing configuration settings.
Participants who had antivirus software installed on their computers often experienced the
problem that the data collection software was marked as malware and automatically removed.
The heartbeat feature offered a convenient solution here, as it made it easy to detect when a
long time had passed since the last heartbeat was received from a participant. Indeed, many
cases of automatic software removal by antivirus software were detected this way, after which
the participant was asked to reinstall the software and continue the data collection process.
Chapter 4
Data Processing
Before models could be built to recognize emotions based on the data that was gathered, this
data needed to be processed and features had to be extracted. This chapter first presents
an overview of specific considerations made before the feature extraction process. Next, the
particular keystroke, mouse and context features that were extracted are discussed as well as
the process of handling the data labels.
4.1 Data Formatting
As mentioned before, all data was persisted in a MySQL database on a data collection server.
It can be seen from Figure 3.11 that this data is structured in a hierarchical fashion. Each
subject record has a number of associated questionnaire records, which in turn have a number
of associated keystroke and mouse records. To be able to work with the data, an efficient way
of navigating through this structure is required. Converting the data to an object-oriented
structure provides an ideal solution. For each participant, a user object can be created that
contains their personal data as its attributes. Furthermore, each user object contains a set of
questionnaire objects that they have submitted. In turn, each questionnaire object contains
the specific questionnaire data, a set of keystroke objects and a set of mouse movement
objects. Python1 was chosen as the main programming language. To convert the data from
the database to such an object-oriented structure, all data can be retrieved from the server
by issuing SQL requests, creating the corresponding objects and linking them to each other.
However, this is a very time-consuming process due to the delay caused by issuing the SQL
requests. Instead, all data was downloaded from the server in a CSV-format2. These CSV-
files could then be read very quickly using Python's csv module3. Next, after having defined
the proper Python classes, the corresponding objects were created and linked in such a way
1 Python: https://www.python.org/
2 Comma-separated values: https://en.wikipedia.org/wiki/Comma-separated_values
3 Python csv module: https://docs.python.org/2/library/csv.html
that a structured set of objects, which could be easily navigated, was obtained. To make the
process of loading the data even faster, the cPickle module4 was used to serialize the data
structure and persist it to a file that could be easily deserialized as well.
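The resulting hierarchy can be sketched with minimal classes. The attribute names are illustrative (the thesis does not list them), and the Python 2 cPickle module is shown here via the interface-compatible pickle module:

```python
import pickle

class Keystroke:
    def __init__(self, timestamp, event, vkcode):
        self.timestamp, self.event, self.vkcode = timestamp, event, vkcode

class Questionnaire:
    def __init__(self, pleasure, arousal, dominance):
        self.pleasure, self.arousal, self.dominance = pleasure, arousal, dominance
        self.keystrokes, self.mouse_moves = [], []  # child records

class User:
    def __init__(self, user_id):
        self.user_id = user_id
        self.questionnaires = []  # child records

# Build the hierarchy, then serialize and restore it in one step.
user = User("ab12")
q = Questionnaire(70, 40, 55)
q.keystrokes.append(Keystroke(1_447_577_000_000, "keydown", 65))
user.questionnaires.append(q)
restored = pickle.loads(pickle.dumps(user))
```

Navigating from a user down to individual keystrokes then becomes simple attribute access, which is exactly what makes this structure convenient for feature extraction.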
Some special cases needed to be considered during the processing of the data. To be able to
calculate keystroke timing features, corresponding keydown and keyup events needed to be
matched. However, some keydown events did not have a corresponding keyup event, which
caused problems for extracting timing features. This problem is the result of a participant
holding down a key for an extended period of time causing multiple keydown events and only
one keyup event. This is often seen when a user wants to type the same character a lot of times
or when holding down the shift key. This pattern was resolved by not taking into account the
intermediate keydown events between the initial keydown event and the keyup event. It could
also happen that a keydown event does not have a corresponding keyup event or vice versa
as a result of the keystroke window that is used during monitoring. For example, when a
keydown event is received by the software and the conditions for closing up the current window
are met, the keydown event will still be in the current window that is being closed while the
corresponding keyup event will be in the new window. Such events were also ignored by the
data processing scripts. Modifier keys and toggle keys also presented some challenges. Each
key on the keyboard has a unique virtual keycode (the vkcode) associated with it. However,
the exact character that a key represents when pressed depends on the state of the modifier
and the toggle keys. For example: the A-key represents a lowercase ’a’ in normal cases but
an uppercase ’A’ when the shift key is pressed at the same time or when the caps lock key
is toggled. Both modifier keys and toggle keys are handled by the data processing scripts
by defining state variables. When a keydown event for a modifier key or toggle key is seen,
the corresponding state changes. For modifier keys, the corresponding state changes back when a
keyup event is seen. This is not the case for toggle keys.
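The matching logic described above can be sketched as follows; the (type, vkcode, timestamp) event format and the function name are hypothetical.

```python
def match_keystrokes(events):
    """Pair keydown/keyup events per vkcode, ignoring the auto-repeat
    keydown events generated while a key is held down.
    Hypothetical event format: (type, vkcode, timestamp)."""
    pending = {}  # vkcode -> timestamp of the initial keydown
    pairs = []
    for etype, vkcode, t in events:
        if etype == "keydown":
            # Ignore intermediate keydowns while the key is already down.
            if vkcode not in pending:
                pending[vkcode] = t
        elif etype == "keyup":
            if vkcode in pending:
                pairs.append((vkcode, pending.pop(vkcode), t))
            # A keyup without a matching keydown (window boundary) is ignored.
    return pairs

events = [("keydown", 16, 0.00),  # shift pressed...
          ("keydown", 16, 0.50),  # ...auto-repeat keydown while held
          ("keydown", 65, 0.60),
          ("keyup", 65, 0.72),
          ("keyup", 16, 0.80)]
print(match_keystrokes(events))
# [(65, 0.6, 0.72), (16, 0.0, 0.8)]
```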
4.2 Feature Extraction
The features that were extracted from the entire dataset can be divided into four categories:
keystroke features, textual content features, mouse movement features and contextual fea-
tures. This section discusses each of these categories in detail.
4.2.1 Keystroke Features
The keystroke data obtained during this study consisted of keydown and keyup
events, accompanied by other characteristics, as explained in Chapter 3. The features
extracted from this data consist of timing features and frequency features.
4Python cPickle module: https://docs.python.org/2/library/pickle.html
Timing Features
A list of all timing features that were extracted from the keystroke data can be found in
Table 4.1. These features were chosen based on related work.
Table 4.1: Keystroke timing features
Name Description
AvgTypingSpeed The number of keystrokes in the sample per second.
D2D latency The duration from a keydown event to the next keydown
event.
U2U latency The duration from a keyup event to the next keyup event.
U2D latency The duration from a keyup event to the next keydown
event.
D2U duration The duration from a keydown event to the corresponding
keyup event of that key.
WeightedMean D2D latency The weighted mean of the D2D latencies.
WeightedMean U2U latency The weighted mean of the U2U latencies.
WeightedMean U2D latency The weighted mean of the U2D latencies.
WeightedMean D2U duration The weighted mean of the D2U durations.
Notice that the D2D latency, U2U latency, U2D latency and D2U duration features are actu-
ally lists of values. In these lists, there is a value for each consecutive pair of keydown events,
keyup events, keyup/keydown events and keydown/keyup events respectively. U2U and U2D
latencies can be negative. For example, when a participant is typing quickly, the second key
in a pair of two keypresses may be released before the first key. This can happen when a
participant presses each key with a different finger or hand. It is also possible that the second
key in a pair of two keypresses may be pressed before the first key is released. The negative
values were kept in the lists. Furthermore, as mentioned before, extra keydown events re-
sulting from holding down a key for an extended period of time were not taken into account
while calculating these features. For each of these value lists, different summary statistics
were used to obtain single feature values. The summary statistics that were used were: mean,
maximum, minimum, standard deviation, variance, mode, median, skew and kurtosis. A
number of these statistics have also been used in related work and others were used because
they describe the distribution of the value lists (e.g. skew, kurtosis, standard deviation). For
the weighted mean features, the weighting was performed according to the inverse of the time
that had passed between the keystroke event and the submission of the sample. The usage of
weighted means as features is motivated by the possibility that more recent keystroke data,
relative to the moment that the sample was submitted, is more indicative for the sample label
compared to older keystroke data.
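The weighted mean computation can be sketched as follows, assuming each value is weighted by the inverse of the time elapsed between its keystroke event and the submission of the sample; the function name and data are hypothetical.

```python
def weighted_mean(values, event_times, submit_time):
    """Weighted mean where more recent events (relative to the moment the
    sample was submitted) weigh more: weight = 1 / elapsed time.
    Assumes all event_times lie strictly before submit_time."""
    weights = [1.0 / (submit_time - t) for t in event_times]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical D2U durations with the timestamps of their keydown events:
durations = [0.12, 0.10, 0.20]
times = [10.0, 50.0, 90.0]
print(weighted_mean(durations, times, 100.0))
```

The most recent duration (0.20 at t = 90) dominates the result, pulling the weighted mean well above the plain mean of 0.14.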
An outlier removal process was also performed for the calculation of the D2D latency, U2U latency,
U2D latency and D2U duration features by using the interquartile range (IQR) of the values
in the lists. This was necessary so that extremely long latencies and durations, caused by
long pauses or incorrectly matched keystrokes, would not be taken into account. For each
list of values, the IQR was calculated and all values higher than the third quartile value +
1.5 times the IQR were removed from the list and not taken into account while calculating
the summary statistics. This is a commonly used technique for outlier removal [59]. The
parameter of 1.5 times the IQR could be easily chosen in the scripts.
In total, 41 keystroke timing features were extracted. Their box-and-whisker plots can be
found in Figure 4.1.
Figure 4.1: Box-and-whisker plots for keystroke timing features
Frequency Features
A list of all frequency features that were extracted from the keystroke data can be found in
Table 4.2. These features have been chosen based on related work.
Note that the NumChar freq, AlphabetChar freq, Space freq, Return freq, Punctuation freq
and AvgWordLength features only give an indication of the text that was actually typed with
the keyboard and not of all the text that was used. For example, it is possible that a partic-
ipant copy-pastes text from a source and does not actually type any of the used characters.
These pieces of copy-pasted text are not included in the features. Also the Errors freq feature
does not include all forms of error correction. A participant may select a piece of text that
is longer than one character and press the delete or backspace key only once to remove the
Table 4.2: Keystroke frequency features
Name Description
NumChar freq The frequency of numerical characters in the sample.
AlphabetChar freq The frequency of alphabetical characters in the sample.
Del freq The frequency of the usage of the delete key in the sample.
Backspace freq The frequency of the usage of the backspace key in the sample.
Errors freq The frequency of errors (sum of deletes and backspaces) in the sample.
Shift freq The frequency of the usage of one of the shift keys in the sample.
Space freq The frequency of the usage of the space key in the sample.
Arrow freq The frequency of the usage of one of the arrow keys in the sample.
CapsLock freq The frequency of the usage of the caps lock key in the sample.
Return freq The frequency of the usage of one of the return keys in the sample.
Punctuation freq The frequency of the usage of a punctuation symbol (’.’, ’,’, ’?’, ’ !’,
’:’ or ’;’) in the sample.
AvgWordLength The average length of the words in the sample.
LongPause freq The frequency of long periods without typing in the sample.
entire piece of selected text instead of pressing the delete or backspace key once for each
character that needs to be removed. A participant may even select a piece of text and start
writing other text to correct an error without ever pressing the backspace or delete key. The
LongPause freq feature is calculated by computing all U2D latencies and then determining the
frequency of values that are higher than a certain threshold. This threshold was set at
60 seconds, as this seemed like a realistic value for detecting non-continuous typing,
and can easily be changed in the script.
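The LongPause freq computation translates directly into code; the function name is hypothetical and the 60-second threshold is the default from the text.

```python
def long_pause_freq(u2d_latencies, threshold=60.0):
    """Fraction of U2D latencies exceeding the pause threshold
    (60 seconds by default, as in the text; easily adjustable)."""
    if not u2d_latencies:
        return 0.0
    return sum(1 for l in u2d_latencies if l > threshold) / len(u2d_latencies)

# Hypothetical U2D latencies in seconds; one long pause out of four:
print(long_pause_freq([0.2, 0.3, 75.0, 0.25]))  # 0.25
```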
In total, 13 keystroke frequency features were extracted. Their box-and-whisker plots can be
found in Figure 4.2. The box-and-whisker plot for the CapsLock freq feature was left out, as
these feature values were non-zero only 7.9% of the time, so its box-and-whisker plot would
obfuscate the figure.
Figure 4.2: Box-and-whisker plots for keystroke frequency features
4.2.2 Textual Content Features
Before extracting textual content features, the pieces of text contained in each sample needed
to be reconstructed. Remember that only a set of keystroke events accompanied by other
characteristics is available. To obtain the entire text content, only the keydown events were
considered as these are the ones that give rise to the production of characters. The states
for the modifier and toggle keys also needed to be kept, as explained in the previous section.
These states were taken into account when determining which exact character is represented
by each keystroke event. Special characters that are not alphanumeric (except for spaces and
new lines) are ignored and not included in the reconstructed text. Backspaces are handled by
removing the last character of the text string at the moment the backspace keystroke occurs.
Once the pieces of text have been reconstructed for each sample, textual content features
can be extracted. The sklearn package5 for Python has a module containing feature
extraction methods for text data. The first step is to convert the text pieces to a matrix of
token counts, i.e. bag of words, which produces a sparse representation of the counts. This
matrix can then be transformed to a normalized term frequency or TF-IDF representation.
The number of features that is produced by this technique depends on the collection of text
pieces that is used as input.
5. Scikit-learn: http://scikit-learn.org/
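The first step can be illustrated with a minimal standard-library sketch that mirrors what sklearn's CountVectorizer produces (the actual implementation used sklearn; the function name and whitespace tokenization here are simplifying assumptions).

```python
from collections import Counter

def bag_of_words(texts):
    """Minimal bag-of-words sketch: build a vocabulary over all
    reconstructed text pieces and turn each piece into a vector of
    token counts (a dense version of CountVectorizer's sparse output)."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["happy happy day", "sad day"])
print(vocab)   # ['day', 'happy', 'sad']
print(matrix)  # [[1, 2, 0], [1, 0, 1]]
```

The resulting count matrix is what gets rescaled into the normalized term frequency or TF-IDF representation; the number of columns grows with the vocabulary of the input collection, as noted above.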
4.2.3 Mouse Movement Features
One mouse movement feature was extracted, based on related work: the average mouse speed.
To obtain this feature, the Euclidean distance between each two consecutive mouse movement records in
a sample was calculated and divided by the time that had passed between moving from the
first position to the second position. The average of these speeds is the average mouse speed.
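This computation can be sketched as follows; the (x, y, timestamp) record format and the function name are assumptions.

```python
import math

def average_mouse_speed(records):
    """Average of the segment speeds between consecutive mouse positions.
    Hypothetical record format: (x, y, timestamp)."""
    speeds = []
    for (x1, y1, t1), (x2, y2, t2) in zip(records, records[1:]):
        dist = math.hypot(x2 - x1, y2 - y1)  # Euclidean distance
        dt = t2 - t1
        if dt > 0:
            speeds.append(dist / dt)
    return sum(speeds) / len(speeds) if speeds else 0.0

records = [(0, 0, 0.0), (3, 4, 1.0), (3, 10, 3.0)]
print(average_mouse_speed(records))  # (5.0 + 3.0) / 2 = 4.0
```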
4.2.4 Contextual Features
As some contextual information, such as location and weather data, was also collected during
the study, this data could be used as well. The actual coordinates of participants are not
particularly interesting as these will probably be different for each participant. Weather
data is however less variable and can be used to infer a participant’s mood [32]. More
specifically, the temperature, pressure and humidity from this weather data were extracted.
The discomfort index was also used as a contextual feature. The formula for this index is as
follows: DI = T − 0.55 · (1 − 0.01 · H) · (T − 14.5), where T is the temperature expressed
in °C and H is the relative humidity expressed as a percentage. The box-and-whisker plots can
be found in Figure 4.3.
Figure 4.3: Box-and-whisker plots for contextual features
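The discomfort index formula translates directly into code:

```python
def discomfort_index(temp_c, humidity_pct):
    """DI = T - 0.55 * (1 - 0.01 * H) * (T - 14.5),
    with T in degrees Celsius and H the relative humidity in percent."""
    return temp_c - 0.55 * (1 - 0.01 * humidity_pct) * (temp_c - 14.5)

# A warm day (30 degrees C) at 60% relative humidity:
print(round(discomfort_index(30.0, 60.0), 2))  # 26.59
```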
Because some of the participants had disabled the location services on their
computer, not all samples contain weather data as this depends on the location of the partic-
ipant. Thus, these features contain missing data and the models need to be able to deal with
this in order to use them.
In total, 1177 labeled samples were collected and 59 features were extracted. These can be
used to build machine learning models for emotion recognition.
Chapter 5
Model Building
In this chapter the different models that were built are discussed. A first subdivision is made
based on whether they are a regression or classification model. Next, a second subdivision is
made based on whether the model is built using data of all participants or whether separate
models are built, one per participant. Finally, a third subdivision is made based on whether
the model is built using keystroke dynamics data, mouse dynamics data and contextual data
or if the model is built using textual content data. For the former models, i.e. the dynamics
models, random forest models were used with 500 trees and with the maximum number of features
used per tree set to the square root of the total number of available features. The
random forest algorithm is a robust algorithm that is not prone to overfitting. It has been
reported to outperform the SVM algorithm and requires little preprocessing. For the latter models,
i.e. the models using textual content data, SVMs with a linear kernel were used. SVMs have the
advantage of being able to deal with high-dimensional datasets, as is the case for textual
content data. The optimization of the model hyperparameters will be discussed later. For each
model type, different approaches were used to obtain a good model. For the model evaluation,
10-fold cross-validation was used.
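As a minimal sketch of how a 10-fold split partitions the sample indices: the actual evaluation used library cross-validation utilities, and the interleaved fold assignment shown here is just one simple choice (scikit-learn uses contiguous or shuffled folds).

```python
def kfold_indices(n_samples, k=10):
    """Sketch of a k-fold cross-validation split: every sample appears in
    exactly one test fold, and the remaining folds form the training set."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((sorted(train), test))
    return splits

splits = kfold_indices(20, k=10)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 18 2
```

Each of the 10 models is trained on 90% of the samples and evaluated on the held-out 10%, and the reported metrics are averaged over the folds.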
5.1 Regression
This section describes the approaches that were used to build regression models and is orga-
nized according to the second and third model type subdivisions, as described above. The
goal of regression models is to predict one or more axis values in the PAD emotion model. To
evaluate the different models, a number of error measures were used. These are the R2 score,
the mean absolute error (MAE), the mean squared error (MSE) and the explained variance
score (EVS). The R2 score is the most important metric, as it indicates how much better the
model performs compared to always predicting the mean value. The EVS is very similar to the
R2 score, while the MAE and MSE present a more natural interpretation of the model
performance.
5.1.1 General Dynamics Models
First, the general dynamics regression models are described. These are models that are built
using keystroke dynamics data, mouse dynamics data and contextual data of all participants.
Different approaches were used to build multiple models. Three degrees of freedom were used.
The first is whether contextual data, i.e. weather data, is used. This yields different results
as some participants did not have their location services enabled which caused their data not
to contain weather information. As the random forest implementation that is used is not
capable of dealing with missing data, this means that all samples that do not contain the
weather information need to be removed if this information needs to be taken into account.
The second degree of freedom is whether a model is built for each separate dimension of the
PAD model or whether one model is built to predict the entire PAD-tuple (joint). The third
degree of freedom that was used is whether feature selection was performed or not. To perform
feature selection, a random forest model was built using all features and then the 20 most
important features were extracted. Then the model for the actual regression tasks was built
using only these features. In total, 8 different random forest models for regression1 were built
and evaluated. The results can be found in Table 5.1. The best result is obtained with the
combination of including weather data, using the joint PAD space and using feature selection
(bold). This model was built using 630 labeled samples. The metrics that are presented were
calculated for each dimension of the PAD space and then averaged to obtain a result for the
entire model. Note that the best model obtains an R2 score of 0.1766, which is only slightly
better than always predicting the mean. The average MAE is 17.66 and the average of the
standard deviation on these MAE over the different dimensions is 11.767. This indicates that
this model does not perform very well.
1. Random Forest Regressor: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
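The feature selection step can be sketched as follows, assuming the importance scores come from a fitted random forest's feature_importances_ attribute; the function name and example values are hypothetical.

```python
def top_k_features(importances, k=20):
    """Indices of the k most important features, as read from a fitted
    random forest's feature_importances_ array; the regression model is
    then retrained on these columns only."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])

# Five hypothetical importance scores, keep the best two:
importances = [0.01, 0.20, 0.05, 0.30, 0.02]
print(top_k_features(importances, k=2))  # [1, 3]
```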
Table 5.1: General dynamics regression models results
Weather | PAD | Features | R2 | MAE | MSE | EVS
No | Joint | All | 0.143689528 | 17.78790994 | 453.023691 | 0.144182514
No | Joint | Best 20 | 0.161537472 | 17.54150949 | 443.7466007 | 0.161826172
No | Separate | All | 0.144214673 | 17.729484 | 452.6435168 | 0.144586658
No | Separate | Best 20 | 0.146617402 | 17.66438006 | 451.4649129 | 0.146919884
Yes | Joint | All | 0.14692657 | 18.16206667 | 469.8446163 | 0.14715312
Yes | Joint | Best 20 | 0.176618405 | 17.66490053 | 453.380924 | 0.17674802
Yes | Separate | All | 0.153372875 | 18.01380741 | 466.3389916 | 0.153531582
Yes | Separate | Best 20 | 0.163933007 | 17.85844233 | 460.7601332 | 0.164102493
5.1.2 Individual Dynamics Models
The individual dynamics regression models are very similar to the general dynamics regression
models except that they are built using data of only one participant at a time instead of
using data of all participants. Individual models were built for 8 participants, as only
participants who provided at least 59 samples were included, so that the number of samples
was at least equal to the number of possible features. The same degrees of freedom were used
to build multiple variants of the models and the evaluation results can be found in Table 5.2.
Again, the best result on average is obtained with the combination using the joint PAD space
and using feature selection, but now without including weather data. The metrics that are
presented were calculated for each participant and for each dimension of the PAD space and
then averaged to obtain a result for the entire model. Note that the best averaged model
(bold) obtains an R2 score of 0.048, which is not significantly better than always predicting the
mean. The average MAE is 16.19 and the average of the standard deviation on these MAE
over the different participants and dimensions is 10.55. This again indicates that this model
does not perform very well. The fact that not including weather data now yields a better
result can probably be explained by the fact that only 3 participants provided more than 59
samples that contained weather information, as can be seen in Table 5.3.
5.1.3 General Text Content Models
To build general text content regression models, the text content feature data of all partici-
pants is used to train a SVM for regression2. There were no degrees of freedom in these models
as the SVM implementation that was used is not capable of predicting the entire PAD-tuple,
so different models for each dimension were built. Furthermore, all textual content features
are equally important by definition so there can be no degrees of freedom here. The detailed
evaluation result can be found in Table 5.4. This model was built using 1177 labeled samples.
2. Epsilon-Support Vector Regression: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
Table 5.2: Individual dynamics regression models results
Weather | PAD | Features | R2 | MAE | MSE | EVS
No | Joint | All | 0.008913228 | 16.65004196 | 411.5193209 | 0.009823612
No | Joint | Best 20 | 0.048052021 | 16.18678528 | 394.3835932 | 0.048783084
No | Separate | All | 0.003724599 | 16.63586158 | 414.2228234 | 0.004583854
No | Separate | Best 20 | 0.027230397 | 16.34792232 | 404.0040023 | 0.027999946
Yes | Joint | All | -0.046333662 | 17.7818626 | 478.6445918 | -0.045494928
Yes | Joint | Best 20 | -0.011577862 | 17.28880378 | 459.9817373 | -0.011153212
Yes | Separate | All | -0.05383741 | 17.87304252 | 482.1243925 | -0.052686205
Yes | Separate | Best 20 | -0.023891601 | 17.42099593 | 464.6577326 | -0.022783861
Table 5.3: Number of samples per participant in individual dynamics models (both regression and
classification)
User | Samples (weather data excluded) | Samples (weather data included)
1 71 0
2 79 57
3 82 82
4 249 157
5 118 39
6 66 0
7 93 0
8 230 156
Total 988 491
The table presents the metrics that were calculated for each dimension of the PAD space and
also the averaged metrics that are used to obtain a result for the entire model. Note that an
R2 score of only 0.036 is achieved, which is only slightly better than always predicting the
mean. The average MAE is 19.69 and the average of the standard deviation on these MAE
over the different dimensions is 10.89. This indicates that this model does not perform very
well.
5.1.4 Individual Text Content Models
The individual text content regression models are very similar to the general text content
regression models but are built using data of only one participant at a time instead of using
data of all participants. Individual models were built for 13 participants. The evaluation
results can be found in Table 5.5. The table presents the metrics that were calculated for
each participant and for each dimension of the PAD space and also the averaged metrics
that were used to obtain a result for the entire model. Note that an averaged R2-score of
-0.05 is obtained, which is worse than always predicting the mean. The average MAE is
17.63 and the average of the standard deviation on these MAE over the different participants
and dimensions is 10.71. This indicates that this model performs very badly. The number of
samples used for each participant can be seen in Table 5.6.
5.1.5 Fuzzy Logic
Predicted PAD axis values need to be interpreted as an emotional state, or better yet, a
weighted combination of multiple emotional states. This requires mapping discrete emotional
states onto the PAD space. Different mappings have been proposed and empirical approaches
to building such mappings have been taken [31]. When we assume that a person experiences
a weighted combination of multiple emotions, fuzzy logic can be used to determine the extent
Table 5.4: General text content regression models results
R2 0.035591173
MAE 19.69351149
MSE 510.2788305
EVS 0.04181899
P-R2 0.012589376
P-MAE 19.66656626
P-MSE 484.3517047
P-EVS 0.026537171
A-R2 0.025445702
A-MAE 21.57589911
A-MSE 568.0424353
A-EVS 0.029893784
D-R2 0.06873844
D-MAE 17.83806909
D-MSE 478.4423514
D-EVS 0.069026015
Table 5.5: Individual text content regression models results
R2 -0.050807597
MAE 17.63898515
MSE 456.5681924
EVS -0.034470865
P-R2 -0.037283712
P-MAE 17.88843121
P-MSE 455.8425914
P-EVS -0.026363772
A-R2 -0.057245424
A-MAE 19.62208148
A-MSE 516.8555083
A-EVS -0.040458861
D-R2 -0.057893653
D-MAE 15.40644276
D-MSE 397.0064775
D-EVS -0.036589962
Table 5.6: Number of samples per participant in individual text content models (both regression and
classification)
User Amount of samples
1 31
2 18
3 249
4 93
5 33
6 230
7 118
8 41
9 66
10 79
11 82
12 71
13 41
Total 1152
to which each emotion is present, given PAD axis values. This can be done by defining a set
of fuzzy rules that define the relationship between the PAD axis values and each emotion.
For example, if the mapping, presented in Table 5.7, is used, the following fuzzy rules can be
defined:
1. IF pleasure IS negative AND arousal IS negative AND dominance IS negative
THEN bored IS present
2. IF pleasure IS negative AND arousal IS negative AND dominance IS positive
THEN disdainful IS present
3. IF pleasure IS negative AND arousal IS positive AND dominance IS negative
THEN anxious IS present
4. IF pleasure IS negative AND arousal IS positive AND dominance IS positive
THEN hostile IS present
5. IF pleasure IS positive AND arousal IS negative AND dominance IS negative
THEN docile IS present
6. IF pleasure IS positive AND arousal IS negative AND dominance IS positive
THEN relaxed IS present
7. IF pleasure IS positive AND arousal IS positive AND dominance IS negative
THEN dependent IS present
8. IF pleasure IS positive AND arousal IS positive AND dominance IS positive
THEN exuberant IS present
The fuzzy terms ’negative’, ’positive’ and ’present’ can then be defined by membership
functions that take values between 0 and 1. Example membership functions for these terms
are presented in Figure 5.1. The IS-operator calculates the function value for the variable
on its left-hand side using the membership function for the fuzzy term on its right-hand
side. The AND-operator can be implemented using a so-called t-norm; for combining
multiple rules concerning the same fuzzy terms, as well as for defuzzification, multiple
approaches are possible [67]. Applying such fuzzy rules to the predicted PAD axis values results in a set
of membership values for each emotion that can be interpreted as the extent to which each of
these emotions are present in the current emotional state of a person. Using fuzzy logic has
the advantage of reducing the importance of the accuracy of the predicted PAD axis values:
less accurate values can still result in a correct conclusion concerning the dominant emotion,
because fuzzy logic can take into account the inherent fuzziness
of the emotion information. Furthermore, it is also possible to draw conclusions about which
emotions are likely to occur at the same time.
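The rule evaluation can be sketched as follows for rule 6. The piecewise-linear membership shapes and the 30/70 breakpoints are assumptions based on Figure 5.1, and the minimum is used as t-norm, which is one common choice.

```python
def membership_negative(x):
    """Piecewise-linear membership for 'negative' on a 0-100 PAD axis;
    the 30/70 breakpoints are an assumed reading of Figure 5.1."""
    if x <= 30:
        return 1.0
    if x >= 70:
        return 0.0
    return (70 - x) / 40.0

def membership_positive(x):
    # Complementary shape for 'positive'.
    return 1.0 - membership_negative(x)

def rule_and(*degrees):
    """Minimum t-norm for the AND-operator (one common choice)."""
    return min(degrees)

# Rule 6: IF pleasure IS positive AND arousal IS negative
#         AND dominance IS positive THEN relaxed IS present
p, a, d = 80.0, 20.0, 60.0
relaxed = rule_and(membership_positive(p), membership_negative(a),
                   membership_positive(d))
print(relaxed)  # 0.75
```

Evaluating all eight rules in the same way yields a membership value per emotion, interpretable as the extent to which each emotion is present.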
5.2 Classification
This section describes the approaches that were used to build classification models and is again
organized according to the second and third model type subdivisions, as described above. To
be able to perform classification, the PAD emotion model first needs to be divided into a
number of different classes. Three different approaches were used to divide the PAD emotion
model in different classes. The first approach uses the k-means clustering algorithm to assign
all samples that belong to the same cluster to the same class so that k classes are obtained.
The value of k was chosen to be 8 to divide the entire PAD space into different classes. This
choice was made based on the fact that the PAD space is three-dimensional and thus contains
8 octants. The value of k was chosen to be 2 to divide only one PAD dimension into different
Table 5.7: PAD mapping according to octants
PAD octant Emotion
P-A-D- Bored
P-A-D+ Disdainful
P-A+D- Anxious
P-A+D+ Hostile
P+A-D- Docile
P+A-D+ Relaxed
P+A+D- Dependent
P+A+D+ Exuberant
Figure 5.1: Membership functions ((a) terms ’negative’ and ’positive’; (b) term ’present’)
classes. The second approach splits each PAD dimension into two parts (a negative and a
positive part, respectively having values between 0 and 50 and between 50 and 100). This
way, two classes are obtained per dimension or 2³ = 8 classes for the entire PAD space. The
third approach is similar to the second one but splits each PAD dimension into three parts (a
negative, a neutral and a positive part, respectively having values between 0 and 40, between
40 and 60 and between 60 and 100). This yields three classes per dimension or 3³ = 27
classes for the entire PAD space. The goal of classification models is then to predict the
correct class for each sample. To evaluate the different models, a number of error measures
were used. These are the weighted precision, weighted recall, weighted F1-score (incorporates
both precision and recall) and weighted ROC AUC score (area under the receiver operating
characteristic curve). Since the classes are highly imbalanced in both label sets, the accuracy
score is not included. For example, if the classifier labels all the samples as the majority class,
a good accuracy score would still be achieved, though in fact the classifier was not able to
learn the underlying pattern of the data. The confusion matrices will also be presented for
the best results.
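The second and third class-division approaches translate into simple threshold functions; how boundary values such as exactly 50 are assigned is an assumption here, as the text leaves it open, and the function names are hypothetical.

```python
def pad_octant_class(p, a, d, split=50.0):
    """Second approach: split each PAD axis at 50 into negative/positive,
    giving 2**3 = 8 classes for the joint space (values of exactly 50 are
    assigned to the positive side here; the text leaves this open)."""
    bits = [v >= split for v in (p, a, d)]
    return sum(b << i for i, b in enumerate(bits))

def pad_ternary_class(p, a, d):
    """Third approach: negative (< 40), neutral (40-60) and positive (> 60)
    per axis, giving 3**3 = 27 classes for the joint space."""
    def part(v):
        return 0 if v < 40 else (1 if v <= 60 else 2)
    return part(p) * 9 + part(a) * 3 + part(d)

print(pad_octant_class(80, 20, 60))   # P positive, A negative, D positive
print(pad_ternary_class(80, 20, 50))  # P positive, A negative, D neutral
```

The first approach, by contrast, derives the classes from k-means cluster assignments instead of fixed thresholds.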
5.2.1 General Dynamics Models
Different approaches were used to build multiple general dynamics classification models. Six
degrees of freedom were used. The first is whether contextual data, i.e. weather data, is
used, as described above. The second degree of freedom is again whether a model is built
for each separate dimension of the PAD model or whether one model is built to predict the
entire PAD-tuple. The third degree of freedom is whether the class system is based on the k-
means clusters, on positive/negative scales or on positive/neutral/negative scales.
The fourth degree of freedom is whether no class
weight balancing is performed, general class weight balancing is performed or subsample
class weight balancing is performed. General class weight balancing associates weights with
classes that are inversely proportional to class frequencies in the input data. Subsample
class weight balancing essentially does the same thing but adjusts the class weights according
to the class frequencies in the bootstrap sample for each tree grown. The fifth degree of
freedom is whether or not subsampling is performed in order to reduce bias due to class skew.
Subsampling removes random samples from each class that contains more samples than the
class that occurs least frequently until every class contains an equal number of samples. The
sixth degree of freedom that was used is whether feature selection was performed or not.
The feature selection process is again done using the feature importance deduced from a
random forest model. In total, 96 different random forest models for classification3 were
built and evaluated. The results can be found in Table 5.8. The metrics that are presented
were calculated for the entire PAD-tuple when it concerns models that use the joint PAD
space. When it concerns models that use separated PAD dimensions, they were calculated
for each dimension of the PAD space and then averaged to obtain a result for the entire
model. The ROC AUC score indicates that the best result is obtained with the combination
of not including weather data, using the joint PAD space that is divided into classes using
k-means clustering (with k = 8), not using class weight balancing, using subsampling and
using feature selection (bold). This model was built using 704 labeled samples and achieves a
ROC AUC score of 0.75 and an F1 score of 0.57 (± 0.058). The corresponding confusion matrix
is presented in Table 5.9. However, note that all models built on separate PAD dimensions
using classes determined by the k-means clusters (with k = 2) or by the positive/negative
class definition, no class weight balancing and subsampling (underlined) have similar ROC
AUC scores to this model but also have much higher precision, recall and F1 scores. So these
models actually perform better.
3. Random Forest Classifier: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
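The subsampling step described above can be sketched as follows; the function name and the fixed random seed are hypothetical.

```python
import random

def subsample_balance(samples, labels, seed=42):
    """Randomly drop samples from over-represented classes until every
    class has as many samples as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    n_min = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        for s in rng.sample(group, n_min):  # keep n_min random samples
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

samples = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
bal_s, bal_y = subsample_balance(samples, labels)
print(sorted(bal_y))  # ['a', 'a', 'a', 'b', 'b', 'b']
```

Class weight balancing, by contrast, keeps all samples and instead reweights the classes inversely to their frequencies.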
Table 5.8: General dynamics classification models results (CWB = class weight balancing, SS =
subsampling)
Weather | PAD | Classes | CWB | SS | Features | Precision | Recall | F1 | ROC AUC
Weather = No, PAD = Joint, Classes = K-means (k=8)
General | No | All | 0.32 | 0.336 | 0.309 | 0.602
General | No | Best 20 | 0.301 | 0.318 | 0.292 | 0.591
Subsample | No | All | 0.312 | 0.333 | 0.305 | 0.6
Subsample | No | Best 20 | 0.293 | 0.308 | 0.284 | 0.586
No | No | All | 0.318 | 0.335 | 0.304 | 0.6
No | No | Best 20 | 0.31 | 0.329 | 0.302 | 0.597
No | Yes | All | 0.566 | 0.561 | 0.562 | 0.749
No | Yes | Best 20 | 0.573 | 0.568 | 0.57 | 0.753
Weather = No, PAD = Joint, Classes = +/-
General | No | All | 0.367 | 0.415 | 0.327 | 0.559
General | No | Best 20 | 0.394 | 0.424 | 0.333 | 0.565
Subsample | No | All | 0.362 | 0.41 | 0.321 | 0.555
Subsample | No | Best 20 | 0.397 | 0.42 | 0.329 | 0.561
No | No | All | 0.35 | 0.412 | 0.334 | 0.565
No | No | Best 20 | 0.362 | 0.41 | 0.33 | 0.563
No | Yes | All | 0.4 | 0.409 | 0.401 | 0.662
No | Yes | Best 20 | 0.4 | 0.412 | 0.404 | 0.664
Weather = No, PAD = Joint, Classes = +/0/-
General | No | All | 0.236 | 0.344 | 0.219 | 0.534
General | No | Best 20 | 0.239 | 0.343 | 0.22 | 0.534
Subsample | No | All | 0.223 | 0.341 | 0.215 | 0.531
Subsample | No | Best 20 | 0.238 | 0.342 | 0.22 | 0.533
No | No | All | 0.261 | 0.346 | 0.234 | 0.544
No | No | Best 20 | 0.239 | 0.347 | 0.236 | 0.544
No | Yes | All | 0.031 | 0.074 | 0.043 | 0.519
No | Yes | Best 20 | 0.033 | 0.074 | 0.046 | 0.519
Weather = No, PAD = Separate, Classes = K-means (k=2)
General | No | All | 0.665 | 0.649 | 0.633 | 0.606
General | No | Best 20 | 0.68 | 0.665 | 0.648 | 0.617
Subsample | No | All | 0.669 | 0.655 | 0.639 | 0.611
Subsample | No | Best 20 | 0.677 | 0.674 | 0.653 | 0.617
No | No | All | 0.671 | 0.656 | 0.645 | 0.614
No | No | Best 20 | 0.671 | 0.67 | 0.653 | 0.618
No | Yes | All | 0.722 | 0.735 | 0.728 | 0.726
No | Yes | Best 20 | 0.702 | 0.718 | 0.71 | 0.706
Weather = No, PAD = Separate, Classes = +/-
General | No | All | 0.694 | 0.858 | 0.766 | 0.612
General | No | Best 20 | 0.696 | 0.863 | 0.77 | 0.616
Subsample | No | All | 0.695 | 0.857 | 0.767 | 0.614
Subsample | No | Best 20 | 0.696 | 0.866 | 0.772 | 0.617
No | No | All | 0.696 | 0.84 | 0.761 | 0.613
No | No | Best 20 | 0.697 | 0.846 | 0.764 | 0.616
No | Yes | All | 0.721 | 0.731 | 0.726 | 0.724
No | Yes | Best 20 | 0.716 | 0.728 | 0.722 | 0.719
Weather = No, PAD = Separate, Classes = +/0/-
General | No | All | 0.583 | 0.627 | 0.555 | 0.565
General | No | Best 20 | 0.594 | 0.634 | 0.563 | 0.571
Subsample | No | All | 0.592 | 0.632 | 0.561 | 0.571
Subsample | No | Best 20 | 0.61 | 0.637 | 0.568 | 0.575
No | No | All | 0.586 | 0.628 | 0.566 | 0.575
No | No | Best 20 | 0.606 | 0.639 | 0.579 | 0.586
No | Yes | All | 0.586 | 0.586 | 0.586 | 0.69
No | Yes | Best 20 | 0.54 | 0.541 | 0.54 | 0.656
Weather = Yes, PAD = Joint, Classes = K-means (k=8)
General | No | All | 0.311 | 0.319 | 0.303 | 0.597
General | No | Best 20 | 0.319 | 0.325 | 0.314 | 0.602
Subsample | No | All | 0.326 | 0.335 | 0.322 | 0.608
Subsample | No | Best 20 | 0.294 | 0.308 | 0.293 | 0.592
No | No | All | 0.332 | 0.333 | 0.317 | 0.605
No | No | Best 20 | 0.347 | 0.359 | 0.341 | 0.621
No | Yes | All | 0.517 | 0.518 | 0.515 | 0.725
No | Yes | Best 20 | 0.522 | 0.527 | 0.52 | 0.73
Weather = Yes, PAD = Joint, Classes = +/-
General | No | All | 0.397 | 0.41 | 0.324 | 0.553
General | No | Best 20 | 0.35 | 0.397 | 0.315 | 0.551
Subsample | No | All | 0.395 | 0.422 | 0.332 | 0.561
Subsample | No | Best 20 | 0.374 | 0.413 | 0.333 | 0.562
No | No | All | 0.369 | 0.403 | 0.329 | 0.557
No | No | Best 20 | 0.407 | 0.419 | 0.353 | 0.572
No | Yes | All | 0.266 | 0.288 | 0.27 | 0.593
No | Yes | Best 20 | 0.321 | 0.327 | 0.317 | 0.615
Weather = Yes, PAD = Joint, Classes = +/0/-
General | No | All | 0.281 | 0.351 | 0.221 | 0.532
General | No | Best 20 | 0.236 | 0.343 | 0.218 | 0.529
Subsample | No | All | 0.266 | 0.344 | 0.215 | 0.527
Subsample | No | Best 20 | 0.262 | 0.34 | 0.217 | 0.526
No | No | All | 0.24 | 0.341 | 0.231 | 0.535
No | No | Best 20 | 0.243 | 0.333 | 0.229 | 0.53
No | Yes | All | 0 | 0 | 0 | 0.481
No | Yes | Best 20 | 0 | 0 | 0 | 0.481
Weather = Yes, PAD = Separate, Classes = K-means (k=2)
General | No | All | 0.613 | 0.425 | 0.475 | 0.597
General | No | Best 20 | 0.629 | 0.458 | 0.509 | 0.618
Subsample | No | All | 0.603 | 0.416 | 0.465 | 0.591
Subsample | No | Best 20 | 0.638 | 0.449 | 0.503 | 0.611
No | No | All | 0.595 | 0.448 | 0.493 | 0.602
No | No | Best 20 | 0.621 | 0.472 | 0.521 | 0.621
No | Yes | All | 0.719 | 0.704 | 0.711 | 0.714
No | Yes | Best 20 | 0.735 | 0.711 | 0.722 | 0.727
Weather = Yes, PAD = Separate, Classes = +/-
General | No | All | 0.68 | 0.843 | 0.752 | 0.604
General | No | Best 20 | 0.691 | 0.84 | 0.758 | 0.616
Subsample | No | All | 0.676 | 0.837 | 0.747 | 0.597
Subsample | No | Best 20 | 0.679 | 0.83 | 0.746 | 0.599
No | No | All | 0.677 | 0.819 | 0.741 | 0.596
No | No | Best 20 | 0.686 | 0.82 | 0.747 | 0.607
No | Yes | All | 0.716 | 0.729 | 0.722 | 0.72
No | Yes | Best 20 | 0.693 | 0.716 | 0.704 | 0.699
Weather = Yes, PAD = Separate, Classes = +/0/-
General | No | All | 0.578 | 0.629 | 0.56 | 0.567
General | No | Best 20 | 0.59 | 0.637 | 0.576 | 0.58
Subsample | No | All | 0.582 | 0.63 | 0.56 | 0.567
Subsample | No | Best 20 | 0.599 | 0.642 | 0.584 | 0.588
No | No | All | 0.581 | 0.631 | 0.57 | 0.576
No | No | Best 20 | 0.587 | 0.636 | 0.584 | 0.592
No | Yes | All | 0.544 | 0.542 | 0.542 | 0.656
No | Yes | Best 20 | 0.591 | 0.59 | 0.59 | 0.692
5.2.2 Individual Dynamics Models
For the individual dynamics classification models, the same degrees of freedom as for the
general dynamics classification models were used to build multiple model variants. They were
built for the same 8 participants as in the individual dynamics regression models. The
evaluation results can be found in Table 5.10. For models that use the joint PAD space, the
presented metrics were calculated for each participant over the entire PAD-tuple; for models
that use separated PAD dimensions, they were calculated for each dimension of the PAD space
and then averaged to obtain a result for the entire model. Note that the results for models
that use the joint PAD space in combination with the positive/neutral/negative class
definition were left out: this class definition leads to 27 possible classes, none of the
participants' data contained each of these classes, and some of the measures are therefore
undefined. The ROC AUC score indicates that the best result on average is obtained with the
combination of including the weather data, using the separated PAD space in which each axis
is divided into a positive, negative and neutral class, not using class weight balancing,
using subsampling and using feature selection (bold). This is quite a different model from
the best one found in general dynamics classification. It obtains an average ROC AUC score
of 0.677 and an F1 score of 0.564 (± 0.271). However, when the models that perform slightly
worse in terms of ROC AUC score but much better in terms of precision, recall and F1 scores
in general dynamics classification are examined again, the same phenomenon is found for the
individual models: all models using the separated PAD space with classes determined by the
k-means clusters (with k = 2) or by the positive/negative class definition, no class weight
balancing and subsampling (underlined) have slightly lower ROC AUC scores but much higher
precision, recall and F1 scores. This is exactly the same family of models that was found
in general dynamics classification.

Table 5.9: Confusion matrix for the best general dynamics classification model

     1       2       3       4       5       6       7       8
1    55.68%  5.68%   5.68%   7.92%   5.68%   14.8%   3.44%   1.12%
2    7.92%   54.56%  4.56%   2.24%   10.24%  3.44%   6.8%    10.24%
3    7.92%   7.92%   40.88%  5.68%   13.6%   3.44%   7.92%   12.48%
4    4.56%   7.92%   1.12%   65.92%  6.8%    2.24%   5.68%   5.68%
5    6.8%    5.68%   6.8%    3.44%   62.48%  10.24%  4.56%   0%
6    3.44%   1.12%   3.44%   2.24%   6.8%    77.28%  2.24%   3.44%
7    6.8%    9.12%   4.56%   7.92%   4.56%   6.8%    51.12%  9.12%
8    5.68%   6.8%    12.48%  4.56%   3.44%   2.24%   12.48%  52.24%
Table 5.10: Individual dynamics classification models results (CWB = class weight balancing,
SS = subsampling)

CWB        SS   Features  Precision  Recall  F1     ROC AUC

Weather = No, PAD = Joint, Classes = K-means (k=8)
General    No   All       0.172      0.212   0.181  0.513
General    No   Best 20   0.202      0.234   0.208  0.528
Subsample  No   All       0.186      0.218   0.190  0.516
Subsample  No   Best 20   0.209      0.234   0.210  0.528
No         No   All       0.172      0.224   0.184  0.518
No         No   Best 20   0.212      0.251   0.219  0.537
No         Yes  All       0.231      0.272   0.244  0.584
No         Yes  Best 20   0.262      0.285   0.265  0.592

Weather = No, PAD = Joint, Classes = +/-
General    No   All       0.277      0.407   0.309  0.522
General    No   Best 20   0.279      0.397   0.304  0.518
Subsample  No   All       0.256      0.390   0.293  0.511
Subsample  No   Best 20   0.268      0.395   0.306  0.517
No         No   All       0.276      0.405   0.312  0.521
No         No   Best 20   0.319      0.448   0.352  0.552
No         Yes  All       0.170      0.188   0.173  0.536
No         Yes  Best 20   0.221      0.219   0.212  0.554

Weather = No, PAD = Separate, Classes = K-means (k=2)
General    No   All       0.471      0.389   0.404  0.544
General    No   Best 20   0.585      0.463   0.493  0.598
Subsample  No   All       0.468      0.388   0.402  0.542
Subsample  No   Best 20   0.584      0.469   0.496  0.603
No         No   All       0.492      0.405   0.420  0.543
No         No   Best 20   0.561      0.461   0.488  0.593
No         Yes  All       0.658      0.633   0.644  0.648
No         Yes  Best 20   0.643      0.617   0.629  0.639

Weather = No, PAD = Separate, Classes = +/-
General    No   All       0.650      0.677   0.637  0.533
General    No   Best 20   0.701      0.711   0.682  0.580
Subsample  No   All       0.656      0.673   0.634  0.533
Subsample  No   Best 20   0.697      0.713   0.686  0.583
No         No   All       0.625      0.666   0.631  0.535
No         No   Best 20   0.686      0.706   0.680  0.583
No         Yes  All       0.653      0.664   0.657  0.653
No         Yes  Best 20   0.620      0.642   0.629  0.624

Weather = No, PAD = Separate, Classes = +/0/-
General    No   All       0.508      0.595   0.533  0.519
General    No   Best 20   0.557      0.625   0.570  0.549
Subsample  No   All       0.496      0.594   0.529  0.517
Subsample  No   Best 20   0.549      0.619   0.565  0.546
No         No   All       0.514      0.591   0.537  0.523
No         No   Best 20   0.559      0.623   0.575  0.556
No         Yes  All       0.441      0.443   0.435  0.582
No         Yes  Best 20   0.513      0.511   0.507  0.633

Weather = Yes, PAD = Joint, Classes = K-means (k=8)
General    No   All       0.179      0.234   0.194  0.513
General    No   Best 20   0.201      0.254   0.218  0.526
Subsample  No   All       0.168      0.226   0.186  0.506
Subsample  No   Best 20   0.215      0.264   0.228  0.531
No         No   All       0.162      0.228   0.182  0.505
No         No   Best 20   0.191      0.254   0.212  0.523
No         Yes  All       0.271      0.280   0.265  0.588
No         Yes  Best 20   0.286      0.308   0.291  0.604

Weather = Yes, PAD = Joint, Classes = +/-
General    No   All       0.313      0.404   0.337  0.556
General    No   Best 20   0.298      0.379   0.321  0.540
Subsample  No   All       0.323      0.416   0.347  0.562
Subsample  No   Best 20   0.300      0.382   0.325  0.542
No         No   All       0.314      0.407   0.336  0.559
No         No   Best 20   0.339      0.431   0.369  0.581
No         Yes  All       0.229      0.297   0.253  0.598
No         Yes  Best 20   0.242      0.313   0.269  0.607

Weather = Yes, PAD = Separate, Classes = K-means (k=2)
General    No   All       0.449      0.293   0.324  0.530
General    No   Best 20   0.499      0.349   0.385  0.569
Subsample  No   All       0.469      0.296   0.330  0.532
Subsample  No   Best 20   0.478      0.342   0.374  0.567
No         No   All       0.411      0.313   0.337  0.531
No         No   Best 20   0.483      0.351   0.386  0.560
No         Yes  All       0.671      0.635   0.649  0.660
No         Yes  Best 20   0.651      0.653   0.650  0.646

Weather = Yes, PAD = Separate, Classes = +/-
General    No   All       0.609      0.631   0.586  0.530
General    No   Best 20   0.632      0.649   0.619  0.573
Subsample  No   All       0.621      0.624   0.585  0.530
Subsample  No   Best 20   0.620      0.642   0.609  0.570
No         No   All       0.570      0.631   0.586  0.529
No         No   Best 20   0.652      0.660   0.631  0.567
No         Yes  All       0.674      0.679   0.674  0.671
No         Yes  Best 20   0.668      0.680   0.672  0.673

Weather = Yes, PAD = Separate, Classes = +/0/-
General    No   All       0.532      0.615   0.549  0.523
General    No   Best 20   0.578      0.627   0.579  0.545
Subsample  No   All       0.538      0.618   0.550  0.524
Subsample  No   Best 20   0.579      0.627   0.577  0.543
No         No   All       0.518      0.602   0.547  0.520
No         No   Best 20   0.568      0.632   0.584  0.559
No         Yes  All       0.461      0.467   0.459  0.600
No         Yes  Best 20   0.571      0.569   0.564  0.677
5.2.3 General Text Content Models
To build general text content classification models, the text content feature data of all
participants was used to train an SVM for classification4. These models have four degrees of
freedom. The first is again whether a model is built for each separate dimension of the PAD
model or whether one model is built to predict the entire PAD-tuple. The second is the type
of class system that is used. The third is the type of class weight balancing that is used.
The fourth is whether or not subsampling is performed. In total, 18 different SVM
classification models were built and evaluated. The evaluation results can be found in
Table 5.11. For models that use the joint PAD space, the presented metrics were calculated
for the entire PAD-tuple; for models that use separated PAD dimensions, they were calculated
for each dimension of the PAD space and then averaged to obtain a result for the entire
model. The ROC AUC score indicates that the best result is obtained with the combination of
using the joint PAD space divided into classes according to the k-means clusters (k = 8),
no class weight balancing and subsampling (bold). This model was built using 704 labeled
samples and achieves a ROC AUC score of 0.69 and an F1 score of 0.46 (± 0.059). The
corresponding confusion matrix is presented in Table 5.12. However, note again that all
models built on separate PAD dimensions, using classes determined by the k-means clusters
(with k = 2) or by the positive/negative class definition, no class weight balancing and
subsampling (underlined), have ROC AUC scores similar to this model but also much higher
precision, recall and F1 scores. Again one could argue that these models actually perform
better.
5.2.4 Individual Text Content Models
For the individual text content classification models, the same degrees of freedom as for
the general text content classification models were used to build multiple model variants.
They were built for the same 13 participants as in the individual text content regression
models. The evaluation results can be found in Table 5.13. For models that use the joint
PAD space, the presented metrics were calculated for each participant over the entire
PAD-tuple; for models that use separated PAD dimensions, they were calculated for each
dimension of the PAD space and then averaged to obtain a result for the entire model. Note
again that the results for models that use the joint PAD space in combination with the
positive/neutral/negative class definition were left out, for the same reasons as in the
individual dynamics classification models. The ROC AUC score indicates that the best result
on average is obtained with the combination of using the separated PAD space in which the
classes are defined according to the k-means clusters (with k = 2), no class weight
balancing and subsampling (bold). An average ROC AUC score of
4C-Support Vector Classification: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Table 5.11: General text content classification models results (CWB = class weight
balancing, SS = subsampling)

PAD       Classes        CWB      SS   Precision    Recall       F1           ROC AUC
Joint     K-means (k=8)  General  No   0.268019722  0.275276126  0.266816902  0.57673234
Joint     K-means (k=8)  No       No   0.307464182  0.289719626  0.221705095  0.561704624
Joint     K-means (k=8)  No       Yes  0.45753873   0.458806818  0.456671627  0.690746753
Joint     +/-            General  No   0.333963658  0.283772302  0.293695074  0.56930125
Joint     +/-            No       No   0.387398865  0.400169924  0.263543947  0.525877128
Joint     +/-            No       Yes  0.333055677  0.31402439   0.320677112  0.608013937
Joint     +/0/-          General  No   0.253132625  0.143585387  0.149888959  0.538395204
Joint     +/0/-          No       No   0.240808258  0.351741716  0.205233201  0.525977603
Joint     +/0/-          No       Yes  0.037037037  0.037037037  0.037037037  0.5
Separate  K-means (k=2)  General  No   0.624411948  0.601834804  0.612769593  0.614829803
Separate  K-means (k=2)  No       No   0.633501525  0.624903002  0.583944684  0.57153326
Separate  K-means (k=2)  No       Yes  0.68553396   0.662304757  0.672002906  0.677553096
Separate  +/-            General  No   0.710538281  0.70903735   0.709616053  0.615116742
Separate  +/-            No       No   0.667320598  0.875335615  0.756375021  0.570957361
Separate  +/-            No       Yes  0.664575639  0.692656261  0.677612159  0.671396258
Separate  +/0/-          General  No   0.574958879  0.553125876  0.561861179  0.610086922
Separate  +/0/-          No       No   0.552407486  0.596860107  0.535415923  0.568419015
Separate  +/0/-          No       Yes  0.528597285  0.514981702  0.516663227  0.636236276
Table 5.12: Confusion matrix for best general text content classification model
1 2 3 4 5 6 7 8
1 50% 17.04% 1.12% 9.12% 10.24% 4.56% 4.56% 3.44%
2 4.72% 51.12% 2.24% 4.56% 13.6% 6.8% 11.36% 5.68%
3 5.68% 11.36% 55.68% 2.24% 6.8% 2.24% 10.24% 5.68%
4 7.92% 15.92% 11.36% 34.08% 5.68% 1.12% 13.6% 10.24%
5 10.24% 21.6% 13.6% 2.24% 34.08% 5.68% 5.68% 6.8%
6 5.68% 12.48% 4.56% 4.56% 13.6% 45.44% 7.92% 5.68%
7 3.44% 18.16% 3.44% 9.12% 6.8% 4.56% 52.24% 2.24%
8 9.12% 22.72% 6.8% 10.24% 17.04% 3.44% 3.44% 27.28%
0.60 and an F1 score of 0.61 (± 0.21) are achieved. This is one of the model families that
also yielded good performance for the general text content classification models. Moreover,
the combination of using the separated PAD space in which each axis is divided into a
positive and a negative class, no class weight balancing and subsampling (underlined) yields
performance very similar to that of the best model. This is again one of the model families
that performed well for the general text content classification models, and it also has
satisfying precision, recall and F1 scores.
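The averaging step used throughout these evaluations for models built on separate PAD
dimensions can be made concrete with a small sketch. The per-dimension scores below are
hypothetical placeholders, not results from this work; only the averaging mechanism is the
point.

```python
# Average per-dimension scores into one score for the whole model,
# as done for models built on separate PAD dimensions.
# The scores below are hypothetical placeholders.
scores = {
    "pleasure":  {"f1": 0.62, "roc_auc": 0.60},
    "arousal":   {"f1": 0.58, "roc_auc": 0.59},
    "dominance": {"f1": 0.63, "roc_auc": 0.61},
}

model_score = {
    metric: sum(dim[metric] for dim in scores.values()) / len(scores)
    for metric in ("f1", "roc_auc")
}
print({metric: round(value, 3) for metric, value in model_score.items()})
```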
Table 5.13: Individual text content classification models results (CWB = class weight
balancing, SS = subsampling)

PAD       Classes        CWB      SS   Precision    Recall       F1           ROC AUC
Joint     K-means (k=8)  General  No   0.2048145    0.221690029  0.200216664  0.532035625
Joint     K-means (k=8)  No       No   0.144275293  0.277016936  0.177469156  0.525336029
Joint     K-means (k=8)  No       Yes  0.246539472  0.196759259  0.208456228  0.541005291
Joint     +/-            General  No   0.273186198  0.282140001  0.264090733  0.51690944
Joint     +/-            No       No   0.261486105  0.39437678   0.282478137  0.497085081
Joint     +/-            No       Yes  0.095089286  0.057291667  0.067899548  0.461309524
Separate  K-means (k=2)  General  No   0.456824801  0.457858561  0.450353145  0.539618275
Separate  K-means (k=2)  No       No   0.385719024  0.380189637  0.357320477  0.540389041
Separate  K-means (k=2)  No       Yes  0.602536765  0.628500278  0.606165789  0.598757394
Separate  +/-            General  No   0.622120277  0.669521918  0.641281107  0.54520671
Separate  +/-            No       No   0.578284116  0.712970634  0.63220738   0.540734206
Separate  +/-            No       Yes  0.586616834  0.618658291  0.596331473  0.594683574
Separate  +/0/-          General  No   0.57161971   0.595287325  0.572264152  0.552319046
Separate  +/0/-          No       No   0.546616105  0.630107456  0.561077139  0.542007658
Separate  +/0/-          No       Yes  0.354593679  0.343848047  0.341829526  0.507886036
Chapter 6
Discussion
This chapter presents additional discussion and result analysis in order to draw more
specific conclusions about which models perform well or poorly and which model aspects have
a large or small influence on this performance. Furthermore, some extra thought is given to
other possibilities for achieving better performance.
6.1 Overall Results
The best overall results are summarized in Table 6.1. The regression models that were built
on the dynamics features did not, in general, yield good performance. Both the general and
the individual models produce relatively high errors on average and do not give predictions
that are significantly better than always predicting the mean value. This does not mean that
regression models cannot be built for this application: the most plausible explanation for
the poor performance is the limited size and the bias of the dataset. A larger dataset
containing more uniformly distributed samples may well yield better regression models. The
models that were built using the text content features did not perform well either. This can
again be explained by the limited size of the dataset: the number of text content features
extracted is, on average, much larger than the number of available samples, resulting in a
very sparse feature space. Furthermore, the quality of the text content itself is very
limited (e.g. dialects, chat language, multiple languages), which also lowers performance.

Table 6.1: Summary of best overall results

                                         R2/        MAE/     MSE/     EVS/
Model type                   # samples   Precision  Recall   F1       ROC AUC
Regression
  General     Dynamics       630          0.177     17.66    453.38    0.177
  General     Text content   1177         0.036     19.69    510.28    0.042
  Individual  Dynamics       N/A          0.048     16.19    394.38    0.049
  Individual  Text content   N/A         -0.051     17.64    456.57   -0.034
Classification
  General     Dynamics       704          0.722     0.735    0.728     0.726
  General     Text content   704          0.686     0.662    0.672     0.678
  Individual  Dynamics       N/A          0.658     0.633    0.644     0.648
  Individual  Text content   N/A          0.603     0.629    0.606     0.599
The classification models produced much better results and a clear pattern: a family of
models that performs very well. For both the general and the individual models, the best
models are those built on separate PAD dimensions, either divided into classes using the
k-means clusters or according to the positive/negative class definition, without class
weight balancing and with subsampling. This is expected, as using the entire PAD-tuple
results in a multi-class problem, which is harder to solve than the three binary
classification problems resulting from using separate PAD dimensions. The fact that the
models using k-means clustering to generate classes perform very similarly to the models
using the positive/negative class definition can be explained by the value distributions of
the PAD axes, shown in Figure 6.1. Two clusters of values can clearly be seen for each PAD
axis, and these are very likely the same clusters that the k-means clustering algorithm
finds. Presumably, these two obvious clusters are caused by the participants' tendency to
always move the sliders while indicating their emotional state, so that the slider ends up
either left or right of the centre but never at the centre itself. These two clusters align
closely with the positive/negative class definition, which means that the two ways of
defining classes yield very similar classes, and therefore very similar results. Indeed,
Table 6.2 presents the clusters generated by the k-means algorithm (with k = 2) and shows
that the generated classes closely match those generated by the positive/negative class
definition. One of the main reasons for the better performance of the classification models
can probably be attributed to the subsampling technique, which solves the class skew
problem, one of the main limitations of the regression models. A similarly intelligent
technique could perhaps be applied to the continuous dataset to improve the regression
models as well.
Table 6.2: Clusters generated by the 2-means algorithm for separated PAD dimensions

PAD dimension  Cluster centroid  # samples
Pleasure       30.61238532       436
               71.90418354       741
Arousal        28.89059501       521
               72.98170732       656
Dominance      32.20266667       375
               73.7319202        802
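The near-equivalence of the 2-means clusters and the positive/negative split can be checked
directly. The sketch below uses synthetic bimodal slider values mimicking the distributions
in Figure 6.1 (the real input is the self-report data); the modes, sizes and threshold are
assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic 0-100 slider values with two modes, mimicking the bimodal
# distributions in Figure 6.1 (participants avoided the centre).
values = np.concatenate([
    rng.normal(30, 8, size=500),
    rng.normal(72, 8, size=700),
]).clip(0, 100).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
pos_neg = (values.ravel() >= 50).astype(int)  # positive/negative class definition

# Align cluster ids with the +/- labels before comparing.
clusters = kmeans.labels_
if kmeans.cluster_centers_.ravel()[0] > kmeans.cluster_centers_.ravel()[1]:
    clusters = 1 - clusters

agreement = (clusters == pos_neg).mean()
print(f"agreement between 2-means and +/- classes: {agreement:.3f}")
```

With two well-separated modes, the two labelings disagree only for the few values falling
between the +/- threshold and the midpoint of the cluster centroids.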
[Figure: three histograms of self-reported values (0-100 scale, binned per 10) per PAD axis]
(a) Distribution of pleasure axis values
(b) Distribution of arousal axis values
(c) Distribution of dominance axis values
Figure 6.1: Distributions of PAD axis values
The effects of applying class weight balancing in the classification models were negligible,
indicating that this technique does not work well here. Finally, the models built using text
content features do not perform very well compared to the models built using dynamics
features. An important reason is probably that the text content contains multiple languages
and, more importantly, dialects and so-called chat language, which results in many different
spellings of the same word. Furthermore, there is no real detection of emotionally charged
words, which could help improve the performance of the text content models.
6.2 Contextual Data
In this research, contextual data was used in the form of weather information. The features
concerning weather information (temperature, pressure, humidity and discomfort index) do
not seem to make much of a difference to model performance. For the regression models,
however, there is a pattern indicating that weather features might be useful. One of the
main problems is that including weather data causes many samples to be removed, which in
turn decreases model performance due to the smaller dataset. At first sight, one might
conclude that weather data does not provide discriminative information to the model. In
classification, however, the decrease in performance due to the removal of samples without
weather data is much smaller and almost non-existent, indicating that weather features might
provide useful information. Thus, due to the limited size of the dataset, the effects of
using contextual data are obfuscated. Further research using larger datasets might give more
insight into the value of contextual data for emotion recognition.
6.3 Feature Selection
Using feature selection does not have a significant impact on the results. This is
consistent with the feature importances provided by the random forest algorithm for the best
general dynamics regression and classification models, shown in Figure 6.2.
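A plausible sketch of the "Best 20" feature-selection variant used in the model tables is
given below: features are ranked by random-forest importances and the top 20 are kept. The
dataset is synthetic and the shapes are assumptions; only the selection mechanism is
illustrated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dynamics feature matrix (shapes are assumed).
X, y = make_classification(n_samples=400, n_features=60, n_informative=8,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank features by importance and keep the 20 best.
best_20 = np.argsort(forest.feature_importances_)[::-1][:20]
X_selected = X[:, best_20]
print(X_selected.shape)
```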
6.4 Subsampling
In the previous chapter it was noted that subsampling improves the classification results
significantly (by almost 20%), because it solves the class skew problem. The downside of
subsampling, however, is that a lot of data is discarded and simply not used by these
models. To show that the subsampling approach is nevertheless valid, a technique called
bootstrap subsampling can be used. Bootstrap subsampling builds multiple bootstrap models,
each using all data of the least represented class (as in normal subsampling) and a
different random subset of the data in the other classes. The average of the scores of the
bootstrap models is then the score for the entire model. This can provide a more reliable
evaluation, as more of the data is used. Table 6.3 presents the results of bootstrap
subsampling applied to the general dynamics classification model that combines not including
weather data, separate PAD dimensions with classes defined by the k-means clusters (with
k = 2), no class weight balancing, subsampling and no feature selection. The downside of
this technique is that it is very time-consuming; it was therefore not applied to all models
built with subsampling.
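The bootstrap subsampling procedure can be sketched as follows. The skewed toy dataset, the
classifier settings and the number of rounds are illustrative assumptions; the procedure
itself (all minority samples plus a fresh random majority subset per round, scores averaged)
is the one described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Skewed two-class toy dataset: 120 minority vs. 600 majority samples.
X = np.vstack([rng.normal(1.0, 1.0, size=(120, 10)),
               rng.normal(-1.0, 1.0, size=(600, 10))])
y = np.array([1] * 120 + [0] * 600)

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

scores = []
for round_ in range(10):
    # Each bootstrap model uses all minority samples plus a fresh random
    # subset of the majority class of the same size.
    subset = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, subset])
    clf = RandomForestClassifier(n_estimators=100, random_state=round_)
    scores.append(cross_val_score(clf, X[idx], y[idx], cv=3,
                                  scoring="roc_auc").mean())

print(f"bootstrap-subsampling ROC AUC: {np.mean(scores):.3f}")
```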
6.5 Prediction Performance for PAD-dimensions
From the general dynamics regression results for models that predict separate PAD
dimensions, it can be concluded that the dominance dimension can be predicted more
accurately than the pleasure and arousal dimensions, according to the R2 scores and MAE
values. It is important, however, to again take the value distribution of this axis into
account (see Figure 6.1). This distribution clearly shows a large bias for the dominance
dimension, which explains the lower MAE values. It does not, however, explain the fact that
the R2 scores are also higher for the dominance dimension than for the other dimensions.
This cannot be explained by the biased value distribution, as the R2 score is relative to
always predicting the mean: the mean of the dominance dimension is equally affected by the
bias, so the bias itself does not produce a higher R2 score. This means that the features
are genuinely more informative for the dominance dimension than for the other dimensions.

The same phenomenon is observed in the general dynamics classification results for models
that use separate PAD dimensions. The dominance dimension achieves significantly higher
precision, recall and F1 scores than the other dimensions, while its ROC AUC scores are in
line with those for the other dimensions.
6.6 Number of Samples
The model performance stays stable when the number of training samples is changed for the
general dynamics regression models. Table 6.4 contains the results, for various training set
sizes, of the general dynamics regression model that combines including weather data, the
joint PAD space and feature selection. The same holds when the individual model results of
different participants are compared. The general dynamics classification model performance
also stays stable when the number of training samples is changed. Table 6.5
contains the results for various training set sizes for the general dynamics classification
model that combines not including weather data, separate PAD dimensions with classes defined
by the k-means clusters (with k = 2), no class weight balancing, subsampling and no feature
selection. The same holds for the individual models. Thus, it can be concluded that the
number of samples does not have a significant influence on model performance; rather, it is
the sample value distributions that influence the results.

Table 6.3: Bootstrap subsampling results for general dynamics classification model

Round    Precision    Recall       F1           ROC AUC
1        0.717585912  0.726206609  0.72175615   0.720000712
2        0.707315101  0.723823526  0.71546539   0.712108198
3        0.718066776  0.739734021  0.728667081  0.724471576
4        0.718576649  0.723644485  0.720651117  0.719759138
5        0.71464299   0.726986909  0.720491328  0.71761579
6        0.729068549  0.741013612  0.734839796  0.732444988
7        0.72155289   0.735463468  0.728329377  0.72554445
8        0.710272011  0.719003418  0.714465444  0.712470473
9        0.724458701  0.72897929   0.7261802    0.725351842
10       0.729446734  0.742026128  0.735546113  0.733145494
Average  0.719098631  0.730688147  0.7246392    0.722291266
6.7 Model Hyperparameters
By optimizing the model hyperparameters, the results can be improved further. For the
dynamics regression models, the best results are obtained with the random forest of the
original 500 trees, but with the maximum number of features used per tree equal to the total
number of available features instead of its square root. For the dynamics classification
models, the best results are obtained with the random forest of the original 500 trees, but
with the maximum number of features used per tree equal to the binary logarithm of the total
number of available features instead of its square root. The tables containing the results
of the optimization processes can be found in Appendix E.
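The search over the maximum number of features per tree can be sketched like this. The data
is synthetic and the tree count is reduced for brevity; the three option values match
scikit-learn's random-forest `max_features` parameter (square root, binary logarithm, and
all features).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the dynamics feature data.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# 'sqrt' is the square root of the feature count, 'log2' the binary
# logarithm, and None uses all available features per tree.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": ["sqrt", "log2", None]},
    cv=3, scoring="roc_auc",
).fit(X, y)

print(search.best_params_)
```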
Table 6.4: Results for general dynamics regression model with various training set sizes
Nr of samples R2 MAE MSE EVS
100 0.22524175 13.6874 296.6768386 0.225479194
200 0.200609351 13.97993333 301.8228383 0.200895355
300 0.164304591 15.50327333 374.3565912 0.164358175
400 0.207546542 15.77119833 377.2240982 0.207925006
500 0.20879954 16.859352 413.9270148 0.209297264
600 0.174548003 17.16563111 428.6194474 0.175037163
Table 6.5: Results for general dynamics classification model with various training set sizes
Nr of samples Precision Recall F1 ROC AUC
100 0.673879142 0.674569402 0.672268673 0.669756839
200 0.707781439 0.656543907 0.680428146 0.691684817
300 0.741407802 0.733825815 0.736972541 0.737031076
400 0.735368727 0.70652697 0.719417635 0.725463714
500 0.71523112 0.708420543 0.711343485 0.714561319
600 0.698434725 0.71025146 0.704286078 0.701755464
700 0.711076088 0.725112248 0.716872164 0.712789109
800 0.703133438 0.720240645 0.711293069 0.706909115
6.8 Clustered Models
Both general models, which do not distinguish between persons, and individual models, which
are built using data of one person only, were built. Besides these types of models,
clustered models could also be built. The concept of clustered models is, just like that of
individual models, based on the assumption that every person is unique. However, it is also
assumed that there may exist a finite number of clusters, or user profiles, comprising
people that show similar computer behaviour for each emotional state. A clustered model
first tries to define these profiles and to assign each person to a profile. Next, it builds
one model per profile using only data from the persons assigned to that profile. The
profiles can be detected based on similar personality characteristics using a standard
clustering algorithm such as k-means clustering. To do this, more participants are needed
and more extensive personality questionnaires should be administered to all participants.
6.9 Dataset
We have prepared the raw dataset used in this research for future use, which required some
precautions. During the data collection process, all keystroke, mouse movement and location
data of each participant were collected. These data have already been anonymized by
assigning a specific irreversible identifier to each participant. Corresponding personal
information, such as a participant's name, email address and place of birth, will be
discarded. Furthermore, as all keystrokes were collected, the data also contain email
addresses, passwords and other potentially sensitive information that should not be made
publicly available. Editing these data manually would be infeasible, so we opted to apply a
substitution code to the alphabetic data. This way, the actual textual content becomes
unreadable, while all of the dynamics features still make sense: corresponding keydown and
keyup pairs can still be matched, and the distinction between alphabetic, numerical and
special characters still exists. To maintain the hierarchical structure of the entire
dataset, we opted to make it available in JSON format. Note that the dataset contains raw
data and does not provide pre-calculated features.
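A substitution code of this kind can be sketched as follows. The fixed permutation below is
purely illustrative (a real anonymisation pass would use a secret, randomly generated
permutation); the sketch only shows how alphabetic characters are substituted while digits,
punctuation, whitespace and letter case are left intact.

```python
import string

# Illustrative fixed permutation of the alphabet (a real anonymisation
# pass would use a secret, randomly generated permutation).
SHUFFLED = "qwertyuiopasdfghjklzxcvbnm"
TABLE = str.maketrans(
    string.ascii_lowercase + string.ascii_uppercase,
    SHUFFLED + SHUFFLED.upper(),
)

def substitute(text: str) -> str:
    """Substitute alphabetic characters only; digits, punctuation and
    whitespace are left intact, so dynamics features remain meaningful."""
    return text.translate(TABLE)

print(substitute("Password123!"))  # alphabetic part becomes unreadable
```

Because the mapping is a bijection on the alphabet, character classes and keydown/keyup
correspondences are preserved exactly.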
[Figure: bar charts of random-forest feature importances]
(a) Regression
(b) Pleasure classification
(c) Arousal classification
(d) Dominance classification
Figure 6.2: Feature importances for best general dynamics models
Chapter 7
Conclusion
The techniques currently available for recognizing the emotions of computer users are often
expensive and intrusive, and therefore not well suited for use in home or office
environments. The real-world application possibilities of affective computing solutions are
thus limited by several factors.
This research covered a solution that analyzes users’ keystroke dynamics, textual content,
mouse movements and some contextual factors to recognize their emotional state. This solu-
tion overcomes many of the limitations of current techniques. It is a non-intrusive solution as
the user is not aware of the ongoing monitoring process. Furthermore, this solution only uses
inexpensive and widely used computer equipment. This makes the technique very interesting
to use for real-world application of affective computing solutions in present-day computer
systems.
Data collection software was developed and used to perform a field study. In this field study,
an experience sampling methodology was used to collect samples, in which participants had
to report their emotional state on the PAD dimensional emotion model. These reports were
taken as ground truth values and features were extracted from the raw monitoring data. These
ground truth values were then used to build emotion recognition models based on the
extracted features.
7.1 Lessons Learned
One of the main limitations of this research was the reliance on the participants'
self-reports to obtain the ground truth values. This has the disadvantage that it is
entirely possible for participants to provide incorrect emotional state information,
intentionally or not. However, there is no objective method for emotion identification that
lends itself to the experience sampling methodology. Furthermore, the use of this
methodology requires participants to have a good understanding of the emotion model that is
used to indicate their emotional
states. The PAD model has the disadvantage that the meaning of its dimensions is not always
clear without proper explanation. Of course, there are also advantages to the experience
sampling methodology: it allows for an easy remote data collection process over long periods
of time, and participants are not influenced by the data collection process, as they are in
laboratory settings.
Another problem that was observed in this research is class skew. This problem is again
caused by the experience sampling methodology: because it does not allow emotions to be
induced in participants, a dataset that is uniformly distributed over the PAD space cannot
be obtained without manipulation. Classification appeared to be easier than regression,
mainly due to this class skew and the size of the dataset. Classification allows for
subsampling, which solves the class skew problem and improves model performance
significantly. However, subsampling also has the disadvantage of removing a lot of useful
data. To allow all data to be used, bootstrap subsampling can be applied.
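The subsampling step described above can be illustrated with a minimal sketch; the function and variable names are our own and do not come from the study's implementation:

```python
import random
from collections import defaultdict

def subsample_balanced(samples, labels, seed=0):
    """Randomly drop samples from the majority classes so that every class
    is reduced to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Size of the rarest class determines the per-class sample budget.
    n_min = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items()
                for x in rng.sample(xs, n_min)]
    rng.shuffle(balanced)
    return balanced
```

Bootstrap subsampling repeats this step with different random seeds and aggregates the resulting models, so that over many draws all of the data ends up being used.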
In general, the dataset used in this research was not very big. More training samples
would, first of all, result in better coverage of the PAD space and, secondly, in less dramatic
effects of subsampling. Individual models were built as well, but this proved difficult due to
an insufficient number of samples per person and the manifestation of the class skew
problem at the individual level.
7.2 Contributions
This section presents a brief summary of the main contributions in this research.
First, the pleasure-arousal-dominance (PAD) emotion model was used, providing a more
discriminative way of defining emotions. This emotion model also allows regression
techniques to be applied to the emotion recognition task. Other studies used either discrete
emotion models (based on classes), one-dimensional models (such as PANAS) or
two-dimensional models (such as the PA model), all of which can be ambiguous for some
emotional states.
Second, free text data was used as the basis for the models in this research, in contrast to
many other studies that used fixed text. Although free text is a much more difficult
approach, its potential for real-world application is much bigger, as it allows users to work
without restraint while information about their emotional state is still being collected.
Third, models were built based on a combination of dynamics features (both keystroke and
mouse) and contextual features (i.e. weather data), and compared to investigate the added
value of contextual features. To allow weather data to be used, an easy workflow was
provided that only depends on the user's location services settings and an internet
connection. This research has shown that weather data may contain valuable information
for emotion recognition, but further examination is required to confirm this.
Fourth, both general models and individual models were built to investigate the possible
advantages of using more user-specific models to make accurate predictions. The results in
this research for individual models are promising but more data and especially more uniformly
distributed data is required to be able to draw more reliable conclusions on this matter.
Fifth, a methodology was presented to use regression models in more practical settings using
fuzzy logic. This methodology also allows for less accurate predictions to be used, as the
inherent fuzziness and uncertainty can be easily handled using fuzzy rules.
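As an illustration of such a fuzzy-logic post-processing step, a regression output can be mapped onto fuzzy sets and combined with simple rules. The value range, membership boundaries and rule below are illustrative assumptions, not the parameters of the actual methodology:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a to a peak at b and
    falling back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzify_pleasure(p):
    """Map a predicted pleasure value (assumed here to lie in [-100, 100])
    to membership degrees of three illustrative fuzzy sets."""
    return {
        "unpleasant": tri(p, -200, -100, 0),
        "neutral": tri(p, -100, 0, 100),
        "pleasant": tri(p, 0, 100, 200),
    }

# Example rule: IF pleasure is pleasant AND arousal is high THEN suggest
# upbeat content; min() acts as the fuzzy AND. The arousal membership 0.8
# is a made-up value for illustration.
firing_strength = min(fuzzify_pleasure(50)["pleasant"], 0.8)
```

Because each prediction activates several sets to a degree, an imprecise regression output still yields graded rule activations rather than a brittle hard decision.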
Sixth, good classification models were created for predicting two class levels on each
separate dimension of the PAD model based on dynamics features. The best model achieved
a classification accuracy of 0.73, a precision of 0.72, a recall of 0.74, an F1 score of 0.73 and
a ROC AUC score of 0.73.
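For reference, these scores follow directly from the confusion matrix; a minimal pure-Python sketch for binary labels (ROC AUC is omitted, since it requires ranked prediction scores rather than hard class labels):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```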
Finally, an easy-to-use dataset was constructed with a hierarchical structure and prepared for
public use. One of the main problems in this research area is that most studies use a different
dataset, different metrics, different features and/or different emotion models. This causes a
lot of difficulties when comparing findings and results. By making our dataset available for
use in other studies, the problem of differing datasets and emotion models is alleviated.
7.3 Potential for Application
This study presents classification models that are suitable for real-world application.
Regression models would be even more interesting and specific, but solutions suited for
practical use could not yet be created. The real-world application of these techniques also
raises ethical concerns regarding privacy, as the monitoring process goes unnoticed by the
user. The collected data may serve well to recognize a user's affective state, but it may also
cause a lot of harm if it falls into the wrong hands, as it may contain passwords and other
sensitive information. Notifying the user of the background monitoring process allows for
consent, but counteracts the goal of concealment.
7.4 Future Work
To further improve the models based on the presented techniques, bigger and more
uniformly distributed datasets need to be created. A dataset does not necessarily have to be
distributed uniformly from the start: when it contains enough data for each class,
subsampling can be applied to make it uniformly distributed. Alternative techniques to
subsampling can also be investigated. One such alternative is oversampling, which has the
advantage that all available data is used, but the disadvantage that it may lead to
overfitting on the specific dataset.
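Random oversampling, sketched below for illustration (the names are our own), duplicates minority-class samples with replacement until every class matches the majority class in size:

```python
import random
from collections import Counter

def oversample_balanced(samples, labels, seed=0):
    """Duplicate minority-class samples (drawn with replacement) until all
    classes are as frequent as the majority class. All original data is kept,
    but the duplicates can encourage overfitting."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n_max = max(counts.values())
    out = list(zip(samples, labels))
    for y, n in counts.items():
        pool = [x for x, yy in zip(samples, labels) if yy == y]
        out.extend((rng.choice(pool), y) for _ in range(n_max - n))
    rng.shuffle(out)
    return out
```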
It would also be interesting to examine the possibility of using the clustered models that
were proposed. This requires a lot more participants, such that clusters of participants can
be formed, models can be built for each cluster, and the models can be compared to observe
whether there is any improvement.
In this research, all samples were assumed to be mutually independent. However, when
observing samples at an individual level, it makes sense to assume that emotional states in
these samples depend on each other. For example, a person who is exuberant at time t is
more likely to be relaxed at time t+1 than to be bored. Therefore, it is interesting to explore
the use of time series on an individual level.
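Assuming discretized emotional-state labels, this temporal dependence could be captured with something as simple as per-individual first-order Markov transition estimates; a sketch:

```python
from collections import Counter, defaultdict

def transition_probabilities(state_sequence):
    """Estimate first-order Markov transition probabilities from one
    individual's chronological sequence of discrete emotional-state labels."""
    counts = defaultdict(Counter)
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for state, nxts in counts.items()
    }
```

Such a transition matrix could then serve as a prior over the next predicted state, rather than treating each sample as independent.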
Furthermore, other features can be used as a replacement for, or in addition to, the features
used in this study. Possible features include the sex of the user, whether the user is right- or
left-handed, the user's keyboard layout, the application context and typing proficiency.
Note that this information is already available in the current dataset. It may also be
interesting to investigate other techniques for outlier removal in order to reduce the noise
in the dataset.
As was mentioned before, one of the limitations of this study was the reliance on the
participants for establishing the ground truth. To overcome this, other techniques may be
used to identify the participants' emotional state, such as facial emotion recognition using a
camera.
Finally, the usage of textual content can be improved by detecting emotionally charged
words, for example using a hand-crafted dictionary of such words. Text normalization and
stemming techniques might also be used to optimize the text interpretation process and to
allow dialects to be handled. It could also be interesting to perform language detection and
then use the corresponding processing settings for that language. Note, however, that the
textual content in the provided dataset has been encrypted, so textual content analysis on
it is not possible. Another approach to text content analysis would be to collect an actual
set of texts (e.g. posts on social media) and apply machine learning techniques to that
dataset.
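A sketch of such a lexicon-based feature, using a deliberately crude suffix stripper in place of a real stemmer (e.g. the Porter stemmer) and a toy lexicon; everything here is illustrative:

```python
def crude_stem(word):
    """A very crude suffix stripper standing in for a real stemming
    algorithm; for illustration only."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def emotional_word_ratio(text, lexicon):
    """Fraction of tokens whose stem appears in a hand-crafted lexicon of
    emotionally charged word stems (the lexicon is supplied by the caller)."""
    tokens = [crude_stem(t.strip(".,!?").lower()) for t in text.split()]
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)
```

The resulting ratio could be added as one extra feature alongside the keystroke and mouse dynamics features.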
A number of techniques for recognizing a user’s emotional state based on typing behaviour,
mouse movements, contextual information and textual content were presented and explored.
The results of this research show that this is possible using hardware that is commonly
available today, which makes the solution particularly suitable for real-world application.
Bibliography
[1] Artificial neural network illustration. https://en.wikipedia.org/wiki/Artificial_
neural_network.
[2] Bayesian network example. http://www.ra.cs.uni-tuebingen.de/software/JCell/
tutorial/ch03s03.html.
[3] Decision tree example. https://alliance.seas.upenn.edu/~cis520/wiki/index.
php?n=Lectures.DecisionTrees.
[4] Em-algorithm example. http://www.bobnirma.com/tag/expectation-maximization.
[5] k-means clustering example. https://apandre.wordpress.com/visible-data/
cluster-analysis/.
[6] k-nn classification illustration. http://bdewilde.github.io/blog/blogger/2012/10/
26/classification-of-hand-written-digits-3/.
[7] Kernel trick illustration. http://www.eric-kim.net/eric-kim-net/posts/1/kernel_
trick.html.
[8] Lövheim cube of emotion. https://en.wikipedia.org/wiki/Emotion_classification.
[9] Machine learning. https://en.wikipedia.org/wiki/Machine_learning.
[10] Neuron illustration. https://en.wikibooks.org/wiki/Artificial_Neural_
Networks/Print_Version.
[11] Random forest classification illustration. http://file.scirp.org/Html/6-9101686_
31887.htm.
[12] Support vector machine classification illustration. https://www.dtreg.com/solution/
view/20.
[13] Areej Alhothali. Modeling user affect using interaction events. 2011.
[14] Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. Emotions from text: machine
learning for text-based emotion prediction. In Proceedings of the conference on Human
Language Technology and Empirical Methods in Natural Language Processing, pages 579–
586. Association for Computational Linguistics, 2005.
[15] Kaveh Bakhtiyari and Hafizah Husain. Fuzzy model in human emotions recognition.
arXiv preprint arXiv:1407.1474, 2014.
[16] Kaveh Bakhtiyari and Hafizah Husain. Fuzzy model of dominance emotions in affective
computing. Neural Computing and Applications, 25(6):1467–1477, 2014.
[17] Kaveh Bakhtiyari, Mona Taghavi, and Hafizah Husain. Implementation of emotional-
aware computer systems using typical input devices. In Intelligent Information and
Database Systems, pages 364–374. Springer, 2014.
[18] Margaret M Bradley and Peter J Lang. Affective norms for english words (anew): In-
struction manual and affective ratings. Technical report, Technical Report C-1, The
Center for Research in Psychophysiology, University of Florida, 1999.
[19] Joost Broekens. In defense of dominance: Pad usage in computational representations
of affect. International Journal of Synthetic Emotions (IJSE), 3(1):33–42, 2012.
[20] Gerald L Clore, Andrew Ortony, and Mark A Foss. The psychological foundations of the
affective lexicon. Journal of personality and social psychology, 53(4):751, 1987.
[21] Paul Ekman. An argument for basic emotions. Cognition & emotion, 6(3-4):169–200,
1992.
[22] Paul Ekman and Wallace V Friesen. Constants across cultures in the face and emotion.
Journal of personality and social psychology, 17(2):124, 1971.
[23] Clayton Epp, Michael Lippold, and Regan L Mandryk. Identifying emotional states
using keystroke dynamics. In Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, pages 715–724. ACM, 2011.
[24] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical resource
for opinion mining. In Proceedings of LREC, volume 6, pages 417–422. Citeseer, 2006.
[25] Michael Fairhurst, Da Costa-Abreu, et al. Using keystroke dynamics for gender identi-
fication in social network environment. In Imaging for Crime Detection and Prevention
2011 (ICDP 2011), 4th International Conference on, pages 1–6. IET, 2011.
[26] Joseph P Forgas. Mood and judgment: the affect infusion model (aim). Psychological
bulletin, 117(1):39, 1995.
[27] R Stockton Gaines, William Lisowski, S James Press, and Norman Shapiro. Authentica-
tion by keystroke timing: Some preliminary results. Technical report, DTIC Document,
1980.
[28] Hatice Gunes, Bjorn Schuller, Maja Pantic, and Roddy Cowie. Emotion representation,
analysis and synthesis in continuous space: A survey. In Automatic Face & Gesture
Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages
827–834. IEEE, 2011.
[29] Joel M Hektner, Jennifer A Schmidt, and Mihaly Csikszentmihalyi. Experience sampling
method: Measuring the quality of everyday life. Sage, 2007.
[30] Amaury Hernandez-Aguila, Mario Garcia-Valdez, and Alejandra Mancilla. Affective
states in software programming: Classification of individuals based on their keystroke
and mouse dynamics. Intelligent Learning Environments, page 27, 2014.
[31] Holger Hoffmann, Andreas Scheck, Timo Schuster, Steffen Walter, Kerstin Limbrecht,
Harald C Traue, and Henrik Kessler. Mapping discrete emotions into the dimensional
space: An empirical approach. In Systems, Man, and Cybernetics (SMC), 2012 IEEE
International Conference on, pages 3316–3320. IEEE, 2012.
[32] Edgar Howarth and Michael S Hoffman. A multidimensional approach to the relationship
between mood and weather. British Journal of Psychology, 75(1):15–23, 1984.
[33] Carroll E Izard. Four systems for emotion activation: cognitive and noncognitive pro-
cesses. Psychological review, 100(1):68, 1993.
[34] Philip Nicholas Johnson-Laird and Keith Oatley. The language of emotions: An analysis
of a semantic field. Cognition and emotion, 3(2):81–123, 1989.
[35] Christian Kaernbach. On dimensions in emotion psychology. In Automatic Face &
Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference
on, pages 792–796. IEEE, 2011.
[36] A Kołakowska, A Landowska, M Szwoch, W Szwoch, and MR Wróbel. Emotion recognition and its applications. In Human-Computer Systems Interaction: Backgrounds and Applications 3, pages 51–62. Springer, 2014.
[37] Agata Kołakowska. A review of emotion recognition methods based on keystroke dynamics and mouse movements. In Human System Interaction (HSI), 2013 The 6th International Conference on, pages 548–555. IEEE, 2013.
[38] Agata Kołakowska. Recognizing emotions on the basis of keystroke dynamics. In Human System Interactions (HSI), 2015 8th International Conference on, pages 291–297. IEEE, 2015.
[39] James D Laird. Self-attribution of emotion: the effects of expressive behavior on the
quality of emotional experience. Journal of personality and social psychology, 29(4):475,
1974.
[40] Peter J Lang, Mark K Greenwald, Margaret M Bradley, and Alfons O Hamm. Looking at
pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology, 30:261–
261, 1993.
[41] J LeDoux. Emotion circuits in the brain. 2003.
[42] Hosub Lee, Young Sang Choi, Sunjae Lee, and IP Park. Towards unobtrusive emo-
tion recognition for affective social communication. In Consumer Communications and
Networking Conference (CCNC), 2012 IEEE, pages 260–264. IEEE, 2012.
[43] Po-Ming Lee, Wei-Hsuan Tsui, and Tzu-Chien Hsiao. The influence of emotion on
keyboard typing: an experimental study using visual stimuli. Biomedical engineering
online, 13(1):81, 2014.
[44] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Can your smartphone
infer your mood. In PhoneSense workshop, pages 1–5, 2011.
[45] Robert LiKamWa, Yunxin Liu, Nicholas D Lane, and Lin Zhong. Moodscope: Building
a mood sensor from smartphone usage patterns. In Proceeding of the 11th annual inter-
national conference on Mobile systems, applications, and services, pages 389–402. ACM,
2013.
[46] Hugo Lövheim. A new three-dimensional model for emotions and monoamine neurotransmitters. Medical hypotheses, 78(2):341–348, 2012.
[47] Maryanne Martin. On the induction of mood. Clinical Psychology Review, 10(6):669–697,
1990.
[48] Albert Mehrabian. Basic dimensions for a general psychological theory implications for
personality, social, environmental, and developmental studies. 1980.
[49] George Miller and Christiane Fellbaum. Wordnet: An electronic lexical database, 1998.
[50] Fabian Monrose and Aviel Rubin. Authentication via keystroke dynamics. In Proceedings
of the 4th ACM conference on Computer and communications security, pages 48–56.
ACM, 1997.
[51] Fabian Monrose and Aviel D Rubin. Keystroke dynamics as a biometric for authentica-
tion. Future Generation computer systems, 16(4):351–359, 2000.
[52] AFM Nazmul Haque Nahin, Jawad Mohammad Alam, Hasan Mahmud, and Kamrul
Hasan. Identifying emotion by keystroke dynamics and text pattern analysis. Behaviour
& Information Technology, 33(9):987–996, 2014.
[53] Rosalind W Picard. Affective Computing. MIT Press, 2000.
[54] Maja Pusara and Carla E Brodley. User re-authentication via mouse movements. In
Proceedings of the 2004 ACM workshop on Visualization and data mining for computer
security, pages 1–8. ACM, 2004.
[55] Ira J Roseman and Craig A Smith. Appraisal theory: Overview, assumptions, varieties,
controversies. 2001.
[56] James A Russell. A circumplex model of affect. Journal of personality and social psy-
chology, 39(6):1161, 1980.
[57] Sergio Salmeron-Majadas, Olga C Santos, and Jesus G Boticario. Exploring indicators
from keyboard and mouse interactions to predict the user affective state. In Educational
Data Mining 2014, 2014.
[58] Pragya Shukla and Rinky Solanki. Web based keystroke dynamics application for identifying emotional state. International journal of advanced research in computer science and communication engineering, 2(11):4489–4493, 2013.
[59] Elizabeth Stapel. Box-and-whisker plots: Interquartile ranges and outliers, 1998.
[60] Carlo Strapparava, Alessandro Valitutti, et al. Wordnet affect: an affective extension of
wordnet. In LREC, volume 4, pages 1083–1086, 2004.
[61] Issa Traore et al. Biometric recognition based on free-text keystroke dynamics. Cyber-
netics, IEEE Transactions on, 44(4):458–472, 2014.
[62] G Tsoulouhas, D Georgiou, and A Karakos. Detection of learner’s affective state based
on mouse movements. J. Comput, 3:9–18, 2011.
[63] Wei-Hsuan Tsui, Poming Lee, and Tzu-Chien Hsiao. The effect of emotion on keystroke:
an experimental study using facial feedback hypothesis. In Engineering in Medicine and
Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, pages
2870–2873. IEEE, 2013.
[64] Lisa M Vizer, Lina Zhou, and Andrew Sears. Automated stress detection using keystroke
and linguistic features: An exploratory study. International Journal of Human-Computer
Studies, 67(10):870–886, 2009.
[65] David Watson, Lee A Clark, and Auke Tellegen. Development and validation of brief
measures of positive and negative affect: the panas scales. Journal of personality and
social psychology, 54(6):1063, 1988.
[66] Rainer Westermann, Kordelia Spies, Gunter Stahl, and Friedrich W Hesse. Relative
effectiveness and validity of mood induction procedures: a meta-analysis. European
Journal of Social Psychology, 26(4):557–580, 1996.
[67] Lotfi A Zadeh. Soft computing and fuzzy logic. IEEE software, 11(6):48, 1994.
[68] Lina Zhou, Judee K Burgoon, Jay F Nunamaker, and Doug Twitchell. Automating
linguistics-based cues for detecting deception in text-based asynchronous computer-
mediated communications. Group decision and negotiation, 13(1):81–106, 2004.
[69] P Zimmermann, Patrick Gomez, Brigitta Danuser, and S Schar. Extending usability:
putting affect into the user-experience. Proceedings of NordiCHI’06, pages 27–32, 2006.
[70] Philippe Zimmermann, Sissel Guttormsen, Brigitta Danuser, and Patrick Gomez. Affec-
tive computing – a rationale for measuring mood with mouse and keyboard. International
journal of occupational safety and ergonomics, 9(4):539–551, 2003.
Appendix A
Concept Matrix of Related Work
In this appendix, a concept matrix is presented (Table A.1), containing an overview of all
related literature on emotion recognition in computer systems.
Table A.1: Overview of related literature on emotion recognition in computer systems

Reference [69]
  Emotion model: PA emotion model
  Emotion elicitation method: video clips
  Source of features: keystrokes, mouse movements
  Data labeling method: questionnaire (self-assessment manikin)
  Method: statistical analysis
  Results: conclusion: possible discrimination between the neutral category and four others

Reference [64]
  Emotion model: physical stress, cognitive stress
  Emotion elicitation method: tasks inducing physical and cognitive stress
  Source of features: keystrokes, language parameters
  Data labeling method: questionnaire (11-point Likert scale)
  Method: decision trees, SVM, k-NN, AdaBoost, ANN
  Results: 75% accuracy for cognitive stress (k-NN), 62.5% for physical stress (AdaBoost, SVM, ANN); conclusion: the number of mistakes during typing decreases under stress

Reference [23]
  Emotion model: anger, boredom, confidence, distraction, excitement, focused, frustration, happiness, hesitance, nervousness, overwhelmed, relaxation, sadness, stress, tired
  Emotion elicitation method: none
  Source of features: keystrokes
  Data labeling method: questionnaire (5-point Likert scale filled in from time to time)
  Method: C4.5 decision tree (binary classification for each emotion)
  Results: accuracy of 77.4%-87.8% for confidence, hesitance, nervousness, relaxation, sadness and tiredness

Reference [13]
  Emotion model: confusion, delight, boredom, frustration, neutral
  Emotion elicitation method: none
  Source of features: keystrokes
  Data labeling method: questionnaire (5-point Likert scale)
  Method: k-NN, ANN, Bayesian network, naive Bayes
  Results: accuracy of 82% for emotional valence classification; 53% for specific emotions

Reference [62]
  Emotion model: boredom
  Emotion elicitation method: none
  Source of features: mouse movements, type of learning object
  Data labeling method: asking participants whether they are bored
  Method: C4.5 decision tree
  Results: accuracy of more than 90%

Reference [44, 45]
  Emotion model: PA emotion model
  Emotion elicitation method: none
  Source of features: keystrokes, smartphone usage data
  Data labeling method: questionnaire
  Method: multi-linear regression
  Results: conclusion: an individual feature subset increases accuracy

Reference [42]
  Emotion model: happiness, surprise, anger, disgust, sadness, fear, neutral
  Emotion elicitation method: none
  Source of features: keystrokes, smartphone sensors, discomfort index, location, time, weather
  Data labeling method: messages reporting emotional state
  Method: Bayesian network continuously updated with new data
  Results: 67.52% on average; best accuracy for happiness, surprise and neutral

Reference [63]
  Emotion model: happy, unhappy
  Emotion elicitation method: facial feedback
  Source of features: keystrokes
  Data labeling method: none
  Method: statistical analysis
  Results: conclusion: emotional states can be derived from keyboard behaviour

Reference [52]
  Emotion model: joy, fear, anger, sadness, disgust, shame, guilt, tired, neutral
  Emotion elicitation method: none
  Source of features: keystrokes, language parameters
  Data labeling method: questionnaire filled in from time to time
  Method: simple logistics, SMO, multilayer perceptron, random tree, C4.5, BF tree
  Results: accuracies of 70%-88% for fixed text and 64%-82% for free text

Reference [30]
  Emotion model: boredom, frustration
  Emotion elicitation method: none
  Source of features: keystrokes, mouse movements
  Data labeling method: questionnaire
  Method: k-NN classifier for each emotional state
  Results: accuracies of 83% for boredom and 74% for frustration

Reference [43]
  Emotion model: PA emotion model
  Emotion elicitation method: pictures from IAPS
  Source of features: keystrokes
  Data labeling method: questionnaire (self-assessment manikin)
  Method: statistical analysis
  Results: conclusion: the effect of emotion is significant in keystroke duration, latency and accuracy rate; the size of the emotional effect is small compared to individual variability

Reference [15, 16, 17]
  Emotion model: joy, anticipation, anger, disgust, sadness, surprise, fear, acceptance
  Emotion elicitation method: none
  Source of features: keystrokes, mouse, touchscreen interactions
  Data labeling method: questionnaire (PANAS)
  Method: statistical analysis, SVM, ANN
  Results: accuracy increase of 5% using fuzzy model
97
Appendix B
Participant Registration Form
Figure B.1
Appendix C
Participant Consent Form
Figure C.1
Appendix D
Software Download Page
Figure D.1
Appendix E
Model Hyperparameters
E.1 Regression
Table E.1 contains the results of the hyperparameter optimization process for the general
dynamics regression model that includes weather data, uses the joint PAD space and
applies feature selection.
E.2 Classification
Table E.2 contains the results of the hyperparameter optimization process for the general
dynamics classification model that excludes weather data, uses separate PAD dimensions
with classes defined by k-means clustering (k = 2), uses subsampling, and applies neither
class weight balancing nor feature selection.
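The grid searches behind Tables E.1 and E.2 sweep the number of trees and the maximum number of features. The skeleton of such a search can be sketched as follows, with the scoring function left as a caller-supplied placeholder (the grid below mirrors the tables' values, but the scorer and all names are assumptions, not the study's code):

```python
from itertools import product

def grid_search(train_and_score, param_grid):
    """Evaluate every hyperparameter combination with a caller-supplied
    train_and_score(params) -> score function and return the best
    (params, score) pair."""
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

# The sweep of Tables E.1/E.2: number of trees and max features;
# train_and_score would wrap cross-validated random forest training.
grid = {"n_trees": [100, 200, 300, 400, 500, 600, 700, 800],
        "max_features": ["sqrt", "all", "log2"]}
```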
Table E.1: Hyperparameter optimization for general dynamics regression model using grid search
Nr of trees Max nr features R2 MAE MSE EVS
100 sqrt 0.178650307 17.6086127 452.4303151 0.178966012
200 sqrt 0.175740189 17.70640952 453.9273299 0.175937216
300 sqrt 0.17683532 17.67471323 453.2118329 0.176943436
400 sqrt 0.177004813 17.69918836 453.1496681 0.177221663
500 sqrt 0.176618405 17.66490053 453.380924 0.17674802
600 sqrt 0.176685988 17.67822328 453.3584712 0.176828395
700 sqrt 0.176587233 17.66616614 453.4823105 0.176753282
800 sqrt 0.177313162 17.65592698 453.1341193 0.177451509
100 all 0.190923224 17.59279471 445.4954829 0.191133033
200 all 0.189904986 17.53400529 445.840431 0.190069898
300 all 0.191422213 17.57060529 445.0814295 0.191581371
400 all 0.188117428 17.60969947 446.7493161 0.18827938
500 all 0.19199562 17.57228254 444.6736571 0.1921756
600 all 0.185808124 17.62428254 448.0776765 0.185988301
700 all 0.184156946 17.62998095 448.8321779 0.18431094
800 all 0.187177941 17.57182011 447.439742 0.187301759
100 log2 0.17977641 17.67698624 451.8136086 0.179974733
200 log2 0.181768339 17.58517037 450.7620542 0.18196621
300 log2 0.174884347 17.67004444 454.578436 0.175064606
400 log2 0.177525459 17.6417418 453.088688 0.177756924
500 log2 0.178866185 17.59496296 452.3918334 0.179027812
600 log2 0.175404284 17.65729101 454.1455744 0.175656324
700 log2 0.172792224 17.66793862 455.624078 0.172939608
800 log2 0.174905998 17.6841619 454.3912707 0.175093419
Table E.2: Hyperparameter optimization for general dynamics classification model using grid search
Nr of trees Max nr features Precision Recall F1 ROC AUC
100 sqrt 0.71686927 0.707111808 0.711644788 0.713466663
200 sqrt 0.704371482 0.720373617 0.712215979 0.708925363
300 sqrt 0.728418536 0.735605812 0.731968255 0.730637864
400 sqrt 0.70607313 0.724677925 0.715033069 0.711031661
500 sqrt 0.724117413 0.729862325 0.726893421 0.725434758
600 sqrt 0.714933715 0.708710208 0.711795339 0.713038544
700 sqrt 0.719238781 0.732778821 0.725786337 0.723117703
800 sqrt 0.71267875 0.705886159 0.709097055 0.71023027
100 auto 0.717886841 0.722932798 0.720260402 0.719083396
200 auto 0.718318603 0.731641942 0.724828283 0.722042637
300 auto 0.726802884 0.728300622 0.72745952 0.72699334
400 auto 0.718379264 0.728124523 0.723001153 0.720913997
500 auto 0.71146811 0.729440075 0.720263543 0.716470289
600 auto 0.702790135 0.715592413 0.709103894 0.706543384
700 auto 0.708831857 0.719343856 0.713965292 0.711662466
800 auto 0.714707523 0.73272966 0.723398413 0.719528025
100 log2 0.720422463 0.721104756 0.720521645 0.720205053
200 log2 0.721426877 0.724824651 0.722283768 0.722127182
300 log2 0.704636676 0.724712415 0.714433151 0.710142404
400 log2 0.714535575 0.7323551 0.722995706 0.718913545
500 log2 0.735719652 0.74211784 0.738469775 0.737029937
600 log2 0.716503125 0.722218536 0.719149921 0.717729891
700 log2 0.723037174 0.731058632 0.726792436 0.724931794
800 log2 0.720901299 0.744886874 0.732580893 0.727714217