
Affect-DML: Context-Aware One-Shot Recognition of Human Affect using Deep Metric Learning

Kunyu Peng Alina Roitberg David Schneider

Marios Koulakis Kailun Yang Rainer Stiefelhagen

Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany
{firstname.lastname}@kit.edu

Abstract— Human affect recognition is a well-established research area with numerous applications, e.g. in psychological care, but existing methods assume that all emotions-of-interest are given a priori as annotated training examples. However, the rising granularity and refinements of the human emotional spectrum through novel psychological theories and the increased consideration of emotions in context bring considerable pressure to data collection and labeling work. In this paper, we conceptualize one-shot recognition of emotions in context – a new problem aimed at recognizing human affect states at a finer granularity from a single support sample. To address this challenging task, we follow the deep metric learning paradigm and introduce a multi-modal emotion embedding approach which minimizes the distance of same-emotion embeddings by leveraging complementary information of human appearance and the semantic scene context obtained through a semantic segmentation network. All streams of our context-aware model are optimized jointly using a weighted triplet loss and a weighted cross entropy loss. We conduct thorough experiments on both the categorical and the numerical emotion recognition tasks of the Emotic dataset adapted to our one-shot recognition problem, revealing that categorizing human affect from a single example is a hard task. Still, all variants of our model clearly outperform the random baseline, while leveraging the semantic scene context consistently improves the learnt representations, setting state-of-the-art results in one-shot emotion recognition. To foster research on more universal representations of human affect states, we will make our benchmark and models publicly available to the community under https://github.com/KPeng9510/Affect-DML.

I. INTRODUCTION

Can one recognize a human emotional state given a single visual example? Human affect, also referred to as human emotion, has long been a central concern of psychology research, as it is directly linked to mental and physical health [31]. Assessment of human emotion allows doctors to diagnose conditions such as Parkinson's [3], Huntington's [15], depression [1] and Alzheimer's disease [40] early on. Positive emotions have the capability to make human lives better and lead to better health and more productive work [38].

Deep learning has led to great success in many human observation tasks [6], [20], [21], [29], and has also strongly influenced the field of visual emotion recognition [42], [24], [33], [17], [18]. However, deep Convolutional Neural Networks (CNNs) are known to be data-hungry and fairly limited in their ability to adapt to new situations. Like in many other computer vision fields, the vast majority of human affect recognition research assumes that a high amount of labelled examples is available during training for all flavours of emotions we want to recognize [42], [24], [33], [17]. This is problematic for two reasons. First, human affect is a complex topic and the categorization schemes constantly evolve and become increasingly refined following the progress of psychological theories [34], [12]. For example, Cowen et al. [7] identified 27 types of human affect, where the emotions can be generalizations, specializations or combinations of each other. Second, the increasing interest in emotions-in-context [17] leads to the set of possible situations being much more diverse and dynamic, eventually changing over time, as we cannot capture and annotate all possible types of environments influencing the emotion. Bringing the idea of categorizing new flavors of emotions without excessive data labelling to the scope of researchers' attention is therefore the main motivation of our work.

Fig. 1. An overview of the one-shot recognition of emotions in context task. The a priori knowledge acquired from a label-rich dataset of known categories is transferred to categorize human affect in finer classes given a single visual example. Such one-shot recognition is a hard but important task, as it detaches changes in context-of-interest or novel psychological categorization schemes from costly data collection and labelling.

In this work, we propose to incorporate generalization to previously unseen human affects at a finer granularity from a single visual example into the evaluation of emotion recognition models and introduce the one-shot recognition of emotions in context task (see the problem formulation in Figure 1). Given a label-rich dataset of known categories, we aim to train a model which learns discriminative emotion embeddings that are well-adaptable for the classification of novel label-scarce categories. To meet the challenge of learning generalizable affect embeddings, we follow the deep metric learning [22] paradigm, augmenting the existing emotion classification models [18] trained on the Emotic dataset [17] with an additional triplet loss [4], which minimizes the distance of same-emotion embeddings. Furthermore, we introduce a new architecture for one-shot emotion classification, which leverages the semantic scene context obtained from a semantic segmentation framework. We believe that our Sem2Vec context embeddings carry highly discriminative information specifically for emotions in context, as they allow to effectively interpret the surroundings in the case of scarce training data, which is empirically validated through our experiments.

Fig. 2. An overview of the main structure of Affect-DML proposed in this work. The inputs are the original full image, the human body image cropped from the original input, and the semantic segmentation of the full image. Three separate networks are leveraged for feature extraction. In addition to the full image feature extraction network and the body feature extraction network, an MLP-based semantic feature extraction network operates on an embedded semantic vector calculated from the semantic segmentation of the given image, providing additional semantic cues for representation extraction. Note that we rename the image feature extraction network of [18] into the full image feature extraction network in our work. A triplet-margin loss and a weighted cross entropy loss supervise the representation extraction, and a K-nearest-neighbour (KNN) approach classifies the samples from unseen classes according to their distance to the support set samples in the embedding space.

Contributions and Summary. Given the dynamic nature of environments influencing human affect and the continuous refinement of emotion taxonomies through new findings of psychological research, we argue that reducing the burden of data collection for incoming, finer human-affect categories is an important and under-researched problem. This paper makes a step towards generalizable emotion recognition models which are able to adapt to novel categories with a single visual example and has two major contributions. (1) First, we formalize the under-researched task of one-shot recognition of emotions in context (see Figure 1), extending the conventional setup [17] with novel label-scarce types of affect present at test time. To this intent, we augment the original Emotic dataset [17] with single-example emotions present during evaluation, defining the benchmark protocols for one-shot categorization of human affect as both categorical and numerical (i.e. emotion assessment on a continuous scale) problems. (2) To achieve more generalizable representations of human affect, we follow the deep metric learning paradigm and introduce an approach for context-aware one-shot human affect recognition. We build on the off-the-shelf approach for conventional emotion classification [18], augmenting it with a semantic scene context branch, which obtains semantic embeddings (Sem2Vec) of the surroundings through a semantic segmentation framework [45] and is optimized jointly with the core network using a triplet loss and a cross entropy loss. Thereby, Sem2Vec mitigates the problem of data scarcity through a discriminative environment representation achieved through knowledge transfer from a segmentation model, outperforming the native affect classification architecture by >8% in our one-shot setting. To encourage future research on more universal emotion representations and release the pressure brought by extensive labelling of new examples, we will make our benchmark available to the public.

II. RELATED WORK

A. Human affect recognition

Most of the classical affect recognition methods focus on facial expressions [5], [11], [39] and sometimes leverage additional cues, such as body pose [25], [26], [35] or eye gaze [49]. The recently emerged Emotic dataset [17] enabled large-scale studies of recognizing emotions in context, considering diverse surrounding scenes influencing human affect and resulting in multiple neural architectures developed for this task [2], [23], [30], [18]. Mittal et al. [23] leveraged three interpretations, i.e., multi-modal, attention-driven and depth-based context, for dynamic emotion recognition. Ruan et al. [30] designed an architecture that makes use of both the global image context and local details of the target person for multi-label emotion recognition. Antoniadis et al. [2] proposed to exploit emotional dependencies by using a graph convolutional network. Differently from these works, we propose a semantic-aware one-shot human affect recognition method which handles the challenges of novel classes emerging in unconstrained conditions, while leveraging a context encoding based on semantic segmentation maps.

B. One-shot recognition

One-shot learning has been a hot research topic over the last years. Existing methods mainly pursue the paradigms of data augmentation, transfer learning, deep embedding learning, and meta learning. Data-augmentation-based methods [10] enrich the training data to produce more diverse examples or hallucinate examples of novel classes. Transfer-learning-based methods [19], [48] aim to reuse knowledge from learned tasks. Deep embedding learning methods [16], [44] are built to design low-dimensional embedding spaces that yield more discriminative feature representations. Meta learning models [32], [48] attempt to harvest knowledge among multiple tasks. There are additional few-shot learning methods [13], [22], [28], [36] for various tasks like face and action recognition, driver observation, and semantic segmentation. The work most similar to ours is presumably the one of Cruz et al. [8] (2014), who proposed a framework based on Haar-like features for one-shot classification of facial expressions. However, the setting of [8] is very restrictive compared to modern datasets [17] and puts the environment aside while focusing on frontal facial images only. Furthermore, Wang et al. [46] addressed few-shot sentiment analysis on social media, while Zhan et al. [50] considered zero-shot emotion categorization, which, however, is very different from the one-shot task, as it is centered around finding good textual embeddings for the unknown emotions, while one-shot learning assumes a single visual example present at test time. In this work, we conceptualize a novel task of one-shot emotion recognition in context, which has been largely overlooked in previous research.

III. ONE-SHOT RECOGNITION OF HUMAN AFFECT

A. Problem Formulation & One-Shot Emotic Benchmark

We tackle the problem of one-shot recognition of emotions in context from image data, which aims to assign the correct human affect class given a single support set sample for the unseen categories. First, we provide a formal definition of the problem, with an overview illustrated in Figure 1. Conceptually, one-shot human affect recognition aims to transfer a priori knowledge acquired from a label-rich dataset of known categories to categorize human affect at a finer granularity given a single support set sample.

Formally, let $C_{novel}$ denote the unseen classes and $C_{base}$ the seen classes. One-shot recognition is the task of classifying examples from a set of unseen classes $C_{novel}$ given one single sample per unseen class in the support set and a large amount of training data for the seen classes $C_{base}$. Formally, we start with a training set $D = \{(d_n, y_n)\}_{n=1}^{N}$ with $y_n \in C_{base}$. Then a support set $R = \{x_i\}_{i=1}^{|C_{novel}|}$, where $C_{base} \cap C_{novel} = \emptyset$, and a query set $Q = \{q_n\}_{n=1}^{M}$ are supplied, and the goal is to assign a class $y \in C_{novel}$ to each $q_n \in Q$. Note that the query set is also the testing set.

TABLE I
HUMAN AFFECT CATEGORICAL CLASS MAPPING FOR ONE-SHOT HUMAN AFFECT RECOGNITION.

Mode  | One-shot class | Emotic [17] classes, CAT-6:6 split            | Emotic [17] classes, CAT-6:4 split
Train | Anger          | Disapproval, Aversion, Annoyance, Anger       | Disapproval, Aversion, Annoyance, Anger
Train | Sadness        | Suffering, Sadness, Fatigue, Pain             | Suffering, Sadness, Fatigue, Pain, Sensitive, Embarrassment
Train | Fear           | Fear, Disquietment                            | Fear, Disquietment
Train | Love           | Peace, Affection, Sympathy                    | Peace, Affection, Sympathy
Train | Joy            | Happiness, Pleasure, Excitement, Anticipation | Happiness, Pleasure, Excitement, Anticipation
Train | Surprise       | Surprise, Confusion/Doubt, Confidence         | Surprise, Confusion/Doubt, Confidence
Test  | Disconnection  | Disconnection                                 | Disconnection
Test  | Engagement     | Engagement                                    | Engagement
Test  | Sensitive      | Sensitive                                     | ✗
Test  | Embarrassment  | Embarrassment                                 | ✗
Test  | Esteem         | Esteem                                        | Esteem
Test  | Yearning       | Yearning                                      | Yearning

Categorical benchmark. In order to provide a benchmark for one-shot recognition of emotions in context, we reformulate the 26 categories covered by the well-established Emotic dataset [17], developed for conventional human affect classification, for our one-shot recognition task. To achieve this, we split the dataset category-wise into the seen classes $C_{base}$ and unseen classes $C_{novel}$, following Table I. As our support set $R$ we select the first sample of the corresponding unseen category obtained through a random ordering, while the remaining samples constitute the query set $Q$, also referred to as the test set. Due to the trend of making human affect taxonomies more and more fine-grained, our seen classes $C_{base}$ comprise coarser types of human affect, namely anger, sadness, joy, love, fear and surprise, constructed from the emotional categories provided by the Emotic dataset [17]. The fine-to-coarse cluster relationship of these human affect categories follows the human affect tree structure proposed by Shaver et al. [37]. The test set $C_{novel}$ therefore represents more fine-grained human affects. To honor the one-shot learning premise, we ensure disjoint training and testing categories, i.e. $C_{base} \cap C_{novel} = \emptyset$, also taking generalizations into account to make sure there is no overlap. We consider two kinds of train-test splits for the categorical one-shot human affect recognition task, denoted as CAT-6:4 and CAT-6:6, with a detailed overview provided in Table I.
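To make the protocol concrete, the following is a minimal sketch of such a split construction, assuming a hypothetical list of (sample_id, class_name) pairs for the novel categories; it only illustrates the rule that the first randomly ordered sample of each unseen class enters the support set and all remaining samples form the query set.

```python
# A minimal sketch of the one-shot split rule described above; the
# (sample_id, class_name) list is a hypothetical stand-in for the Emotic
# annotations of the unseen categories.
import random

def build_one_shot_split(novel_samples, seed=0):
    """Return (support, query): one sample per novel class vs. all remaining samples."""
    rng = random.Random(seed)
    shuffled = novel_samples[:]
    rng.shuffle(shuffled)                  # random order over the unseen-class samples
    support, query = {}, []
    for sample_id, cls in shuffled:
        if cls not in support:             # first occurrence of a class -> support set R
            support[cls] = sample_id
        else:                              # every further sample -> query (test) set Q
            query.append((sample_id, cls))
    return support, query

samples = [(0, "Esteem"), (1, "Yearning"), (2, "Esteem"), (3, "Engagement"), (4, "Engagement")]
support, query = build_one_shot_split(samples)
```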

Reformulated numerical benchmark. We further reformulate the numerical human affect regression task into a one-shot recognition task to verify the efficiency of our approach on unnamed human affect categories defined by different levels of the three main components of the Emotic dataset [17], namely valence, arousal and dominance, based on the work of Posner et al. [27]. We consider the discrete level annotations from 1 to 10, as in [17], as category labels for each component, with the splits denoted as LEV-7:3 and LEV-6:4 for different seen-unseen proportions.

Dealing with multi-label annotations. Since Emotic [17] is a multi-label dataset, we leverage a multiple-to-single label mapping approach to obtain a unique label encoding for each sample, described in detail in the following. Assuming $N$ classes in total for both training and testing, let $e$ denote an $N$-dimensional zero vector. If the emotional labels of a sample are contained in the list $W$, then the desired label $y$ for this sample is encoded as

$$e[\mathrm{Label2index}(l_i)] \mathrel{+}= 1 \quad \text{for } l_i \in W, \qquad (1)$$

$$y = \operatorname{argmax}(e), \qquad (2)$$

where Label2index maps a label from the Emotic dataset [17] to a one-shot encoded label according to the column index of Table I for the categorical human affects. For numerical human affect one-shot recognition, the levels [1,3,5,6,8,9,10] are selected as the training set for the LEV-7:3 train-test split and [1,3,5,6,8,9] for the LEV-6:4 train-test split, while the remaining levels form the respective test sets; Label2index is the identity mapping in this case. The final label for each sample is generated through an argmax operation over the encoded vector $e$.
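As an illustration of Eqs. (1)-(2), the following is a minimal sketch of the multiple-to-single label mapping; the LABEL2INDEX dictionary is a hypothetical excerpt of the Table I column mapping, not the complete mapping of the benchmark.

```python
# A sketch of the multiple-to-single label mapping (Eqs. 1-2). LABEL2INDEX
# is a hypothetical excerpt of the Table I mapping (CAT splits).
import numpy as np

LABEL2INDEX = {
    "Disapproval": 0, "Aversion": 0, "Annoyance": 0, "Anger": 0,        # -> Anger
    "Suffering": 1, "Sadness": 1, "Fatigue": 1, "Pain": 1,              # -> Sadness
    "Happiness": 4, "Pleasure": 4, "Excitement": 4, "Anticipation": 4,  # -> Joy
}

def encode_single_label(emotic_labels, num_classes=6):
    """Map a multi-label Emotic annotation W to a single one-shot class y."""
    e = np.zeros(num_classes)              # N-dimensional zero vector
    for label in emotic_labels:            # Eq. (1): one vote per mapped class
        e[LABEL2INDEX[label]] += 1
    return int(np.argmax(e))               # Eq. (2): class with the most votes

print(encode_single_label(["Happiness", "Excitement", "Anger"]))   # -> 4 (Joy)
```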

B. Deep Metric Learning Architecture

To address the one-shot recognition of emotions-in-context task, we follow the deep metric learning paradigm [22] (the main workflow is illustrated in Figure 2). Our model takes an image $I$ as input together with the location of the person-of-interest (as in [17]), which we use to obtain the body crop image $I_{body}$. This information is passed through the networks $f_{body}$, $f_{img}$, $f_{sem}$ to produce the features $f_{body}(I_{body})$, $f_{img}(I_{img})$, $f_{sem}(I_{sem})$, capturing the target person, the full image and its semantic context, respectively. Those features are fused and embedded through the network $f_{emb}$. The network is optimized to learn highly separable human affect representations with deep metric learning (DML). During inference, the finer unseen classes are classified based on the sample in the support set closest to them in the embedding space. We now describe the key components of our architecture in detail.

C. Feature encoders

All three feature encoders $f_{body}$, $f_{img}$, $f_{sem}$ are off-the-shelf networks, with the exception of $f_{sem}$, where we enhance an existing architecture to build a compact segmentation feature vector. A ResNet [14] classification model with its softmax layer discarded is leveraged to form $f_{body}$ and $f_{img}$ individually, and the person crop $I_{body}$ and the full image $I_{img}$ serve as inputs to these two feature extraction branches. The semantic embedded feature is computed as $f_{sem} = M_{\delta} \circ \mathrm{Sem2Vec}$, where Sem2Vec consists of an off-the-shelf segmentation network together with a layer providing the set of predicted semantic classes in a one-hot-encoded form, and $M_{\delta}$ denotes fully-connected layers. To elaborate on Sem2Vec, assume that $\mathrm{Sem}$ is the segmentation network and $\mathrm{Sem}(I)$ denotes the semantic segmentation result of image $I$. If $\mathrm{Set} \circ \mathrm{Sem}(I)$ is the set of all distinct classes appearing on at least one pixel and $X = \mathrm{Sem2Vec}(I)$, then

$$X[i] = \begin{cases} 1, & \text{if } i \in \mathrm{Set} \circ \mathrm{Sem}(I) \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

The length of $X$ is fixed to the number of pre-defined classes of the semantic segmentation network. The features of the three branches are embedded into the vector $f_{emb}(I_{img}, I_{body}, I_{sem}) = G_{\theta} \circ F(f_{body}(I_{body}), f_{img}(I_{img}), f_{sem}(I_{sem}))$, where $F$ is the concatenation operation and $G_{\theta}$ denotes a fully-connected layer mapping the concatenated features into a representation of the desired dimension.
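The following PyTorch sketch illustrates Sem2Vec (Eq. 3) and the three-branch fusion embedder $f_{emb}$ under assumed dimensions (512-d branch features, a 150-d Sem2Vec vector over the ADE20K classes, a 128-d output embedding); it is a simplified illustration rather than the exact implementation.

```python
# A minimal PyTorch sketch of the three feature branches and the fusion
# embedder f_emb of Sec. III-C. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_SEM_CLASSES = 150  # ADE20K categories predicted by the segmentation network

def sem2vec(seg_map: torch.Tensor) -> torch.Tensor:
    """Eq. (3): X[i] = 1 iff class i appears on at least one pixel.

    seg_map: (H, W) tensor of per-pixel class indices from Sem(I).
    """
    x = torch.zeros(NUM_SEM_CLASSES)
    x[seg_map.unique().long()] = 1.0
    return x

class AffectEmbedder(nn.Module):
    """f_emb = G_theta ∘ F(f_body, f_img, f_sem) with F = concatenation."""
    def __init__(self, emb_dim=128):
        super().__init__()
        # f_img / f_body: ResNet18 backbones without the classification head
        self.f_img = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1])
        self.f_body = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1])
        # M_delta: MLP on the binary Sem2Vec vector (channels [256, 512])
        self.f_sem = nn.Sequential(
            nn.Linear(NUM_SEM_CLASSES, 256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Linear(256, 512),
        )
        # G_theta: fully-connected layer on the concatenated branch features
        self.g_theta = nn.Linear(512 * 3, emb_dim)

    def forward(self, img, body, sem_vec):
        z_img = self.f_img(img).flatten(1)      # full-image branch
        z_body = self.f_body(body).flatten(1)   # person-crop branch
        z_sem = self.f_sem(sem_vec)             # semantic context branch
        return self.g_theta(torch.cat([z_img, z_body, z_sem], dim=1))
```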

D. Deep Metric Learning

Our training procedure builds on the deep metric learning pipeline of [22]. In the sampling phase, a batch of samples $B = \{(x_n, y_n)\}_{n=1}^{N_b}$ is picked and embedded via $f_{emb}$ into $B_{emb} = \{(z_n, y_n)\}_{n=1}^{N_b}$, where $N_b$ denotes the batch size. Right after, the Multi-Similarity Miner [47] is used to select the most informative positive and negative pairs of embeddings. Namely, let $\{(z_{a_i}, z_{p_i})\}_{i=1}^{N_p}$ and $\{(z_{a_i}, z_{n_i})\}_{i=1}^{N_n}$ be all possible positive and negative pairs of $B_{emb}$, where $N_p$ denotes the number of positive pairs and $N_n$ the number of negative pairs, and let $S$ be the cosine similarity. The filtering is then defined as

$$S(z_{a_i}, z_{p_i}) < \max_{y_k \neq y_{a_i}} S(z_{a_i}, z_k) + \varepsilon, \qquad (4)$$

$$S(z_{a_i}, z_{n_i}) > \min_{y_k = y_{a_i}} S(z_{a_i}, z_k) - \varepsilon. \qquad (5)$$

Thus, we keep only the positive pairs whose similarity is smaller (up to $\varepsilon$) than that of the most similar negative pair of the anchor, and the negative pairs whose similarity is higher (up to $\varepsilon$) than that of the least similar positive pair of the anchor.

The positive and negative pairs sharing the same anchor are combined to create triplets $\{(z_{a_i}, z_{p_i}, z_{n_i})\}_{i=1}^{N_t}$, which are used to compute the triplet loss with margin $\lambda$, where $N_t$ denotes the number of triplets:

$$L_{triplet} = \sum_{i=1}^{N_t} \left[ \|z_{a_i} - z_{p_i}\| - \|z_{a_i} - z_{n_i}\| + \lambda \right]_{+}. \qquad (6)$$

Aside from being used to compute $L_{triplet}$, the embeddings $B_{emb}$ are passed through a fully-connected layer to produce a classification loss $L_{class}$ (cross entropy). The total loss is then a weighted sum of the two aforementioned losses:

$$L_{all} = \alpha L_{triplet} + \beta L_{class}. \qquad (7)$$
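One way to realize this objective is sketched below with the pytorch-metric-learning library, whose Multi-Similarity miner and triplet-margin loss correspond to Eqs. (4)-(6); the margin, $\varepsilon$ and the classifier dimensions are illustrative assumptions.

```python
# A hedged sketch of the joint objective (Eqs. 4-7) using the
# pytorch-metric-learning library; hyperparameters are illustrative.
import torch
import torch.nn as nn
from pytorch_metric_learning import losses, miners

miner = miners.MultiSimilarityMiner(epsilon=0.1)       # pair filtering, Eqs. (4)-(5)
triplet_loss = losses.TripletMarginLoss(margin=0.2)    # Eq. (6)
classifier = nn.Linear(128, 6)                         # FC head on the embeddings (6 coarse training classes)
ce_loss = nn.CrossEntropyLoss()
alpha, beta = 0.5, 0.5                                 # loss weights of Eq. (7)

def training_loss(embeddings, labels):
    """Weighted sum of triplet and cross entropy loss (Eq. 7)."""
    mined_pairs = miner(embeddings, labels)            # informative pairs -> triplets
    l_triplet = triplet_loss(embeddings, labels, mined_pairs)
    l_class = ce_loss(classifier(embeddings), labels)
    return alpha * l_triplet + beta * l_class
```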


Fig. 3. A comparison of one-shot categorical human affect recognition between our best Affect-DML model, indicated by [a], and Kosti et al. [18] extended with deep metric learning (IB-DML†), indicated by [b], on the test set. False recognitions are marked in red. The train-test split for these two models is CAT-6:6. The blue bounding box indicates the person annotated for categorical human affect recognition according to the Emotic dataset [17].

E. Inference

Deep metric learning pushes the embedder to learn class-separable representations in the embedding space, aiming at reducing the intra-class distances while increasing the distances between different classes. This makes it possible to reduce the one-shot recognition problem to a nearest neighbour search in the embedding space.

Let $R = \{x_i\}_{i=1}^{|C|}$, where $C = C_{novel}$ is the set of novel classes, be the support set for one-shot recognition, and let $x = (I_{body}, I_{img}, I_{sem})$ be a single example to classify. Also assume that $R_{emb} = \{z_i\}_{i=1}^{|C|}$ and $z$ are the embedded vectors of those examples obtained via $f_{emb}$. We make the assumption that for each class $i$, the probability of a sample belonging to that class is given by an isotropic distribution centered around $z_i$. We also assume that the density of the distribution is strictly decreasing in the distance from $z_i$, thus the density is defined as $p(z|y_i) = f(\|z - z_i\|)$. A concrete candidate for $f$ could be, for example, the density of a multidimensional isotropic normal distribution centered around $z_i$. We further assume that the support classes are independent from each other and occur with equal frequencies. Then, by Bayes' rule, we have

$$P(y_i|z) = \frac{p(z|y_i) P(y_i)}{\sum_{j \in C} p(z|y_j) P(y_j)} \qquad (8)$$

$$= \frac{p(z|y_i) \frac{1}{|C|}}{\sum_{j \in C} p(z|y_j) \frac{1}{|C|}} = \gamma f(\|z - z_i\|), \qquad (9)$$

where $\gamma$ is constant across classes for a fixed value of $z$. The predicted class of $x$ therefore reduces to

$$y_{pred} = \operatorname*{argmax}_{y_i \in C} f(\|z - z_i\|), \qquad (10)$$

which is equivalent to selecting the nearest neighbour of $z$ in the embedding space. Note that although the above assumptions are quite restrictive, they are in some sense the best one can hope for given the very limited information that one-shot learning provides; how well the approach separates the unseen classes is decided by the ability of deep metric learning to create a good embedding space for both seen and unseen classes.
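A minimal sketch of this nearest-neighbour inference step (Eq. 10) is given below; the support and query embeddings are assumed to come from an embedder such as the AffectEmbedder sketched earlier.

```python
# A minimal sketch of the one-shot inference step (Eq. 10): each query
# embedding is assigned the class of its nearest support embedding.
import torch

def one_shot_classify(query_emb, support_emb, support_labels):
    """1-NN classification in the embedding space.

    query_emb:      (M, D) embeddings of query samples
    support_emb:    (|C_novel|, D) one embedding per novel class
    support_labels: (|C_novel|,) class index of each support sample
    """
    dists = torch.cdist(query_emb, support_emb)   # pairwise ||z - z_i||
    nearest = dists.argmin(dim=1)                 # index of the closest support sample
    return support_labels[nearest]                # predicted class per query
```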

Segmentation-driven context encoding. In order to extract semantic attribute-based features from the full image, the Sem2Vec approach is proposed in our work to leverage underlying semantic cues and enhance the recognition of human affect. First, as mentioned above, the semantic segmentation map is generated by a well-trained HRNet [45] model, trained on the ADE20K [52] dataset with $N_{sem}$ semantic segmentation classes, given the full image as input (we use 150 segmentation categories in our experiments). After generating the aforementioned binary encoded vector, which indicates which kinds of objects occur in the full image and provides semantic cues for human affect recognition, denoted as $X = \mathrm{Sem2Vec}(x)$, the final representation $f_{sem}$ harvesting the semantic cues is extracted from this vector representation of the semantic information:

$$f_{sem} = M_{\delta}(\mathrm{Sem2Vec}(x)), \qquad (11)$$

where $M_{\delta}$ denotes fully-connected layers with ReLU and normalization.

TABLE II
RESULTS FOR ONE-SHOT CATEGORICAL RECOGNITION.

Method             | Image | Body | Semantic | CAT-6:6 | CAT-6:4
Baseline methods
Random             |   ✗   |  ✗   |    ✗     |  16.78  |  16.68
Kosti et al. [18]† |   ✓   |  ✓   |    ✗     |  29.98  |  30.22
Image/Body-based architectures [18] trained with DML
I-DML              |   ✓   |  ✗   |    ✗     |  17.84  |  27.64
B-DML              |   ✗   |  ✓   |    ✗     |  15.41  |  27.66
IB-DML             |   ✓   |  ✓   |    ✗     |  21.02  |  28.19
IB-DML†            |   ✓   |  ✓   |    ✗     |  24.20  |  29.46
Semantic segmentation-enhanced architectures
Sem-I-DML          |   ✓   |  ✗   |    ✓     |  32.16  |  33.48
Sem-IB-DML         |   ✓   |  ✓   |    ✓     |  33.76  |  36.26

† The body feature extraction network is pretrained on Places-365 [51] and the full image feature extraction network on ImageNet [9] to reproduce the setting of Kosti et al. [18]; for all other models except the random baseline, both networks are pretrained on ImageNet [9].

IV. EXPERIMENTS

We conduct extensive experiments on the One-Shot-Emotic benchmark. First, we describe the implementation details of our proposed approach in Section IV-A. Then, we present our quantitative results on both categorical and numerical one-shot human affect recognition in Section IV-B and Section IV-C, respectively. Finally, Section IV-D conducts an ablation of the model architecture choices and provides qualitative examples of the predictions.

A. Implementation Details

We use ResNet [14] pre-trained on ImageNet [9] as our body encoding network $f_{body}$ and image representation encoding network $f_{img}$. Tables II and III provide results using the ResNet18 version of the backbone, while Table IV compares ResNet18 and ResNet50 to analyze the impact of the model capacity. The input preprocessing for the body encoding network covers cropping the person bounding boxes using the annotations provided by [18] and rescaling the cropped image to 256 × 256. For our emotion recognition architecture, we remove the last softmax layer and append a single fully-connected layer to both the human body and the full image networks in order to project the output logits into a fixed 512-dimensional feature space. The aforementioned $M_{\delta}$ uses the channel setting [256, 512], leveraging a Rectified Linear Unit (ReLU) activation function followed by a batch normalization layer between the fully-connected layers; the batch size is 32. RMSprop [41] is used as the optimizer with a learning rate of 3.5e−4 for the categorical human affect experiments described in Section IV-B. For the numerical human affect recognition experiments described in Section IV-C, the learning rate is also set to 3.5e−4 for the LEV-7:3 train-test split and to 3.5e−6 for LEV-6:4 (the learning rates were chosen via grid search to optimize the validation set performance). The learning rate decay factor $\gamma$ is universally set to 0.1 with a step size of 4 epochs. Our approach is trained directly with the triplet loss and the cross entropy loss, with weights $\alpha = 0.5$ and $\beta = 0.5$ to balance representation learning and classification. Further details are listed in the supplementary material.
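The optimization setup above can be sketched as follows; AffectEmbedder and training_loss refer to the hypothetical sketches given earlier, the dummy batch only illustrates the expected tensor shapes, and the step-wise learning rate decay is one interpretation of the γ = 0.1 / step size 4 setting.

```python
# A sketch of the optimization setup (RMSprop, step-wise learning rate
# decay, equal loss weights); a single training step on a dummy batch.
import torch

model = AffectEmbedder()
optimizer = torch.optim.RMSprop(model.parameters(), lr=3.5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)

# Dummy batch standing in for a data loader over the one-shot Emotic training split.
img = torch.randn(32, 3, 256, 256)       # full images
body = torch.randn(32, 3, 256, 256)      # person crops
sem_vec = torch.rand(32, 150).round()    # binary Sem2Vec vectors
labels = torch.randint(0, 6, (32,))      # coarse training classes (CAT splits)

model.train()
embeddings = model(img, body, sem_vec)
loss = training_loss(embeddings, labels)  # Eq. (7) with alpha = beta = 0.5
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()                          # decay the learning rate every 4 epochs
```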

B. One-shot Categorical Recognition Results

Table II compares the original approach of Kosti et al. [18], developed for standard emotion recognition, with different variants of our DML-trained enhancement and our proposed segmentation-aware model. The different split types are marked as CAT-6:6 and CAT-6:4 (see Section III-A for details). We also consider the impact of the individual branches, i.e., the image branch (I-DML) and the body branch (B-DML), as well as their combination with each other (IB-DML) and with the semantic segmentation branch (Sem-I-DML and Sem-IB-DML). In the following, we also refer to our full model Sem-IB-DML as Affect-DML.

Our quantitative evaluation highlights the advantage of leveraging the semantic segmentation-driven embedding $f_{sem}$ both with and without the body-focused network. Exchanging the body network $f_{body}$ of IB-DML for our semantic branch (Sem-I-DML) provides a remarkable performance gain of 11.14% points on CAT-6:6 and about 5.29% points on CAT-6:4. Combining the representations of all three networks (Sem-IB-DML) further increases the performance by 1.60% on CAT-6:6 and 2.78% on CAT-6:4. Our method also outperforms the original method of Kosti et al. [18] by a significant margin.

C. One-shot Numerical Recognition Results

Table III lists the one-shot evaluation results of our numerical human affect recognition benchmark on the two train-test splits, denoted LEV-7:3 and LEV-6:4 (see Section III-A). Leveraging the body representation branch together with the image feature extraction network (IB-DML) outperforms Kosti et al. [18] extended with DML (IB-DML†) in most metrics. Once again, combining the full image representation branch with our semantic embedding branch $f_{sem}$ (Sem-I-DML) instead of the body feature extraction network of IB-DML improves the average accuracy over the three numerical human affect dimensions valence, arousal and dominance by 2.16% points (split LEV-7:3). Note that we still evaluate the numerical emotion with categorical accuracy (assigning the predictions to ten partitions of the continuous space, following the annotation in [17]). The gain in accuracy is consistent and also holds on the LEV-6:4 train-test split, where Sem-I-DML improves the results by 3.30% points over IB-DML. In contrast to the experiments on LEV-7:3, combining all three networks yields a further improvement of about 2.36% compared with Sem-I-DML, illustrating that the modalities mostly complement each other and that their fusion is beneficial to our task.

TABLE III
EXPERIMENTAL RESULTS ON THE ONE-SHOT EMOTIC DATASET FOR NUMERICAL HUMAN AFFECT.

Method                    | Image | Body | Semantic | LEV-7:3 Valence | Arousal | Dominance | Avg Acc | LEV-6:4 Valence | Arousal | Dominance | Avg Acc
Baseline methods
Random                    |   ✗   |  ✗   |    ✗     |      12.23      |  6.86   |   13.26   |  10.79  |      7.62       |  11.73  |   13.28   |  10.88
Kosti et al. [18]†        |   ✓   |  ✓   |    ✗     |      36.03      |  45.01  |   45.02   |  42.02  |      28.68      |  33.65  |   37.24   |  34.78
Image/Body-based architectures [18] trained with DML
I-DML                     |   ✓   |  ✗   |    ✗     |      46.72      |  46.89  |   55.63   |  49.75  |      30.63      |  38.31  |   63.36   |  44.10
B-DML                     |   ✗   |  ✓   |    ✗     |      35.89      |  41.95  |   54.99   |  44.28  |      37.05      |  36.91  |   53.62   |  42.53
IB-DML                    |   ✓   |  ✓   |    ✗     |      50.75      |  46.03  |   59.16   |  51.98  |      48.96      |  38.20  |   58.69   |  48.62
IB-DML ([18] with DML)†   |   ✓   |  ✓   |    ✗     |      49.00      |  43.53  |   56.23   |  49.59  |      47.33      |  35.13  |   55.99   |  46.15
Semantic segmentation-enhanced architectures (ours)
Sem-I-DML (ours)          |   ✓   |  ✗   |    ✓     |      56.79      |  46.84  |   58.78   |  54.14  |      51.76      |  38.72  |   65.29   |  51.92
Sem-IB-DML (ours)         |   ✓   |  ✓   |    ✓     |      52.36      |  47.19  |   64.52   |  54.89  |      58.92      |  41.37  |   62.56   |  54.28

† The body feature extraction network is pretrained on Places-365 [51] and the full image feature extraction network on ImageNet [9] to reproduce the setting of Kosti et al. [18]; for all other models except the random baseline, both networks are pretrained on ImageNet [9].

TABLE IV
EXPERIMENTS FOR DIFFERENT FULL IMAGE AND BODY FEATURE EXTRACTION NETWORK STRUCTURES.

Image/body model | CAT-6:6 | CAT-6:4 | LEV-7:3 Val | LEV-7:3 Aro | LEV-7:3 Dom | LEV-6:4 Val | LEV-6:4 Aro | LEV-6:4 Dom
ResNet18         |  33.76  |  36.26  |    52.34    |    47.19    |    64.52    |    58.92    |    41.37    |    62.56
ResNet50         |  35.80  |  36.24  |    56.27    |    48.51    |    64.51    |    57.69    |    41.59    |    62.22

D. Architecture Analysis and Qualitative Results

In Table IV, we analyze the influence of the backbone architecture size on the recognition results (ResNet18 vs. ResNet50 [14]). The ResNet50 architecture shows better results on some tasks, such as CAT-6:6, the recognition of valence and arousal on LEV-7:3 and the recognition of arousal on LEV-6:4, while the ResNet18 architecture performs better on the other tasks. Taking a closer look, the performance gains of ResNet50 are substantial, while in the cases where the smaller architecture shows advantages, ResNet18 only achieves a slight performance gain of 0.4% on average over ResNet50.

In Figure 4 we provide a 2-dimensional t-SNE [43] analysis of the latent space embedding learnt by our best model, Affect-DML. Each embedding is colored according to its category in the testing set. There is a clear separation of the individual clusters (the color represents the ground truth emotion category). Note that the visualization only covers categories not present during the classifier training. This indicates that the representation learned through the combination of the two loss functions is highly discriminative and generalizes well to novel classes, as the visualized states have not been seen during training.
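A small sketch of such a visualization, assuming the test-set embeddings and their ground-truth labels are available as NumPy arrays, could look as follows.

```python
# A small sketch of the Fig. 4 style visualization: project test-set
# embeddings of the unseen classes to 2-D with t-SNE and color them by
# ground-truth category. Input arrays are assumed to be given.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings: np.ndarray, labels: np.ndarray, class_names):
    """embeddings: (N, D) array, labels: (N,) integer class indices."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
    for idx, name in enumerate(class_names):
        mask = labels == idx
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=name)
    plt.legend()
    plt.show()
```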

Fig. 4. A visualization of the intermediate embeddings for the unseen emotion classes as a 2-dimensional t-SNE representation. There is a clear separation of the individual clusters even though the categories were not present during training. Best viewed in color.

Finally, we provide multiple qualitative prediction results in Figure 3. Our Affect-DML model is referenced by [a] and compared with IB-DML† (Kosti et al. [18] extended with DML), referenced by [b]. In most examples, Affect-DML provides better predictions, a result which is consistent with the quantitative evaluation in Table II, where Sem-IB-DML (Affect-DML) outperforms IB-DML by 8.07% for CAT-6:4.

V. CONCLUSION

In this paper, we proposed Affect-DML – a framework for recognizing human emotions in context given a single visual example. Such one-shot recognition of human affect is a challenging but important task, because although the range of facial expressions remains the same, psychological categorization schemes and the contexts-of-interest which influence the emotions continuously change. We formalize the one-shot categorization problem by augmenting the Emotic dataset and allowing only a set of coarse emotions to be present during training, while during inference the model recognizes more fine-grained emotions from a single image example. Our framework differs from the existing emotion categorization methods in two ways. First, we optimize our model using deep metric learning, leveraging a combination of cross entropy and triplet loss. Second, we propose to leverage an encoding of the maps produced by a semantic segmentation network, which we refer to as Sem2Vec. We believe that semantic segmentation information is vital for generalizable models of emotions in context, which is consistently confirmed in our extensive experiments.

Acknowledgments. The research leading to these results was supported by the SmartAge project sponsored by the Carl Zeiss Stiftung (P2019-01-003; 2021-2026).

REFERENCES

[1] I. M. Anderson et al. State-dependent alteration in face emotion recognition in depression. The British Journal of Psychiatry, 2011.
[2] P. Antoniadis, P. P. Filntisis, and P. Maragos. Exploiting emotional dependencies with graph convolutional networks for facial expression recognition. arXiv, 2021.
[3] S. Argaud, M. Vérin, P. Sauleau, and D. Grandjean. Facial emotion recognition in Parkinson's disease: A review and new hypotheses. Movement Disorders, 2018.
[4] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, 2016.
[5] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martínez. EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In CVPR, 2016.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[7] A. S. Cowen and D. Keltner. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. PNAS, 2017.
[8] A. C. Cruz, B. Bhanu, and N. S. Thakoor. One shot emotion scores for facial emotion recognition. In ICIP, 2014.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NeurIPS, 2014.
[11] S. Du, Y. Tao, and A. M. Martinez. Compound facial expressions of emotion. PNAS, 2014.
[12] P. Ekman. Basic emotions. Handbook of Cognition and Emotion, 1999.
[13] Y. Guo and L. Zhang. One-shot face recognition by promoting underrepresented classes. arXiv, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] S. M. Henley, M. J. Novak, C. Frost, J. King, S. J. Tabrizi, and J. D. Warren. Emotion recognition in Huntington's disease: A systematic review. Neuroscience & Biobehavioral Reviews, 2012.
[16] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICMLW, 2015.
[17] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza. EMOTIC: Emotions in Context dataset. In CVPRW, 2017.
[18] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza. Context based emotion recognition using EMOTIC dataset. TPAMI, 2020.
[19] F. Li, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV, 2003.
[20] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. TPAMI, 2018.
[21] I. Masi, Y. Wu, T. Hassner, and P. Natarajan. Deep face recognition: A survey. In SIBGRAPI, 2018.
[22] R. Memmesheimer, N. Theisen, and D. Paulus. SL-DML: Signal level deep metric learning for multimodal one-shot action recognition. In ICPR, 2021.
[23] T. Mittal, P. Guhan, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha. EmotiCon: Context-aware multimodal emotion recognition using Frege's principle. In CVPR, 2020.
[24] A. Mollahosseini, B. Hasani, and M. H. Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. TAC, 2019.
[25] W. Mou, O. Celiktutan, and H. Gunes. Group-level arousal and valence recognition in static images: Face, body and context. In FG, 2015.
[26] M. A. Nicolaou, H. Gunes, and M. Pantic. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. TAC, 2011.
[27] J. Posner, J. A. Russell, and B. S. Peterson. The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and Psychopathology, 2005.
[28] S. Reiß, A. Roitberg, M. Haurilet, and R. Stiefelhagen. Activity-aware attributes for zero-shot driver behavior recognition. In CVPRW, 2020.
[29] A. Roitberg, M. Haurilet, S. Reiß, and R. Stiefelhagen. CNN-based driver activity understanding: Shedding light on deep spatiotemporal representations. In ITSC, 2020.
[30] S. Ruan et al. Context-aware generation-based net for multi-label visual emotion recognition. In ICME, 2020.
[31] P. Salovey, A. J. Rothman, J. B. Detweiler, and W. T. Steward. Emotional states and physical health. American Psychologist, 2000.
[32] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
[33] A. V. Savchenko. Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. arXiv, 2021.
[34] K. R. Scherer, V. Shuman, J. Fontaine, and C. Soriano Salinas. The grid meets the wheel: Assessing emotional feeling via self-report. Components of Emotional Meaning: A Sourcebook, 2013.
[35] K. Schindler, L. Van Gool, and B. De Gelder. Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Networks, 2008.
[36] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots. One-shot learning for semantic segmentation. In BMVC, 2017.
[37] P. Shaver, J. Schwartz, D. Kirson, and C. O'Connor. Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 1987.
[38] L. Shu et al. A review of emotion recognition using physiological signals. Sensors, 2018.
[39] M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic. Analysis of EEG signals and facial expressions for continuous emotion detection. TAC, 2016.
[40] I. Spoletini et al. Facial emotion recognition deficit in amnestic mild cognitive impairment and Alzheimer disease. The American Journal of Geriatric Psychiatry, 2008.
[41] T. Tieleman and G. Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[42] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou. End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 2017.
[43] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008.
[44] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NeurIPS, 2016.
[45] J. Wang et al. Deep high-resolution representation learning for visual recognition. TPAMI, 2020.
[46] L. Wang, X. Xu, F. Liu, X. Xing, B. Cai, and W. Lu. Robust emotion navigation: Few-shot visual sentiment analysis by auxiliary noisy data. In ACIIW, 2019.
[47] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott. Multi-similarity loss with general pair weighting for deep metric learning. In CVPR, 2019.
[48] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
[49] E. Wolf, M. Martinez, A. Roitberg, R. Stiefelhagen, and B. Deml. Estimating mental load in passive and active tasks from pupil and gaze changes using Bayesian surprise. In ICMIW, 2018.
[50] C. Zhan, D. She, S. Zhao, M.-M. Cheng, and J. Yang. Zero-shot emotion recognition via affective structural embedding. In ICCV, 2019.
[51] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2018.
[52] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.