
Page 1

A Survey and Critique of Facial Expression Synthesis in Sign Language Animation

Hernisa Kacorri

SECOND EXAM
Doctoral Program in Computer Science
City University of New York – The Graduate Center

COMMITTEE
Matt Huenerfauth (Mentor)
Andrew Rosenberg and Raquel Benbunan-Fich

November 10, 2014

Hernisa Kacorri Graduate Center CUNY 1/35

Page 2

Overview
•  Introduction
   –  Linguistics of Sign Language
   –  Facial Expressions in Sign Language
   –  Animation Technologies in Sign Language
   –  Synthesis of Facial Expression Animations
•  Critique and Comparison of Five Selected Papers
   –  Paper Selection Criteria
   –  Paper Critiques for Each Project
   –  Overall Comparison
•  Conclusions and Future Prospects

Hernisa Kacorri Graduate Center CUNY 2/35

Page 3

Linguistics of Sign Languages

•  Sign language has a distinct word order, syntax, and lexicon from spoken/written language.

•  Sign language animations can improve the accessibility of information and services for deaf individuals with low spoken/written language literacy.

•  State-of-the-art sign language animation tools focus mostly on the accuracy of manual signs, not facial expressions.

•  Facial expressions reveal linguistically significant information in sign language: when applied, they can indicate important grammatical information about phrases.

Hernisa Kacorri Graduate Center CUNY 3/35

Page 4

Facial Expressions in Sign Language

Lexical

•  May involve mouth patterns derived from the spoken language.

Modifier or Adverbial

•  Often co-occurs with a predicate to semantically modify its meaning.

Hernisa Kacorri Graduate Center CUNY 4/35

LATE / NOT-YET / 'MM' / 'OO'   (Source: www.handspeak.com)

Page 5

Facial Expressions in Sign Language

Syntactic
•  Convey grammatical information during entire syntactic phrases.
•  Are constrained by the timing and scope of the manual signs in a phrase.

Paralinguistic
•  Include affective and prosodic behaviors.
•  Are not linguistically constrained in time by the manual signs.

Hernisa Kacorri Graduate Center CUNY 5/35

Page 6

Hernisa Kacorri Graduate Center CUNY 6/35

Sign Language Animation Technologies

"Your sister's favorite restaurant is called Bob's Diner. All the food's cheap?"
•  Scripted by a developer who knows ASL (provided a dictionary of signs).
•  Generated by the computer, based on some input source.
Time to create: several minutes.

"The Forest: An ASL Story"
•  Animated by a Deaf 3D animator (key-framed animation by hand).
Time to create: several days/months.

Source: Hurdich, 2008 (YouTube)

Page 7

Facial Parameterization

FACS (Ekman and Friesen, 1978)

•  Describes the muscle activity in a human face using a list of fundamental action units (AUs).

MPEG-4 (ISO/IEC 14496-2, 1999)
•  A face is controlled by setting values for 68 Facial Animation Parameters (FAPs), which are normalized displacements of facial landmarks based on face proportions.

Hernisa Kacorri Graduate Center CUNY 7/35

2.1.1 Facial Action Coding System (FACS)

FACS, developed by Ekman and Friesen (1978), describes the muscle activity in a human face using a list of fundamental action units (AUs). AUs refer to the visible movements affecting facial appearance and are mapped to underlying muscles in a many-to-many relation. For example, AU12 is named Lip Corner Puller and corresponds to movements of the zygomaticus major facial muscle, as shown in Fig. 7. FACS has been adopted by many facial animation systems (e.g. Weise et al., 2011). However, in computer vision, automatic detection of AUs still achieves relatively low scores, in the range of 63% (Valstar et al., 2012). Some inherent challenges of FACS are that a) some AUs affect the face in opposite directions, thus conflicting with each other, and b) some AUs hide the visual presence of others.

Figure 7: A FACS action unit example: (a) AU12 Lip Corner Puller (source: ISHIZAKI Lab) mapped to (b) the zygomaticus major facial muscle (source: Wikipedia).
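To picture how FACS coding is used in practice, here is a small Python sketch, not drawn from any of the surveyed systems: a facial expression is represented simply as a mapping from action units to intensities. The AU names are standard FACS labels; the intensity scale, the values, and the raised-brow example are arbitrary choices for illustration.

```python
# Illustrative sketch: representing a facial expression as FACS action units
# with intensities. AU names are standard FACS labels; the 0-1 intensity
# scale and the example values are arbitrary, for illustration only.
AU_NAMES = {
    1: "Inner Brow Raiser",
    2: "Outer Brow Raiser",
    4: "Brow Lowerer",
    12: "Lip Corner Puller",   # zygomaticus major (see Fig. 7)
}

# A hypothetical raised-brow expression (e.g., as might accompany a
# yes/no question in ASL), coded as AU -> intensity.
raised_brows = {1: 0.8, 2: 0.7}

def describe(expression: dict[int, float]) -> str:
    """Render an AU-coded expression in a readable form."""
    parts = [f"AU{au} {AU_NAMES.get(au, 'unknown')} ({level:.1f})"
             for au, level in sorted(expression.items())]
    return ", ".join(parts)

print(describe(raised_brows))
# AU1 Inner Brow Raiser (0.8), AU2 Outer Brow Raiser (0.7)
```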

2.1.2 MPEG-4 Facial Animation (MPEG-4 FAPs)

The MPEG-4 compression standard (ISO/IEC 14496-2, 1999) includes a 3D model-based coding for face animation specified by 68 Facial Animation Parameters (FAPs) for head motion, eyebrow, nose, mouth, and tongue controls that can be combined to represent natural facial expressions. The values of these parameters are defined as the amount of displacement of characteristic points in the face from their neutral position (Fig. 8a), normalized by scale factors (Fig. 8b) that are based on the proportions of a particular human (or virtual human) face. The use of these scale factors thus allows the FAPs to be interpreted on any facial model in a consistent way. For example, FAP 30 is named "raise_l_i_eyebrow" and is defined as the vertical displacement of feature point 4.1 (visible in Fig. 8a) normalized by the scale factor ENS0 (Fig. 8b). MPEG-4 facial animation allows for an integrated solution for performance-driven animations, where facial features are extracted from recordings of multiple humans and applied across multiple MPEG-4 compatible avatars.

Figure 8: MPEG-4 facial animation (a) feature points and (b) scale factors.
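The FAP normalization idea (a landmark displacement expressed in face-proportion units, so the same value can drive differently proportioned faces) can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from any surveyed system; the 1/1024 FAPU fraction follows the common MPEG-4 convention, and all numeric values are made up for the example.

```python
# Minimal sketch of how an MPEG-4 FAP value can be interpreted: a landmark
# displacement from the neutral pose, normalized by a face-specific scale
# factor so it transfers across face models. Numbers are hypothetical.

def fap_value(displacement_mm: float, scale_factor_mm: float, fapu_fraction: float = 1024.0) -> float:
    """Normalize a landmark displacement by a face-proportion scale factor.
    FAPs are expressed in FAP units (FAPU), commonly defined as 1/1024 of a
    facial distance (e.g., eye separation, eye-nose separation) measured on
    the neutral face."""
    fapu = scale_factor_mm / fapu_fraction
    return displacement_mm / fapu

def apply_fap(neutral_y_mm: float, fap: float, scale_factor_mm: float, fapu_fraction: float = 1024.0) -> float:
    """Re-apply the same FAP on a different face model (different scale factor)."""
    fapu = scale_factor_mm / fapu_fraction
    return neutral_y_mm + fap * fapu

# Example: FAP 30 ("raise_l_i_eyebrow") as vertical displacement of feature
# point 4.1, normalized by an eye-nose-separation scale factor (cf. ENS0).
if __name__ == "__main__":
    ens_recorded_face = 60.0   # hypothetical eye-nose distance (mm) of recorded signer
    raise_mm = 3.0             # hypothetical measured eyebrow raise (mm)
    fap30 = fap_value(raise_mm, ens_recorded_face)

    ens_avatar = 48.0          # hypothetical eye-nose distance (mm) of the avatar
    new_y = apply_fap(neutral_y_mm=0.0, fap=fap30, scale_factor_mm=ens_avatar)
    print(f"FAP30 = {fap30:.1f} FAPU; avatar eyebrow displacement = {new_y:.2f} mm")
```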

2.1.3 Proprietary Facial Parameterization

Driven by particular research questions and phenomena to be investigated, researchers have often adopted proprietary parameters to describe facial movements. For example, some sign language linguists have used a normalized eyebrow height without distinguishing between the left and right eyebrow or between the inner, middle, and outer points of the eyebrow (Grossman and Kegl, 2006). Computer science researchers investigating facial movement during lexical facial expressions have adopted similar approaches. Schmidt et al. (2013) tracked higher-level facial features such as mouth vertical and horizontal openness and left and right eyebrow states. A limitation was the sensitivity of these features to the facial anatomy of the recorded person.

Page 8

Facial Features Extraction

Marker-based Facial Features
•  Example: motion capture using small reflective dots affixed to key face locations.
•  (-): Cannot be applied to existing recordings of native signers.
•  (-): Prevents reuse of videos, because recordings require that "dots" be affixed to the face.

Marker-free Facial Features
•  Example: extract key facial movement features using computer vision.
•  (-): Does not compensate for obstacles in front of the face, a frequent phenomenon in sign language because the hands often perform in the face area.

Hernisa Kacorri Graduate Center CUNY 8/35
Source: Softimage® Face Robot®; Source: Metaxas, D., Rutgers University

Page 9

Challenges in Evaluating Sign Language Facial Expression Animations

•  A common agreement among researchers is the importance of involving native signers in the evaluation process.

•  Challenges:
   –  Signers may not consciously notice a facial expression during a sign language passage.
   –  Some facial expressions can affect the meaning in a way that is difficult to measure with a single comprehension question.

Hernisa Kacorri Graduate Center CUNY 9/35

Stimuli Design

Screening Protocols

Comprehension Questions Design

Upper Baseline

Page 10

Overview
•  Introduction
   –  Linguistics of Sign Language
   –  Facial Expressions in Sign Language
   –  Animation Technologies in Sign Language
   –  Synthesis of Facial Expression Animations
•  Critique and Comparison of Five Selected Projects
   –  Paper Selection Criteria
   –  Paper Critiques for Each Project
   –  Overall Comparison
•  Conclusions and Future Prospects

Hernisa Kacorri Graduate Center CUNY 10/35

Page 11

Paper Selection Criteria
•  We researched online indexes and other databases for all academic papers on sign language animation.
•  We selected animation projects in which:
   –  linguistic sign language facial expressions are supported
   –  data from recordings of native signers are used to drive the animations
   –  facial expression animations are evaluated in user studies
•  We are especially interested in studies that use facial features from multiple recordings to drive their facial expression models.

Hernisa Kacorri Graduate  Center  CUNY   11/35

Page 12

Five Selected Projects

Hernisa Kacorri Graduate  Center  CUNY   12/35

Project/Attributes         | HamNoSys-based                                    | VCom3D                                   | DePaul                                   | SignCom                                           | ClustLexical
Sign Language              | European Sign Languages                           | American Sign Language                   | American Sign Language                   | French Sign Language                              | German Sign Language
Type of Facial Expressions | Lexical, Modifiers, Syntactic, and Paralinguistic | Modifiers, Syntactic, and Paralinguistic | Syntactic and Paralinguistic (Affective) | Lexical, Modifiers, Syntactic, and Paralinguistic | Lexical
Involvement of Deaf People | Corpus Annotation & User Study                    | Linguistic Insights                      | Research Team Member                     | Corpus Collection                                 | Corpus Annotation
Evaluation                 | User Study                                        | User Study                               | User Study                               | User Study                                        | Similarity-based score


Page 13

HamNoSys-based Project


Input: a detailed symbolic description of hand and face movements. Output: an animation.

Elliott et al. (2004); Jennings et al. (2010) Hernisa Kacorri Graduate  Center  CUNY   13/35

Page 14

Approach

Cons

•  Detailed user input is required – no automatic synthesis.

•  Eyebrows, eyelids, and nose are grouped: to control one you must fully specify all.

•  Does not allow a facial expression to be applied over multiple signs.

•  It is difficult to synchronize mouthing to the manual signs.

Hernisa Kacorri Graduate Center CUNY 14/35

…scheme was not part of the original design and began only with the last version (4.0) of HamNoSys. Researchers are working on how to best represent the non-manual channel in SiGML and the animation software that supports it, e.g. JASigning (Ebling and Glauert, 2013; Jennings et al., 2010). Despite their progress, their work always assumes an input describing the sequence of facial expressions and changes over one or more signs; so far they have not focused on automatic synthesis through inference from multiple data/instances.

4.1.1 Approach

Gestural SiGML is an XML representation of the linear HamNoSys notation, with a structure similar to that of abstract syntax trees, containing additional information about the speed or duration of signing. In HamNoSys, a sign is transcribed linearly with iconic symbols (typically 5 to 25 symbols drawn from about 200 "sub-lexical" units that are not language specific), describing its hand shape, orientation, location in 3D space, and a number of actions, as illustrated in Fig. 10.

Figure 10: An example of the sign HAMBURG transcribed in HamNoSys. (source: Hanke, 2010)

In HamNoSys 4.0, non-manual information is supported in additional tiers synchronized to the manual movements and separated into: Shoulders, Body, Head, Gaze, Facial Expression, and Mouth. SiGML follows a similar structure for the representation of non-manuals. In particular, focusing on the face only, the Facial Expression category includes information about the eyebrows, eyelids, and nose (Fig. 11a), and the Mouth category includes a set of static mouth pictures based on the Speech Assessment Methods Phonetic Alphabet (SAMPA) (Wells, 1997) to be used for lexical facial expressions, and a…
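To picture the tier structure just described (a linear manual transcription plus synchronized non-manual tiers for shoulders, body, head, gaze, facial expression, and mouth), here is a minimal Python sketch. It is an illustration only, not SiGML or HamNoSys syntax; all field names and values are hypothetical.

```python
# Illustrative sketch only: a plain data structure mirroring the tier layout
# described above (manual description plus per-sign non-manual tiers).
# This is NOT SiGML or HamNoSys syntax; field names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, Optional

TIERS = ["shoulders", "body", "head", "gaze", "facial_expression", "mouth"]

@dataclass
class SignEntry:
    gloss: str
    hamnosys: str  # linear manual transcription (handshape, orientation, location, actions)
    nonmanual: Dict[str, Optional[str]] = field(default_factory=dict)  # one value per tier, per sign

def make_sign(gloss: str, hamnosys: str, **tier_values: str) -> SignEntry:
    """Build a sign with the six non-manual tiers; unspecified tiers stay neutral (None)."""
    nonmanual = {tier: tier_values.get(tier) for tier in TIERS}
    return SignEntry(gloss=gloss, hamnosys=hamnosys, nonmanual=nonmanual)

# Example: a sign within a yes/no question, with raised eyebrows on the
# facial-expression tier and a mouthing label on the mouth tier.
# All values are made up for illustration.
sign = make_sign(
    gloss="HAMBURG",
    hamnosys="<linear HamNoSys symbol string>",
    facial_expression="eyebrows_raised",
    mouth="hamburk",  # hypothetical mouthing label
)
print(sign.nonmanual)
```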


Pros

•  Make use of pre-existing annotated corpora in HamNoSys.

•  The notation can represent multiple sign languages.

•  It supports a large set of symbols for representing mouth patterns.

Page 15

Data Resources

Corpora

•  Domain-oriented video collections, e.g.:
   –  train announcements
   –  transportation route descriptions
   –  descriptions of places and activities

•  DictaSign (2014) is working on:
   –  story telling
   –  discussion
   –  negotiation

Cons
In HamNoSys, the eyebrow height in a syntactic facial expression applied over multiple signs is described as raised, lowered, or neutral over each of the signs in the sentence, without specifying dynamic changes within the duration of a sign or between signs.
•  Facial expressions solely annotated in HamNoSys cannot be used directly to train models of facial movements.
•  Greater precision in capturing the dynamic changes of facial movements at a fine-grained time scale is needed.
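A small Python sketch, purely illustrative and not taken from any surveyed system, makes the granularity gap concrete: two very different frame-level eyebrow-height trajectories collapse to the same per-sign labels, so the coarse annotation alone cannot recover the dynamics needed to drive an animation model.

```python
# Illustrative only: coarse per-sign labels ("raised"/"lowered"/"neutral")
# underdetermine the frame-level eyebrow trajectory needed for animation.
import numpy as np

FRAMES_PER_SIGN = 30  # hypothetical frame count per manual sign

def per_sign_labels(trajectory: np.ndarray, threshold: float = 0.2) -> list[str]:
    """Collapse a frame-level eyebrow-height track (0 = neutral, positive =
    raised) into one coarse label per sign, HamNoSys-style."""
    labels = []
    for start in range(0, len(trajectory), FRAMES_PER_SIGN):
        mean_height = trajectory[start:start + FRAMES_PER_SIGN].mean()
        if mean_height > threshold:
            labels.append("raised")
        elif mean_height < -threshold:
            labels.append("lowered")
        else:
            labels.append("neutral")
    return labels

t = np.linspace(0, 1, 3 * FRAMES_PER_SIGN)        # a 3-sign sentence
plateau = np.full_like(t, 0.8)                    # eyebrows held high throughout
oscillating = 0.5 + 0.4 * np.sin(6 * np.pi * t)   # eyebrows bobbing up and down

print(per_sign_labels(plateau))       # ['raised', 'raised', 'raised']
print(per_sign_labels(oscillating))   # ['raised', 'raised', 'raised']
# Same labels, very different movements: the annotation cannot drive
# a model of the fine-grained facial dynamics.
```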

Hernisa Kacorri Graduate Center CUNY 15/35


Page 16

Evaluation

Focus Group: Ebling and Glauert (2013)
•  7 native signers
•  9 German Sign Language train announcements (+ rhetorical questions)

(-): Comprehensibility of the animations was not evaluated.

The study indicated time-synchronization issues between the facial expressions and the hand movements.

User Study: Smith and Nolan (2013)
•  15 participants
•  5 Irish Sign Language stories (+ 7 emotions)

(-): No upper baseline (e.g. a video of a human signer) was included.

The study found that enhancing the avatars with facial expressions did not increase comprehensibility (instead, a small but non-significant decrease was observed).

Hernisa Kacorri Graduate Center CUNY 16/35


Page 17

VCOM3D Software

Input: a user-built timeline of words with overlaid facial expressions selected from a vocabulary.
Output: an animation.

Witt et al. (2003); Hurdich (2008)

Hernisa Kacorri Graduate Center CUNY 17/35


Page 18

Approach

Pros
•  Includes a library of ASL facial expressions that can be applied over a single sign or over multiple manual signs, with internal transition rules.

Cons
•  Small repertoire of facial expressions available.
•  Users cannot adjust the intensity of a facial expression or create a new one.
•  The system does not allow for overlapping or co-occurring facial expressions.
•  Simplistic time-warping: hold a static pose or simply loop the key-frames.

Hernisa Kacorri Graduate Center CUNY 18/35

The main limitation of VCom3D's approach to facial expressions is the lack of sufficient expressive control over the facial expressions in their sentence-scripting interface. The system does not allow for overlapping or co-occurring facial expressions. For example, an animated ASL sentence with a yes/no question facial expression applied throughout cannot at the same time convey emotion, include adverbial facial expressions, or include a negation. Such combinations are necessary in fluent ASL sentences.

Figure 14: An example of a timing diagram that shows how a facial expression consisting of three key-frames is applied to the movements of the hands (linear interpolation between the key-frames, looped time warping, transitions, and holds). (source: DeWitt et al., 2003)

4.2.2 Data Resources

The authors do not explicitly mention the data sources they used to create the facial expressions, though it is likely that an animator created them with the guidance of ASL videos in which these facial expressions were performed and with the support of a native ASL signer on the VCOM3D team.

(Figure 14 source: U.S. Patent US 6,535,215 B1, Mar. 18, 2003, Fig. 11: transition, action, and hold phases of a looping expression animation within the animation engine cycle.)


Page 19

Data Resources

(-): The authors do not explicitly mention the data sources they used to create the facial expressions.

It is likely that an animator created the facial expressions with the guidance of ASL videos in which these facial expressions were performed and with the support of a native ASL signer on the VCOM3D team.

Hernisa Kacorri Graduate  Center  CUNY   19/35

prior published work in which this was done). Therefore, we

consider prior work on ASL videos to determine the eye-tracking

metrics we should examine and the hypotheses we should test.

While Muir and Richardson [27] did not study sign language

animation, they observed changes in proportional fixation time on

the face of signers when the visual difficulty of videos varied.

Thus, we decided to examine the proportional fixation time on the

signer's face. Since there is some imprecision in the coordinates

recorded from a desktop-mounted eye-tracker, we decided not to

track the precise location of the signer's face at each moment in

time during the videos. Instead, we decided to define an AOI that

consists of a box that contains the entire face of the signer in

approximately 95% of the signing stories. (We never observed

the signer’s nose leaving this box during the stories.) Details of

the AOIs in our study can be found in section 4.

The problem with examining only the proportional fixation time

metric is that it does not elucidate whether the participant: (a)

stared at the face for a long time and then stared at the hands for a

long time or (b) often switched their gaze between the face and

the hands during the entire story. Both types of behaviors could

produce the same proportional fixation time value. Thus, we also

decided to define a second AOI over the region of the screen

where the signer's hands may be located, and we record the

number of “transitions” between the face AOI and the hands AOI

during the sign language videos and animations.

Since prior researchers have recorded that native signers viewing

understandable videos of ASL focus their eye-gaze almost

exclusively on the face, we make the supposition that if a

participant spends time gazing at the hands (or transitioning

between the face and hands), then this might be evidence of non-

fluency in our animations. It could indicate that the signer’s face

is not giving the participant useful information (so there is no

value in looking at it), or it could indicate that the participant is

having some difficulty in recognizing the hand shape/movement

for a sign (so participants need to direct their gaze at the hands).

In [7], less skilled signers were more likely to transition their gaze

to the hands of the signer. If we make the supposition that this is

a behavior that occurs when a participant is having greater

difficulty understanding a message, then we would expect more

transitions in our lower-quality or hard-to-understand animations

or videos. While [7] also noted eye-gaze at locative classifier

constructions by both skilled and unskilled signers, the stimuli in

our study do not contain classifier constructions (complex signs

that convey 3D motion paths or spatial arrangements).

Based on these prior studies, we hypothesize the following:

•  H1: There is a significant difference in native signers’ eye-movement behavior between when they view videos of ASL and when they view animations of ASL.

•  H2: There is a significant difference in native signers’ eye-movement behavior when they view animations of ASL with some facial expressions and when they view animations of ASL without any facial expressions.

•  H3: There is a significant correlation between a native signer’s eye-movement behavior and the scalar subjective scores (grammatical, understandable, natural) that the signer assigns to an animation or video.

•  H4: There is a significant correlation between a native signer’s eye-movement behavior and the signer reporting having noticed a facial expression in a video or animation.

•  H5: There is a significant correlation between a native signer’s eye-movement behavior and the signer correctly answering comprehension questions about a video or animation.

Each hypothesis above will be examined in terms of the following

two eye-tracking metrics: proportional fixation time on the face

and transition frequency between the face and body/hands. Based

on the results of H1, we will determine whether to consider video

separately from animations for H3 to H5. Similarly, results from

H2 will determine if animations with facial expressions are

considered separately from animations without, for H3 to H5.
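The excerpt does not specify the statistical procedures at this point. As a hedged illustration only, group differences of the kind posited in H1 and H2 could be checked with a non-parametric comparison, and the correlations in H3-H5 with a rank correlation; the function and variable names below are placeholders, not the authors' analysis code:

    from scipy.stats import mannwhitneyu, spearmanr

    def test_group_difference(metric_condition_a, metric_condition_b):
        """H1/H2-style check: do two viewing conditions differ on one eye-tracking
        metric (e.g., proportional fixation time on the face)?"""
        stat, p = mannwhitneyu(metric_condition_a, metric_condition_b, alternative="two-sided")
        return stat, p

    def test_correlation(metric_values, scores):
        """H3/H4/H5-style check: is an eye-tracking metric correlated with a
        subjective score, a 'notice' response, or comprehension accuracy?"""
        rho, p = spearmanr(metric_values, scores)
        return rho, p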

4. USER STUDY

To evaluate hypotheses H1-H5, we conducted a user study, where

participants viewed short stories in ASL performed by either a

human signer or an animated character. In particular, each story

was one of three types: a “video” recording of a native ASL

signer, an animation with facial expressions based on a “model,”

and an animation with a static face (no facial expressions) as

shown in Fig. 1. Each “model” animation contained a single ASL

facial expression (yes/no question, wh-word question, rhetorical

question, negation, topic, or an emotion), based on a simple rule:

apply one facial expression over an entire sentence, e.g. use a

rhetorical-question facial expression during a sentence asking a

question that doesn’t require an answer. Additional details of the

facial expressions in our stimuli appear in [20, 24].

Fig. 1: Screenshots from the three types of stimuli: i) video of human signer, ii) animation with facial expressions, and iii) animation without facial expressions.

A native ASL signer wrote a script for each of the 21 stories,

including one of six types of facial expressions. To produce the

video stimuli, we recorded a second native signer performing

these scripts in an ASL-focused lab environment, as illustrated in

[24]. Then another native signer created both the model and no

facial expressions animated stimuli by consulting the recorded

videos and using some animation software [33]. The video size,

resolution, and frame-rate for all stimuli were identical.

During the study, after viewing a story, each participant

responded to three types of questions. All questions were

presented onscreen (embedded in the stimuli interface) as HTML

forms, as shown in Fig. 2, to minimize possible loss of tracking

accuracy due to head movements of participants between the

screen and a paper questionnaire. On one screen, they answered 1-to-10 Likert-scale questions: three subjective evaluation questions

(of how grammatically correct, easy to understand, and naturally

moving the signer appeared) and a “notice” question (1-to-10

from “yes” to “no” in relation to how much they noticed an

emotional, negative, questions, and topic facial expression during

the story). On the next screen, they answered four comprehension

questions on a 7-point Likert scale from “definitely no” to

“definitely yes.” Given that facial expressions in ASL can

differentiate the meaning of identical sequences of hand

movements [28], both stories and comprehension questions were

engineered in such a way that the wrong answers to the

comprehension questions would indicate that the participants had

misunderstood the facial expression displayed [20]. E.g. the

comprehension-question responses would indicate whether a

participant had noticed a “yes/no question” facial expression or

instead had considered the story to be a declarative statement.


Evaluation User Study (Sims, 2000; Sims and Silverglate, 2002; Hurdich, 2008)

•  Deaf children aged 5-6 (classroom setting)
•  The authors say that “English comprehension among young Deaf learners improved from 17% to 67%”.

(-): The study did not describe the methodology or how they calculated these numbers.

Hernisa Kacorri Graduate  Center  CUNY   20/35



DePaul Project
Input: linguistic findings and video analysis for eyebrow movements.
Output: animated example sentences with syntactic facial expressions with or without co-occurrence of affect.

Wolfe et al. (2011); Schnepp et al. (2010, 2012) Hernisa Kacorri Graduate  Center  CUNY   21/35


Approach

Pros
•  Use of exemplar curves for eyebrow movements based on linguistic findings from videos of native signers.
•  Co-occurrence of the syntactic and affective facial expressions.
•  Artistic facial wrinkling to reinforce eyebrow signal.

Cons
•  Assumes symmetrical vertical movements for both left and right eyebrow.
•  No detailed controls (inner, middle, and outer eyebrow).
•  No horizontal movements of eyebrows.
•  An artist created animations that follow the exemplar.
•  No time warping evidence – 2 example sentences only.
•  No head, eye aperture, and nose movements in modeling.

Hernisa Kacorri Graduate Center CUNY 22/35


syntactic facial expressions also involve other facial and head parameters, such as head position, head

orientation, and eye aperture. For example, while topic and yes/no-questions share similar eyebrow

movements, they can be differentiated based on the head movements. However, the authors have not

extended their animation models to include these controls.

Figure 15: The ASL sentence “How many books do you want?” (source: Wolfe et al., 2011)

4.3.2 Data Resources

In addition to discussing earlier ASL linguistic research (e.g., Boster, 1996; Wilbur, 2003; and

Crasborn et al. 2006) that had investigated the contribution of eyebrow movements and their intensity

in syntactic and affective facial expressions, the authors consolidated the work of Grossman and Kegl (2006) and Weast (2008), which provides greater precision about the dynamic changes in eyebrow vertical position (an example is shown in Fig. 16).

Grossman and Kegl (2006) recorded 2 native signers performing 20 ASL sentences in 6 different ways, based on the facial expression category to be investigated: neutral, angry, surprise, quizzical, y/n question, and wh-question. Then they averaged the common eyebrow vertical movements (among other features) for each facial expression category. One limitation of Grossman and Kegl's approach (as used by the DePaul researchers) is that they could have benefited from applying time warping techniques before averaging, since the sentences under consideration had different time durations.
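A minimal sketch of the kind of time normalization suggested here: before averaging, each eyebrow-height curve is linearly resampled to a common number of frames (dynamic time warping would be a more flexible alternative). The per-frame curve format is an assumption for illustration:

    import numpy as np

    def resample_curve(curve, n_points=100):
        """Linearly resample a 1-D eyebrow-height curve to a fixed length, so that
        sentences of different durations can be compared frame by frame."""
        curve = np.asarray(curve, dtype=float)
        old_t = np.linspace(0.0, 1.0, num=len(curve))
        new_t = np.linspace(0.0, 1.0, num=n_points)
        return np.interp(new_t, old_t, curve)

    def average_curves(curves, n_points=100):
        """Average several eyebrow-height curves after normalizing their durations."""
        resampled = np.stack([resample_curve(c, n_points) for c in curves])
        return resampled.mean(axis=0)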


Data Resources

Grossman and Kegl (2006)
•  2 native signers performing 20 ASL sentences in 6 different ways (neutral, angry, surprise, quizzical, y/n question, and wh-question)

(-): time warping is needed before averaging given that the sentences varied in length.

Weast (2008)

•  In the presence of some types of emotional affect, the eyebrow height range for the yes/no-questions and wh-questions is compressed.

(-): the compression factor 25% in Wolfe et al. (2011) is not well justified.

Hernisa Kacorri Graduate Center CUNY 23/35


Wolfe et al. (2011) also based their animation algorithm for handling co-occurrence of syntactic and emotional facial expressions on the findings of Weast (2008), who found that, in the presence of some types of emotional affect, the eyebrow height range for yes/no-questions and wh-questions is compressed. However, it seems that the selection of the numerical compression factor in Wolfe et al.'s (2011) animation algorithm was arbitrary.
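Since the exact combination rule is not reproduced in the material reviewed here, the following is only an illustrative sketch of how such a compression factor might be applied when a question facial expression co-occurs with affect. The 0.25 value echoes the 25% figure discussed above, and whether the factor scales the syntactic offset down to 25% or by 25% is left open by the sources; all names are hypothetical:

    def combined_brow_height(affect_height, question_offset, compression=0.25):
        """Sketch of co-occurrence: start from the eyebrow height driven by the
        emotional affect and add a compressed version of the syntactic
        (yes/no- or wh-question) eyebrow offset.

        compression=0.25 mirrors the 25% figure criticized above; the rule
        actually used by Wolfe et al. (2011) may differ."""
        return affect_height + compression * question_offset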

Figure 16: Eyebrow height on wh-questions, angry, and quizzical expressions. (source: Grossman and Kegl, 2006)

4.3.3 Evaluation

To test the feasibility of their approach, the authors conducted a user study (Schnepp et al. 2010, 2012)

with an ASL sentence (shown in Fig. 15) where the wh-question co-occurred with a positive emotion,

such as happiness, and a negative emotion, such as anger. Participants were asked to repeat the

sentence and assign a graphical Likert-scale score for the emotional state and a 1-5 Likert-scale score

for its clarity. Both studies were limited to the same, single short stimulus. A larger and more diverse stimuli set (sentences of different lengths, different locations of the wh-word, etc.) would be required for a statistical analysis. The authors could also have benefited from including in their study a lower baseline to compare with their animations, e.g. a wh-question with a neutral emotional state,

or by including videos of a native signer as an upper baseline for comparison. These enhancements to

the study design would have made their results more comparable with future work. Another


Evaluation User Study (Schnepp et al. 2010, 2012)

•  40 participants (20 online)
•  2 stimuli
   –  wh-question (happiness, anger)
   –  “Cha” (happiness, anger)
•  Scores:
   –  Repeat the sentence
   –  Emotional state Likert-scale score
   –  Clarity Likert-scale score

Cons
•  A bigger cardinality and diversity in the stimuli is necessary for statistical analysis.
•  No baseline (lower or upper).
•  The two treatments differ not only on the eyebrow and wrinkling but also mouth shape.
•  No comprehension questions – repeat of stimulus is a non-conventional evaluation approach.

Hernisa Kacorri Graduate Center CUNY 24/35


methodological concern is that it appears that in Schnepp et al. (2012), the facial expressions of the

two stimuli (happy vs. angry) did not differ only in their eyebrow position and wrinkling. They also

differed in the mouth shapes that conveyed the emotion (Fig. 17). This would make it rather difficult

to conclude that the participants perceived the intended affect in the animations solely due to the quality of the authors' co-occurrence algorithm for the eyebrow movements. The mouth, a point on the face

where Deaf people tend to focus during signing (Emmorey et al., 2009), could have driven the results

instead.

Figure 17: Co-occurrence of wh-question with emotions (source: Schnepp et al., 2012).

4.4 SignCom

The SignCom project seeks to build an animation system that combines decomposed motion capture

data from human signers in French Sign Language. The system architecture, proposed by Gibet et al.

(2011), incorporates a multichannel framework that allows for on-line retrieval from a motion-capture

database of independent information for each of the different parts of the body (e.g., hands, torso,

arms, head, and facial features) that can be merged to compose novel utterances in French Sign

Language. Their focus on synthesis of facial expressions lies at the level of mapping facial mocap

markers to values of animation controls in the avatar’s face (these puppetry controls for the face are

sometimes referred to as “blendshapes”), which are designed by an animator to configure the facial

geometry of the avatar, e.g. vertical-mouth-opening, left-eye-closure.

Combining Emotion and Facial Nonmanual Signals in Synthesized American Sign Language

Jerry Schnepp, Rosalee Wolfe, John C. McDonald

School of Computing, DePaul University 243 S. Wabash Ave., Chicago, IL 60604 USA

+1 312 362 6248

{jschnepp,rwolfe,jmcdonald}@cs.depaul.edu

Jorge Toro, Department of Computer Science, Worcester Polytechnic Institute

100 Institute Road Worcester, MA 01609 USA

[email protected]

ABSTRACT Translating from English to American Sign Language (ASL) requires an avatar to display synthesized ASL. Essential to the language are nonmanual signals that appear on the face. Previous avatars were hampered by an inability to portray emotion and facial nonmanual signals that occur at the same time. A new animation system addresses this challenge. Animations produced by the new system were tested with 40 members of the Deaf community in the United States. For each animation, participants were able to identify both nonmanual signals and emotional states. Co-occurring question nonmanuals and affect information were distinguishable, which is particularly striking because the two processes can move an avatar’s brows in opposing directions.

Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Processing – language generation, machine translation; K.4.2 [Computers and Society]: Social Issues – assistive technologies for persons with disabilities.

General Terms Design, Experimentation, Human Factors, Measurement.

Keywords Accessibility Technology, American Sign Language

1. INTRODUCTION An automatic English-to-ASL translator would help bridge the communication gap between the Deaf and hearing communities. Text-based translation is incapable of portraying the language of ASL. A video-based solution lacks the flexibility needed to dynamically combine multiple linguistic elements. A better approach is the synthesis of ASL as animation via a computer-generated signing avatar. Several research efforts are underway to portray sign language as 3D animation [1][2][3][4], but none of them have addressed the necessity of portraying affect and facial nonmanual signals simultaneously.

2. FACIAL NONMANUAL SIGNALS Facial nonmanual signals appear at every linguistic level of ASL [5]. Some nonmanual signals carry adjectival or adverbial information. Figure 1 shows the adjectival nonmanuals OO (small) and CHA (large) demonstrated by our signing avatar.

Nonmanual OO – “small size” Nonmanual CHA – “large size”

Figure 1: Nonmanual signals indicating size

Other nonmanuals operate at the sentence level [6]. For example, raised brows indicate yes/no questions and lowered brows indicate WH-type (who, what, when, where, and how) questions.

Affect is another type of facial expression which conveys emotion and often occurs in conjunction with signing. While not strictly considered part of ASL, Deaf signers use their faces to convey emotions [7]. Figure 2 demonstrates how a face can convey affect and a WH-question simultaneously.

WH-question, happy WH-question, angry

Figure 2: Co-occurrence



SignCom Project
Input: motion capture data from body and facial movements.
Output: puppetry animation of the avatar’s face movements driven by the motion capture data using machine learning.

Gibet et al. (2011) Hernisa Kacorri Graduate  Center  CUNY   25/35

These themes limit the material that can be discussed in elicitation sessions to a narrow vocabulary. Discussed in long interactions, the signer provides a large number of tokens relative to the narrow focus. The Cocktail story section, measuring roughly one third of the overall corpus, contains the tokens shown in Table 1, among others. With this variety and frequency of cocktail-related lexemes, we are able to produce a number of novel utterances around the same subject. For example, Figure 1 shows a sequence we have constructed from various single signs and sign phrases. The final result is interpreted as

I asked the next friend what (s)he wanted. (S)he responded, “eh, I don’t like fruity drinks, so I don’t really know. What do you suggest?” “I’d suggest a cocktail named Cuba Libre,” I said. I gave it to her and (s)he took it. “Great!”

Note that constructing this utterance requires selecting signs from various parts of the corpus. The movements of two signs were inverted phonologically to evoke a contrary meaning. The purposeful inclusion of such directional signs was intended for such an utterance, and is detailed in Section 3.3., below. Finally, as there is a necessary balance of control within variability for avatar projects, signing avatar corpora do not provide the level of variation needed for a sociolinguistic study.

Table 1: The tokens of highest occurrence in the Cocktail story section of the SignCom corpus.
14x WHAT, 9x VARIOUS, 8x COCKTAIL, 8x DRINK (n.), 8x EVENING, 8x FRUIT, 8x POUR, 7x GLASS, 7x WANT, 4x FILL, 3x JUICE, 3x ORANGE, 3x VODKA, 2x RUM, 2x SUGGEST

Gloss sequence from Figure 1: SUIVANT TOI VOULOIR QUOI (c/r) EUH MOI AIMER-PAS BOISSON FRUIT TOI PROPOSER-1 QUOI (c/r) MOI PROPOSER-2 COCKTAIL NOM GUILLEMENTS CUBA LIBRE (c/r) DONNER PRENDRE GENIALE

Figure 1: Signs can be rearranged to create novel phrases. Here, signs are retrieved from two different recording takes (white and gray backgrounds) and linked with transitions created by our animation engine (striped background). The sign AIMER (“like”) is reversed to create AIMER-PAS (“dislike”), as is DONNER (“give”) to create PRENDRE (“take”). Finally, a role shift, shown as (c/r), is included in one transition to ensure discourse accuracy and comprehension.
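As a trivial illustration of the phonological inversion described in the caption (playing a recorded directional sign backwards to obtain its opposite, e.g. AIMER reversed into AIMER-PAS), assuming a sign's motion data is stored as a chronologically ordered list of frames:

    def reverse_sign(frames):
        """Play a reversible indicating-verb clip backwards, e.g. to turn
        DONNER ('give') into PRENDRE ('take'). `frames` is assumed to be a
        chronologically ordered sequence of mocap frames."""
        return list(reversed(frames))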

3.2. Open-Ended vs. Scripted

Anonymity in contributions to signed language corpora has been an important conversation within the Deaf communities that support this type of research. At the most basic level, given the face’s active involvement in the signing event it is impossible to hide the identity of the signer. Linguistic data has thus been subject to tight controls regarding rights releases to allow data analysis among researchers, as well as data publishing to wider and/or public audiences. This topic becomes even more sensitive when open-ended questions are used to elicit stories for linguistic corpora. Existing corpora use guiding topics to elicit personal responses, which may include reports of abuse or other illegal activities; eventually such data would require censorship when making corpora public. As signing avatars almost inevitably become publicly viewable, researchers aim to avoid controversial topics in recording sessions. As an added benefit, the avatar medium aids in anonymizing elicited data by providing a new face and body for the signer. Figure 2 shows our language consultant alongside the avatar that replays her signing in our animation system.

Figure 2: Avatars provide new identities to signers without covering the face, an important articulator for the signing event.

3.3. Experiments in altering phonological components

Our specific research interests brought us to include a number of indicating verbs and depicting verbs in the SignCom corpus. Among our scientific inquiries are the questions of whether playing reversible indicating verb motions backwards will be convincing and whether altering the handshape of a stored depicting verb will be understood as a change in meaning. For example, an LSF signer can reverse the movement of the LSF sign AIMER (“like”) to produce the meaning



Approach

Pros
•  Dimensionality reduction to learn the corresponding 50 blendshape weights from the 123 features of a human.
•  Detailed facial features (FACS-like)
   –  43 markers in the 3D space: 123 features

Cons

•  Puppetry control of the avatar based on one human signer – no automatic synthesis.

•  Tongue movements are not captured.

•  Marker-occlusion may occur.

Hernisa Kacorri Graduate Center CUNY 26/35



4.4.1 Approach

The authors recorded the facial movement of native signers with 43 facial motion capture markers

(Fig. 18a) resulting in 123 features when considering the marker’s values in the 3D space. The values

of the markers (calculated in a common frame) were normalized based on their relative distance to the upper-nose sensors, which the authors considered to remain unchanged during most of the face deformations. To map these features to the values of 50 blendshapes in the geometrical model of their avatar, the authors turned to probabilistic inference and used Gaussian Process Regression to learn the

corresponding blendshape weights from the mocap values. As discussed in Section 2, this approach

wouldn’t be necessary had the facial features been extracted in an MPEG-4 format and used to drive an

MPEG-4 compatible avatar. It is also unclear whether this approach would require motion data

recorded from different signers (different face geometry, signing style, etc.) to be treated separately.

Also, the use of motion-capture sensors is a time-consuming approach for recording a large corpus of facial expressions, compared to the alternative of applying computer-vision software to pre-existing video recordings of native signers.
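A minimal sketch of the described mapping, for illustration only: marker coordinates are expressed relative to an upper-nose reference marker, and a Gaussian Process Regression model is trained to predict the avatar's blendshape weights. The array shapes and the scikit-learn estimator are assumptions; the SignCom system itself relies on a sparse online Gaussian Process approximation rather than the exact regressor used here:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def normalize_markers(markers_xyz, nose_index):
        """markers_xyz: (n_frames, n_markers, 3) facial marker positions.
        Express every marker relative to the upper-nose marker, which is assumed
        to stay (nearly) rigid during facial deformations, then flatten each frame
        into a feature vector (the paper reports 123 marker features)."""
        nose = markers_xyz[:, nose_index:nose_index + 1, :]
        relative = markers_xyz - nose
        return relative.reshape(len(markers_xyz), -1)

    def fit_marker_to_blendshape_model(markers_xyz, blendshape_weights, nose_index=0):
        """blendshape_weights: (n_frames, 50) weights authored for the avatar's face.
        Returns a regressor that maps new marker configurations to weights."""
        X = normalize_markers(markers_xyz, nose_index)
        model = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        model.fit(X, blendshape_weights)
        return model

    # Usage sketch: weights = model.predict(normalize_markers(new_markers, nose_index))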

Figure 18: (a) Motion-capture sensors on a native signer’s face and (b) blended faces of the avatar driven by the values of the facial markers. (source: Gibet et al., 2011)


Fig. 2. Left, our native signer poses with motion capture sensors on her face and hands; right, our virtual signer in a different pose.

whether or not it contacts another part of the hand. This calls for notable accuracy in the motion capture and data animation processes.

Nonmanual components. While much of our description focuses on hand configuration and motion, important nonmanual components are also taken into account, such as shoulder motions, head swinging, changes in gaze, or facial mimics. For example, eye gaze can be used to recall a particular object in the signing space; it can also be necessary to the comprehension of a sign, as in READ(v), where the eyes follow the motion of fingers as in reading. In the case of facial mimics, some facial expressions may serve as adjectives (i.e., inflated cheeks make an object large or cumbersome, while squinted eyes make it thin) or indicate whether the sentence is a question (raised eyebrows) or a command (frowning). It is therefore very important to preserve this information during facial animation.

3.3. Data Conditioning and Annotation

The motion capture system used to capture our data employed Vicon MX infrared camera technology at frame rates of 100 Hz. The setup was as follows: 12 motion capture cameras, 43 facial markers, 43 body markers, and 12 hand markers. The photo at left of Figure 2 shows our signer in the motion capture session, and at right we show the resulting virtual signer.

In order to replay a complete animation and have motion capture data available for analysis, several postprocessing operations are necessary. First, finger motion was reconstructed by inverse kinematics, since only the fingers’ end positions were recorded. In order to animate the face, cross-mapping of facial motion capture data and blendshape parameters was performed [Deng et al. 2006a]. This technique allows us to animate the face directly from the raw motion capture data once a mapping pattern has been learned. Finally, since no eye gazes were recorded during the informant’s performance, an automatic eye gaze animation system was designed.

We also annotated the corpus, identifying each sign type found in the mocap data with a unique gloss so that each token of a single type can be easily compared. Other annotations follow a multitier template which includes a phonetic description of the signs [Johnson and Liddell 2010], and their grammatical class [Johnston 1998]. These phonetic and grammatical formalisms may be adapted to any sign language and therefore the multimodal animation system, which uses a scripting language based on such linguistics models, can be used for other sign language corpora and motion databases.



Fig. 10. Results of the facial animation system. Some examples of faces are shown, along with the corresponding marker positions projected in 2D space.

In our approach, unknown sites correspond to new facial marker configurations (as produced by the previously described composition process), and the corresponding estimated value is a vector of blendshape weights. Since the dimensions of the learning data are rather large (123 for marker data and 50 for the total amount of blendshapes in the geometric model we used), we rely on an online approximation method of the distribution that allows for a sparse representation of the posterior distribution [Csato and Opper 2002]. As a preprocess, facial data is expressed in a common frame that varies minimally with respect to face deformations. The upper-nose point works well as a fixed point relative to which the positions of the other markers can be expressed. Secondly, both facial mocap data and blendshape parameters were reduced and centered before the learning process.

Figure 10 shows an illustration of the resulting blended faces along with the different markers used for capture.

4.5. Eye Animation

Our capture protocol was not able to capture the eye movements of the signer, even though it is well-known that the gaze is an important factor of nonverbal communication and is of assumed importance to signed languages. Recent approaches to model this problem rely on statistical models that try to capture the gaze-head coupling [Lee et al. 2002; Ma and Deng 2009]. However, those methods only work for a limited range of situations and are not adapted to our production pipeline. Other approaches, like the one of Gu and Badler [2006], provide a computational model to predict visual attention. Our method follows the same line as we use a heuristic synthesis model that takes the neck’s motion as produced by the composition process as input and generates eye gazes accordingly. First, from the angular velocities of the neck, visual targets are inferred by selecting times when the velocity passes below a given threshold for a given time period. Gazes are then generated according to those targets such that eye motions anticipate neck motion by a few milliseconds [Warabi 1977]. This anticipatory mechanism provides a baseline for eye motions, to which glances towards the interlocutor (camera) are added whenever the neck remains stable for a given period of time. This ad hoc model thus integrates both physiological aspects (modeling of the vestibulo-ocular reflex) and communication elements (glances) by the signer. Figure 11 shows two examples of eye gazes generated by our approach. However, this simple computational model fails to reproduce some functional aspects of the gaze in signed languages, such as referencing elements in the signing space. As suggested in the following evaluation, this factor was not critical with regards to the overall comprehension and believability of our avatar, but can be an area of enhancement in the next version of our model.
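A rough sketch of the heuristic described above, not the authors' implementation: gaze-shift targets are inferred at moments where the neck's angular velocity stays below a threshold for a minimum duration, and the corresponding eye motion is scheduled slightly ahead of the neck. All thresholds and the per-frame input format are placeholders:

    def infer_gaze_targets(neck_angular_velocity, frame_rate_hz=100,
                           velocity_threshold=0.2, min_duration_s=0.3,
                           anticipation_s=0.05):
        """neck_angular_velocity: per-frame magnitude of the neck's angular velocity.
        Returns frame indices at which a gaze shift could start, scheduled a few
        milliseconds earlier than the detected neck event (eyes anticipate the head)."""
        min_frames = int(min_duration_s * frame_rate_hz)
        anticipation_frames = int(anticipation_s * frame_rate_hz)
        targets = []
        run = 0
        for i, v in enumerate(neck_angular_velocity):
            run = run + 1 if v < velocity_threshold else 0
            if run == min_frames:  # velocity has stayed low long enough: a visual target
                targets.append(max(0, i - min_frames - anticipation_frames + 1))
        return targets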




Data Resources
SignCom corpus (Duarte and Gibet, 2010)
•  Three scripted elicitation sessions
   –  each included 50-200 signs
   –  a narrow vocabulary (~50 unique signs)
   –  multiple occurrence of particular signs

(-): One signer
(-): The authors do not describe the categories of facial expressions in their corpus.

Hernisa Kacorri Graduate  Center  CUNY   27/35


38

4.4.1 Approach

The authors recorded the facial movement of native signers with 43 facial motion capture markers

(Fig. 18a) resulting in 123 features when considering the marker’s values in the 3D space. The values

of the markers (calculated in a common frame) were normalized based on their relative distance to the

upper nose sensors considered by the authors to remain unchanged during most of the face

deformations. To map these features to the values of 50 blendshapes in the geometrical model of their

avatar the authors considered probabilistic inference and used Gaussian Process Regression to learn the

corresponding blendshape weights from the mocap values. As discussed in Section 2, this approach

wouldn’t be necessary had the facial features been extracted in an MPEG-4 format and used to drive an

MPEG-4 compatible avatar. It is also unclear whether this approach would require motion data

recorded from different signers (different face geometry, signing style, etc.) to be treated separately.

Also, the use of motion capture sensors is a time consuming approach for recording a big corpus of

facial expressions when compared to the alternative of applying computer-vision software to pre-

existing video recordings of native signers.

a

b

Figure 18: (a) Motion-capture sensors on a native signer’s face and (b) blended faces of the avatar driven by the

values of the facial markers. (source: Gibet et al., 2011)

6:8 S. Gibet et al.

Fig. 2. Left, our native signer poses with motion capture sensors on her face and hands; right, our virtualsigner in a different pose.

whether or not it contacts another part of the hand. This calls for notable accuracy inthe motion capture and data animation processes.

Nonmanual components. While much of our description focuses on hand configurationand motion, important nonmanual components are also taken into account, such asshoulder motions, head swinging, changes in gaze, or facial mimics. For example, eyegaze can be used to recall a particular object in the signing space; it can also benecessary to the comprehension of a sign, as in READ(v), where the eyes follow themotion of fingers as in reading. In the case of facial mimics, some facial expressionsmay serve as adjectives (i.e., inflated cheeks make an object large or cumbersome, whilesquinted eyes make it thin) or indicate whether the sentence is a question (raisedeyebrows) or a command (frowning). It is therefore very important to preserve thisinformation during facial animation.

3.3. Data Conditioning and AnnotationThe motion capture system used to capture our data employed Vicon MX infraredcamera technology at frame rates of 100 Hz. The setup was as follows: 12 motioncapture cameras, 43 facial markers, 43 body markers, and 12 hand markers. The photoat left of Figure 2 shows our signer in the motion capture session, and at right we showthe resulting virtual signer.

In order to replay a complete animation and have motion capture data available foranalysis, several postprocessing operations are necessary. First, finger motion was re-constructed by inverse kinematics, since only the fingers’ end positions were recorded.In order to animate the face, cross-mapping of facial motion capture data and blend-shape parameters was performed [Deng et al. 2006a]. This technique allows us toanimate the face directly from the raw motion capture data once a mapping patternhas been learned. Finally, since no eye gazes were recorded during the informant’sperformance, an automatic eye gaze animation system was designed.

We also annotated the corpus, identifying each sign type found in the mocap datawith a unique gloss so that each token of a single type can be easily compared. Otherannotations follow a multitier template which includes a phonetic description of thesigns [Johnson and Liddell 2010], and their grammatical class [Johnston 1998]. Thesephonetic and grammatical formalisms may be adapted to any sign language and there-fore the multimodal animation system, which uses a scripting language based on suchlinguistics models, can be used for other sign language corpora and motion databases.

ACM Transactions on Interactive Intelligent Systems, Vol. 1, No. 1, Article 6, Pub. date: October 2011.

The SignCom System for Data-Driven Animation of Virtual Signers 6:15

Fig. 10. Results of the facial animation system. Some examples of faces are shown, along with the corre-sponding markers position projected in 2D space.

In our approach, unknown sites correspond to new facial marker configurations (asproduced by the previously described composition process), and the corresponding es-timated value is a vector of blendshape weights. Since the dimensions of the learningdata are rather large (123 for marker data and 50 for the total amount of blendshapesin the geometric model we used), we rely on an online approximation method of thedistribution that allows for a sparse representation of the posterior distribution [Csatoand Opper 2002]. As a preprocess, facial data is expressed in a common frame thatvaries minimally with respect to face deformations. The upper-nose point works wellas a fixed point relative to which the positions of the other markers can be expressed.Secondly, both facial mocap data and blendshape parameters were reduced and cen-tered before the learning process.

Figure 10 shows an illustration of the resulting blended faces along with the different markers used for capture.
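A minimal sketch of such a marker-to-blendshape cross-mapping, using scikit-learn's exact Gaussian process regressor as a stand-in for the sparse online approximation of Csató and Opper that the paper relies on; the normalization mirrors the upper-nose reference point described above, while the kernel choice, hyperparameters, and function names are illustrative assumptions rather than details of the SignCom implementation:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def marker_features(frames, nose_idx=0):
    # Express every 3D marker relative to the upper-nose marker and flatten,
    # so the features vary little with global head motion (per the paper's idea).
    rel = frames - frames[:, nose_idx:nose_idx + 1, :]      # (T, n_markers, 3)
    return rel.reshape(len(frames), -1)                     # (T, 3 * n_markers)

def fit_cross_mapping(X_markers, Y_blendshapes):
    # X_markers: (N, 123) marker features, Y_blendshapes: (N, 50) blendshape weights.
    mean, std = X_markers.mean(0), X_markers.std(0) + 1e-8
    gp = GaussianProcessRegressor(
        kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3),
        normalize_y=True)
    gp.fit((X_markers - mean) / std, Y_blendshapes)
    return gp, mean, std

def blendshape_weights(gp, mean, std, new_markers):
    # Predict a full vector of blendshape weights for each new marker configuration.
    return gp.predict((new_markers - mean) / std)           # (T, 50)

The 123-dimensional marker vectors and 50 blendshape weights quoted in the excerpt would correspond to the shapes of X_markers and Y_blendshapes in this sketch.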

4.5. Eye Animation

Our capture protocol was not able to capture the eye movements of the signer, even though it is well-known that the gaze is an important factor of nonverbal communication and is of assumed importance to signed languages. Recent approaches to model this problem rely on statistical models that try to capture the gaze-head coupling [Lee et al. 2002; Ma and Deng 2009]. However, those methods only work for a limited range of situations and are not adapted to our production pipeline. Other approaches, like the one of Gu and Badler [2006], provide a computational model to predict visual attention. Our method follows the same line as we use a heuristic synthesis model that takes the neck's motion as produced by the composition process as input and generates eye gazes accordingly. First, from the angular velocities of the neck, visual targets are inferred by selecting times when the velocity passes below a given threshold for a given time period. Gazes are then generated according to those targets such that eye motions anticipate neck motion by a few milliseconds [Warabi 1977]. This anticipatory mechanism provides a baseline for eye motions, to which glances towards the interlocutor (camera) are added whenever the neck remains stable for a given period of time. This ad hoc model thus integrates both physiological aspects (modeling of the vestibulo-ocular reflex) and communication elements (glances) by the signer. Figure 11 shows two examples of eye gazes generated by our approach. However, this simple computational model fails to reproduce some functional aspects of the gaze in signed languages, such as referencing elements in the signing space. As suggested in the following evaluation, this factor was not critical with regards to the overall comprehension and believability of our avatar, but can be an area of enhancement in the next version of our model.
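The eye-gaze heuristic above can be paraphrased as a small planning routine; this is a sketch under assumed thresholds (velocity threshold, hold duration, anticipation offset, glance interval), none of which are reported in the paper:

import numpy as np

def plan_eye_gaze(neck_angular_velocity, fps=100, vel_threshold=0.15,
                  hold_frames=20, anticipation_ms=80, glance_after_s=1.0):
    # Heuristic gaze planner in the spirit of the description above.
    # All numeric thresholds are illustrative guesses, not the authors' values.
    stable = np.asarray(neck_angular_velocity) < vel_threshold
    anticipation = int(round(anticipation_ms * fps / 1000.0))
    events = []                      # list of (frame, kind) gaze events
    run = 0
    for t, is_stable in enumerate(stable):
        run = run + 1 if is_stable else 0
        if run == hold_frames:
            # Neck velocity stayed below threshold long enough: infer a visual target.
            # The gaze shift starts a few frames before the neck settles (anticipation).
            events.append((max(0, t - hold_frames - anticipation), "gaze_to_target"))
        elif run > 0 and run % int(glance_after_s * fps) == 0:
            # Neck has been stable for a while: add a glance towards the camera.
            events.append((t, "glance_to_camera"))
    return events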



Evaluation

User Study (web-based)
•  25 participants
•  Stimuli: 3 passages
   –  A: manually synthesized facial expr.
   –  B: data-driven facial expr.
   –  C: no facial expr.
•  They did not find any statistically significant differences.

Cons
•  Categories of facial expressions in the stimuli are not mentioned.
   –  Lack of syntactic facial expressions in the stimuli could explain the non-significant differences.
•  No comprehension questions; participants just marked their preference between two versions shown side-by-side.
•  Relatively small number of stimuli.
•  The qualifications of the person who manually created the synthesized facial expressions are not explained.

Hernisa Kacorri Graduate Center CUNY 28/35

These themes limit the material that can be discussed in elicitation sessions to a narrow vocabulary. Discussed in long interactions, the signer provides a large number of tokens relative to the narrow focus. The Cocktail story section, measuring roughly one third of the overall corpus, contains the tokens shown in Table 1, among others. With this variety and frequency of cocktail-related lexemes, we are able to produce a number of novel utterances around the same subject. For example, Figure 1 shows a sequence we have constructed from various single signs and sign phrases. The final result is interpreted as

I asked the next friend what (s)he wanted. (S)he responded, "eh, I don't like fruity drinks, so I don't really know. What do you suggest?" "I'd suggest a cocktail named Cuba Libre," I said. I gave it to her and (s)he took it. "Great!"

Note that constructing this utterance requires selecting signs from various parts of the corpus. The movements of two signs were inverted phonologically to evoke a contrary meaning. The purposeful inclusion of such directional signs was intended for such an utterance, and is detailed in Section 3.3., below. Finally, as there is a necessary balance of control within variability for avatar projects, signing avatar corpora do not provide the level of variation needed for a sociolinguistic study.

Table 1: The tokens of highest occurrence in the Cocktail story section of the SignCom corpus.

14x WHAT        7x WANT
 9x VARIOUS     4x FILL
 8x COCKTAIL    3x JUICE
 8x DRINK (n.)  3x ORANGE
 8x EVENING     3x VODKA
 8x FRUIT       2x RUM
 8x POUR        2x SUGGEST
 7x GLASS

SUIVANT TOI VOULOIR QUOI (c/r) EUH MOI AIMER-PAS BOISSON FRUIT
TOI PROPOSER-1 QUOI (c/r) MOI PROPOSER-2 COCKTAIL NOM GUILLEMENTS CUBA LIBRE
DONNER PRENDRE GENIALE (c/r)

Figure 1: Signs can be rearranged to create novel phrases. Here, signs are retrieved from two different recording takes (white and gray backgrounds) and linked with transitions created by our animation engine (striped background). The sign AIMER ("like") is reversed to create AIMER-PAS ("dislike"), as is DONNER ("give") to create PRENDRE ("take"). Finally, a role shift, shown as (c/r), is included in one transition to ensure discourse accuracy and comprehension.

3.2. Open-Ended vs. Scripted

Anonymity in contributions to signed language corpora has been an important conversation within the Deaf communities that support this type of research. At the most basic level, given the face's active involvement in the signing event it is impossible to hide the identity of the signer. Linguistic data has thus been subject to tight controls regarding rights releases to allow data analysis among researchers, as well as data publishing to wider and/or public audiences. This topic becomes even more sensitive when open-ended questions are used to elicit stories for linguistic corpora. Existing corpora use guiding topics to elicit personal responses, which may include reports of abuse or other illegal activities; eventually such data would require censorship when making corpora public. As signing avatars almost inevitably become publicly viewable, researchers aim to avoid controversial topics in recording sessions. As an added benefit, the avatar medium aids in anonymizing elicited data by providing a new face and body for the signer. Figure 2 shows our language consultant alongside the avatar that replays her signing in our animation system.

Figure 2: Avatars provide new identities to signers without covering the face, an important articulator for the signing event.

3.3. Experiments in altering phonological components

Our specific research interests brought us to include a number of indicating verbs and depicting verbs in the SignCom corpus. Among our scientific inquiries are the questions of whether playing reversible indicating verb motions backwards will be convincing and whether altering the handshape of a stored depicting verb will be understood as a change in meaning. For example, an LSF signer can reverse the movement of the LSF sign AIMER ("like") to produce the meaning



ClustLexical Project

Input: an ID-gloss annotated corpus with translations.
Output: a corpus with refined annotations for signs that differ in mouth patterns, based on clustering of facial features.
Future animation: use a representative video from the cluster.

Schmidt et al. (2013)

Hernisa Kacorri Graduate  Center  CUNY   29/35

Active Appearance Models
•  track salient points on the face
•  extract high-level facial features:
   –  mouth vertical openness
   –  mouth horizontal openness
   –  lower lip to chin distance
   –  upper lip to nose distance
   –  left eyebrow state
   –  right eyebrow state
   –  gap between eyebrows
•  necessary: labeled data

(Schmidt, Koller: Enhancing Gloss-Based Corpora, slide 3/13, 19.10.2013)

Clustering Approach
•  Align corpus (sign language glosses to the spoken-language translation)
•  Extract variants (a toy sketch of this variant-extraction step follows below)
•  Cluster variants

[Slide diagram: the gloss sequence EVENING RIVER THREE MINUS SIX MOUNTAIN is aligned to the translation "Tonight three degrees at the Oder, minus six degrees at the Alps", yielding variant labels such as EVENING_tonight, RIVER_Oder, MOUNTAIN_Alps (and, from other sentences, EVENING_evening, RIVER_Rhein, MOUNTAIN_mountains). Variants of the same gloss, e.g. MOUNTAIN_Alps, MOUNTAIN_Eifel, MOUNTAIN_Erzgebirge, MOUNTAIN_Berge, are then grouped into clusters.]

(Schmidt, Koller: Enhancing Gloss-Based Corpora, slide 7/13, 19.10.2013)
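The "extract variants" step amounts to labeling each gloss token with the spoken-language word it aligns to; a toy sketch, assuming an alignment is already available (the function and data layout are illustrative, not Schmidt and Koller's code):

from collections import defaultdict

def extract_variants(sentences):
    # sentences: list of (gloss_sequence, alignment) pairs, where alignment maps a
    # gloss position to the spoken-language word it corresponds to.
    # Returns gloss -> list of variant labels such as "RIVER_Oder".
    variants = defaultdict(list)
    for glosses, alignment in sentences:
        for pos, gloss in enumerate(glosses):
            word = alignment.get(pos)
            if word is not None:
                variants[gloss].append(f"{gloss}_{word}")
    return variants

# Example in the spirit of the slide (hypothetical data):
demo = [(["EVENING", "RIVER", "THREE", "MINUS", "SIX", "MOUNTAIN"],
         {0: "tonight", 1: "Oder", 5: "Alps"})]
# extract_variants(demo)["MOUNTAIN"] -> ["MOUNTAIN_Alps"]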


Approach

Approach Details
•  Tracks salient points on the face and extracts high-level facial features, e.g.:
   –  mouth horizontal openness
   –  lower lip to chin distance
•  Clusters videos with the same ID-gloss based on the facial features (similarity computed with HMMs).
•  Selects a representative video to synthesize the face animation (the medoid of the biggest cluster); a sketch of this selection step follows below.

Cons
•  Could have had better results if time warping were applied to signs of different length.
   –  Dynamic Time Warping can help with the initialization of HMM clustering (Oates et al., 1999).
•  Interesting to see if an averaging approach would have worked better than the selection of a single representative (medoid):
   –  eliminate some noise
   –  less signer-specific variation

Hernisa Kacorri Graduate Center CUNY 30/35
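As referenced in the slide above, the selection step reduces to picking the medoid of the largest cluster; a sketch assuming a precomputed pairwise dissimilarity matrix (for example derived from the HMM-based similarity), with names and data layout chosen for illustration:

import numpy as np

def representative_video(distances, cluster_labels):
    # distances: (n, n) symmetric matrix of pairwise dissimilarities between the
    # videos sharing one ID-gloss.
    # cluster_labels: cluster id assigned to each video.
    # Returns the index of the medoid of the biggest cluster, i.e. the member
    # minimizing the summed distance to the other members of that cluster.
    ids, counts = np.unique(cluster_labels, return_counts=True)
    members = np.flatnonzero(cluster_labels == ids[np.argmax(counts)])
    within = distances[np.ix_(members, members)]
    return int(members[np.argmin(within.sum(axis=1))])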




Data Resources

RWTH-Phoenix-Weather Corpus
•  Video corpus of weather forecasts in German Sign Language:
   –  2711 sentences
   –  vocabulary of 463 ID-glosses
   –  7 native signers
•  Annotated with:
   –  ID-glosses
   –  ID-gloss boundaries in the video
   –  translation in German
   –  time boundaries of the translated sentences in the video

Cons
•  No facial features for tongue position.
•  Their computer vision feature extraction approach does not generalize well across different human signers and provides poor results when the hands move in front of the face.

Hernisa Kacorri Graduate Center CUNY 31/35



Evaluation

Cluster Evaluation
•  On a held-out subset of manually labeled pairs:
   –  precision
   –  recall
   –  F-measure

Cons
•  No animation synthesis.
•  No time adjustment to the manual movements.
•  No user study.

Hernisa Kacorri Graduate Center CUNY 32/35


Evaluation of Representative

•  Accuracy: the fraction of the labeled videos that have the same label as the medoid of the cluster they belong to.

Clustering results
•  Precision: only same mouthings are in the same cluster (avg. 65.3%)
•  Recall: only different mouthings are in different clusters (avg. 82.6%)
•  F-Measure: geometric mean of precision and recall (avg. 67.8%)
[Slide figure: precision, recall, and F-measure plotted per (gloss, translation) item, x-axis 1 to 64, with the averages above.]

(Schmidt, Koller: Enhancing Gloss-Based Corpora, slide 9/13, 19.10.2013)

Clustering results: biggest cluster
•  Accuracy: medoid has the same mouthing as the other cluster members.
•  The overall algorithm achieves an accuracy of 78.4%.
[Slide figure: accuracy plotted per (gloss, translation) item, x-axis 1 to 64, avg. 78.4%.]

(Schmidt, Koller: Enhancing Gloss-Based Corpora, slide 10/13, 19.10.2013)
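The reported scores can be reproduced with straightforward pair counting on the labeled subset; a sketch of the metrics, noting that the F-measure here is computed as the usual harmonic mean (the source slide calls it a geometric mean), and that variable names and data layout are assumptions:

from itertools import combinations

def pairwise_precision_recall_f(mouthing_labels, cluster_labels):
    # precision: of the pairs placed in the same cluster, fraction with the same mouthing
    # recall:    of the pairs with the same mouthing, fraction placed in the same cluster
    tp = fp = fn = 0
    for i, j in combinations(range(len(mouthing_labels)), 2):
        same_label = mouthing_labels[i] == mouthing_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        tp += same_label and same_cluster
        fp += (not same_label) and same_cluster
        fn += same_label and (not same_cluster)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f

def medoid_accuracy(mouthing_labels, cluster_labels, medoid_index):
    # Fraction of labeled videos whose mouthing matches the mouthing of the medoid
    # of the cluster they belong to (medoid_index: cluster id -> video index).
    hits = [mouthing_labels[v] == mouthing_labels[medoid_index[cluster_labels[v]]]
            for v in range(len(mouthing_labels))]
    return sum(hits) / len(hits)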


Overview
•  Introduction
   – Linguistics of Sign Language
   – Facial Expressions in Sign Language
   – Animation Technologies in Sign Language
   – Synthesis of Facial Expression Animations
•  Critique and Comparison of Five Selected Papers
   – Paper Selection Criteria
   – Paper Critiques for Each Project
   – Overall Comparison
•  Conclusions and Future Prospects

Hernisa Kacorri Graduate  Center  CUNY   33/35


Hernisa Kacorri Graduate  Center  CUNY   34/35

Project / Attributes (columns): HamNoSys-based, VCom3D, DePaul, SignCom, ClustLexical

Synthesis Pipeline
   –  HamNoSys-based: generates facial expr. from a detailed user description.
   –  VCom3D: generates facial expr. based on a category selected by the user.
   –  DePaul: generates facial expr. based on linguistics-driven models.
   –  SignCom: maps motion capture data to facial blendshapes.
   –  ClustLexical: selects a representative video to drive the animation.

Corpus
   –  HamNoSys-based: N/A
   –  VCom3D: N/A
   –  DePaul: prior linguistic analysis of corpora.
   –  SignCom: SignCom corpus (3 long dialogues).
   –  ClustLexical: RWTH-Phoenix-Weather.

Input
   –  HamNoSys-based: HamNoSys
   –  VCom3D: 'Empirical' + videos?
   –  DePaul: linguistic findings
   –  SignCom: motion capture
   –  ClustLexical: extracted features

Facial Parameters
   –  HamNoSys-based: proprietary + SAMPA
   –  VCom3D: proprietary
   –  DePaul: proprietary
   –  SignCom: FACS-like
   –  ClustLexical: MPEG4-like

Time Adjustment
   –  HamNoSys-based: user defined
   –  VCom3D: static/repetitive with transition rules.
   –  DePaul: linear time warping is indicated.
   –  SignCom: no time warping (puppetry).
   –  ClustLexical: N/A (facial expressions were not animated).

Extract from "Improvement and expansion of a system for translating text to Sign Language", Chapter 5: Representation of the signs.

Written by Prof. Rubén San-Segundo Hernández. Translated from Spanish to English by Robert Smith.
http://lorien.die.upm.es/~lapiz/
http://www.computing.dcu.ie/~rsmith/


[Figure: PLURAL sign representation]

1.6 LIMITATIONS ON THE REPRESENTATION OF THE SIGNS

While performing the sign generation task, some limitations of the VGuido animated agent environment were encountered in the representation of signs. This section describes these limitations and the solutions used to represent the affected signs.

The animated agent is not designed to handle some HamNoSys transcriptions, so it ignores them or does not move. Similarly, when a sign is transcribed in HamNoSys with a notation error, the generated SiGML file cannot be represented by the animated agent, so no movement is produced.

•  Generation of signs with both hands

When building signs with both hands, we must define the shape, orientation, location and movement of each hand. If the behavior of both hands is identical, the sign is described in HamNoSys as for one-handed signs, but preceded by a symmetry symbol. However, if the hands behave differently in some aspect, this must be indicated by entering two HamNoSys descriptions between brackets, separated by the symbol (Ë). The description of the dominant hand is on the left of the symbol (Ë), and that of the nondominant hand on the right.

For the animated agent to represent the sign correctly, the location symbols cannot be introduced in the same bracket as the shape and orientation symbols; they have to be introduced in different brackets.

For example, the sign "PUT" is transcribed in HamNoSys with the shape, orientation and location described by the following notation:

prior published work in which this was done). Therefore, we consider prior work on ASL videos to determine the eye-tracking metrics we should examine and the hypotheses we should test. While Muir and Richardson [27] did not study sign language animation, they observed changes in proportional fixation time on the face of signers when the visual difficulty of videos varied. Thus, we decided to examine the proportional fixation time on the signer's face. Since there is some imprecision in the coordinates recorded from a desktop-mounted eye-tracker, we decided not to track the precise location of the signer's face at each moment in time during the videos. Instead, we decided to define an AOI that consists of a box that contains the entire face of the signer in approximately 95% of the signing stories. (We never observed the signer's nose leaving this box during the stories.) Details of the AOIs in our study can be found in section 4.

The problem with examining only the proportional fixation time metric is that it does not elucidate whether the participant: (a) stared at the face for a long time and then stared at the hands for a long time or (b) often switched their gaze between the face and the hands during the entire story. Both types of behaviors could produce the same proportional fixation time value. Thus, we also decided to define a second AOI over the region of the screen where the signer's hands may be located, and we record the number of "transitions" between the face AOI and the hands AOI during the sign language videos and animations.

Since prior researchers have recorded that native signers viewing understandable videos of ASL focus their eye-gaze almost exclusively on the face, we make the supposition that if a participant spends time gazing at the hands (or transitioning between the face and hands), then this might be evidence of non-fluency in our animations. It could indicate that the signer's face is not giving the participant useful information (so there is no value in looking at it), or it could indicate that the participant is having some difficulty in recognizing the hand shape/movement for a sign (so participants need to direct their gaze at the hands). In [7], less skilled signers were more likely to transition their gaze to the hands of the signer. If we make the supposition that this is a behavior that occurs when a participant is having greater difficulty understanding a message, then we would expect more transitions in our lower-quality or hard-to-understand animations or videos. While [7] also noted eye-gaze at locative classifier constructions by both skilled and unskilled signers, the stimuli in our study do not contain classifier constructions (complex signs that convey 3D motion paths or spatial arrangements).

Based on these prior studies, we hypothesize the following:

•  H1: There is a significant difference in native signers' eye-movement behavior between when they view videos of ASL and when they view animations of ASL.
•  H2: There is a significant difference in native signers' eye-movement behavior when they view animations of ASL with some facial expressions and when they view animations of ASL without any facial expressions.
•  H3: There is a significant correlation between a native signer's eye movement behavior and the scalar subjective scores (grammatical, understandable, natural) that the signer assigns to an animation or video.
•  H4: There is a significant correlation between a native signer's eye movement behavior and the signer reporting having noticed a facial expression in a video or animation.
•  H5: There is a significant correlation between a native signer's eye movement behavior and the signer correctly answering comprehension questions about a video or animation.

Each hypothesis above will be examined in terms of the following two eye-tracking metrics: proportional fixation time on the face and transition frequency between the face and body/hands. Based on the results of H1, we will determine whether to consider video separately from animations for H3 to H5. Similarly, results from H2 will determine if animations with facial expressions are considered separately from animations without, for H3 to H5.
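Both metrics are simple to compute from a stream of gaze samples once the two AOI boxes are fixed; a sketch with assumed rectangular AOIs and per-sample gaze coordinates (the study's actual eye-tracker output format is not specified here):

def eye_metrics(gaze_points, face_aoi, hands_aoi):
    # gaze_points: list of (x, y) gaze samples; AOIs are (x0, y0, x1, y1) boxes.
    # Returns proportional fixation time on the face AOI and the number of
    # face <-> hands transitions.
    def inside(p, box):
        x, y = p
        x0, y0, x1, y1 = box
        return x0 <= x <= x1 and y0 <= y <= y1

    face_samples = 0
    transitions = 0
    prev_region = None
    for p in gaze_points:
        region = "face" if inside(p, face_aoi) else "hands" if inside(p, hands_aoi) else None
        if region == "face":
            face_samples += 1
        if region and prev_region and region != prev_region:
            transitions += 1
        if region:
            prev_region = region
    prop_face = face_samples / len(gaze_points) if len(gaze_points) else 0.0
    return prop_face, transitions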

4. USER STUDY

To evaluate hypotheses H1-H5, we conducted a user study, where participants viewed short stories in ASL performed by either a human signer or an animated character. In particular, each story was one of three types: a "video" recording of a native ASL signer, an animation with facial expressions based on a "model," and an animation with a static face (no facial expressions) as shown in Fig. 1. Each "model" animation contained a single ASL facial expression (yes/no question, wh-word question, rhetorical question, negation, topic, or an emotion), based on a simple rule: apply one facial expression over an entire sentence, e.g. use a rhetorical-question facial expression during a sentence asking a question that doesn't require an answer. Additional details of the facial expressions in our stimuli appear in [20, 24].

Fig. 1: Screenshots from the three types of stimuli: i) video of human signer, ii) animation with facial expressions, and iii) animation without facial expressions.

A native ASL signer wrote a script for each of the 21 stories, including one of six types of facial expressions. To produce the video stimuli, we recorded a second native signer performing these scripts in an ASL-focused lab environment, as illustrated in [24]. Then another native signer created both the model and no facial expressions animated stimuli by consulting the recorded videos and using some animation software [33]. The video size, resolution, and frame-rate for all stimuli were identical.

During the study, after viewing a story, each participant responded to three types of questions. All questions were presented onscreen (embedded in the stimuli interface) as HTML forms, as shown in Fig. 2, to minimize possible loss of tracking accuracy due to head movements of participants between the screen and a paper questionnaire. On one screen, they answered 1-to-10 Likert-scale questions: three subjective evaluation questions (of how grammatically correct, easy to understand, and naturally moving the signer appeared) and a "notice" question (1-to-10 from "yes" to "no" in relation to how much they noticed an emotional, negative, questions, and topic facial expression during the story). On the next screen, they answered four comprehension questions on a 7-point Likert scale from "definitely no" to "definitely yes." Given that facial expressions in ASL can differentiate the meaning of identical sequences of hand movements [28], both stories and comprehension questions were engineered in such a way that the wrong answers to the comprehension questions would indicate that the participants had misunderstood the facial expression displayed [20]. E.g. the comprehension-question responses would indicate whether a participant had noticed a "yes/no question" facial expression or instead had considered the story to be a declarative statement.




Current and Future Work

Evaluation Methodology for Sign Language Animations

•  Stimuli and comprehension question design. (Kacorri et al. 2013c; Huenerfauth and Kacorri, 2014)

•  Video/animation upper baseline. (Lu and Kacorri, 2012; Kacorri et al., 2013b)

•  Eye-tracking metrics as an alternative evaluation measure. (Kacorri et al., 2013a; Kacorri et al., 2014)

•  Automatic scoring approaches that correlate with participants’ feedback.

Data-driven Modeling of Syntactic Facial Expression

•  MPEG-4 parameterization for both the extracted features and avatar controls. (Kacorri and Huenerfauth, 2014)

•  Time adjustment of facial expressions to the manual signs.

•  Use linguistic insights to select a representative video.

•  Use machine learning to obtain representative facial motion curves.

Hernisa Kacorri Graduate Center CUNY 35/35


SignCom: Cross-mapping of facial mocap and blendshapes

Gaussian Process Regression: Given p observations X = (X(l_1), ..., X(l_p))^T localized at the sites l_i, one looks at the estimation of X(l_k) at a given unobserved localization l_k.
   –  p observations: 123 facial features from 43 markers
   –  estimated value: a vector of blendshape weights

Hernisa Kacorri Graduate Center CUNY 36
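For reference, the textbook Gaussian process regression predictor underlying such a cross-mapping, in standard form rather than as transcribed from the SignCom paper (K is the kernel matrix over the observed marker configurations, \sigma^2 the noise variance, and x_* a new marker configuration):

\bar{f}(x_*) = k(x_*, X)\,\bigl[K(X, X) + \sigma^2 I\bigr]^{-1} Y,
\qquad
\operatorname{var}\bigl[f(x_*)\bigr] = k(x_*, x_*) - k(x_*, X)\,\bigl[K(X, X) + \sigma^2 I\bigr]^{-1} k(X, x_*).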

[Excerpt from a Gaussian process regression tutorial, included as background for the SignCom cross-mapping.]

[Figure 2: The solid line indicates an estimation of y_* for 1,000 values of x_*. Pointwise 95% confidence intervals are shaded.]

4 GPR IN THE REAL WORLD

The reliability of our regression is dependent on how well we select the covariance function. Clearly if its parameters (call them θ) are not chosen sensibly, the result is nonsense. Our maximum a posteriori estimate of θ occurs when p(θ|x, y) is at its greatest. Bayes' theorem tells us that, assuming we have little prior knowledge about what θ should be, this corresponds to maximizing log p(y|x, θ), given by

log p(y|x, θ) = −(1/2) yᵀ K⁻¹ y − (1/2) log|K| − (n/2) log 2π    (10)

Simply run your favourite multivariate optimization algorithm (e.g. conjugate gradients, Nelder-Mead simplex, etc.) on this equation and you've found a pretty good choice for θ.

It's only "pretty good" because, of course, Thomas Bayes is rolling in his grave. Why commend just one answer for θ, when you can integrate everything over the many different possible choices for θ? Chapter 5 of Rasmussen and Williams (2006) presents the equations necessary in this case.

Finally, if you feel you've grasped the toy problem in Figure 2, the next two examples handle more complicated cases. Figure 3(a), in addition to a long-term downward trend, has some fluctuations, so we might use a more sophisticated covariance function (11): the first term takes into account the small vicissitudes of the dependent variable, and the second term has a longer length parameter to represent its long-term trend.


ClustLexical: Clustering
•  Model each facial feature (7 total) by a separate Hidden Markov Model (HMM); a simplified sketch of this HMM-style similarity follows after this slide.
•  #states = frame length of the feature sequence.
•  Single-state garbage model: for co-articulation effects, optionally inserted at the beginning or end of a sequence.
•  Model training: single Gaussian densities, a globally pooled covariance matrix, global state transition penalties, and the EM algorithm with Viterbi approximation and maximum likelihood criterion are employed in a nearest-neighbor fashion.
•  Free HMM parameters (e.g. time distortion penalties): optimized in an unsupervised manner using the German translation as weak labels.

Hernisa Kacorri Graduate  Center  CUNY   37
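A simplified sketch of the HMM-style similarity described on this slide: each frame of a reference sequence becomes a single-Gaussian state with a shared (pooled) diagonal covariance, and a Viterbi-style dynamic program scores another sequence against it with left-to-right transitions and a global skip penalty. The penalty value, the omission of the garbage state, and all names are simplifying assumptions, not the authors' implementation:

import numpy as np

def hmm_style_alignment_score(ref_seq, test_seq, pooled_var, skip_penalty=1.0):
    # ref_seq, test_seq: (R, d) and (T, d) arrays of facial feature vectors.
    # pooled_var: (d,) globally pooled diagonal covariance.
    # Returns a total alignment cost; lower means more similar.
    R, T = len(ref_seq), len(test_seq)

    def emit_cost(i, t):
        # Negative log of a diagonal Gaussian, up to an additive constant.
        d = ref_seq[i] - test_seq[t]
        return 0.5 * np.sum(d * d / pooled_var)

    INF = np.inf
    D = np.full((R, T), INF)
    D[0, 0] = emit_cost(0, 0)
    for t in range(1, T):
        for i in range(R):
            stay = D[i, t - 1]                            # loop in the same state
            move = D[i - 1, t - 1] if i > 0 else INF      # advance one state
            skip = (D[i - 2, t - 1] + skip_penalty) if i > 1 else INF
            D[i, t] = min(stay, move, skip) + emit_cost(i, t)
    return D[R - 1, T - 1]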