
1344 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 9, SEPTEMBER 2012

Animating Lip-Sync Characters With Dominated Animeme Models

Yu-Mei Chen, Fu-Chun Huang, Shuen-Huei Guan, and Bing-Yu Chen, Senior Member, IEEE

Abstract—Character speech animation is traditionally considered as important but tedious work, especially when taking lip synchronization (lip-sync) into consideration. Although there are some methods proposed to ease the burden on artists to create facial and speech animation, almost none is fast and efficient. In this paper, we introduce a framework for synthesizing lip-sync character speech animation in real time from a given speech sequence and its corresponding texts, starting from training dominated animeme models (DAMs) for each kind of phoneme by learning the character's animation control signal through an expectation-maximization (EM)-style optimization approach. The DAMs are further decomposed to polynomial-fitted animeme models and corresponding dominance functions while taking coarticulation into account. Finally, given a novel speech sequence and its corresponding texts, the animation control signal of the character can be synthesized in real time with the trained DAMs. The synthesized lip-sync animation can even preserve exaggerated characteristics of the character's facial geometry. Moreover, since our method can perform in real time, it can be used for many applications, such as lip-sync animation prototyping, multilingual animation reproduction, avatar speech, and mass animation production. Furthermore, the synthesized animation control signal can be imported into 3-D packages for further adjustment, so our method can be easily integrated into the existing production pipeline.

Index Terms—Animeme modeling, character animation, coarticulation, dominated animeme model (DAM), lip synchronization (lip-sync), speech animation.

I. Introduction

WITH the popularity of 3-D animation and video games, facial and speech character animation is becoming more important than ever. MPEG-4 even defined facial animation as one of its key features [1]. There are many technologies allowing artists to create high quality character animation, but facial and speech animation is still difficult to sculpt because the correlation and interaction of the muscles on the face are very complicated. Some physically based simulation methods have been proposed to approximate the muscles on the face, but the computational cost is very high. A less flexible but affordable alternative is the performance-driven approach [2]–[5], in which the motions of an actor are cross-mapped and transferred to a virtual character (see [6] for further discussion). This approach has been successful, but the captured performance is difficult to re-use, so a new performance is required whenever creating a new animation or speech sequence. Manual adjustment, where artists are requested to adjust the face model controls frame-by-frame and compare the results back-and-forth, is still a popular approach.

Manuscript received August 29, 2011; revised December 7, 2011 and January 31, 2012; accepted February 6, 2012. Date of publication May 30, 2012; date of current version August 30, 2012. This paper was recommended by Associate Editor H. Wang.

Y.-M. Chen and B.-Y. Chen are with National Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]; [email protected]).

F.-C. Huang is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720 USA (e-mail: [email protected]).

S.-H. Guan is with the Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei 10617, Taiwan, and also with Digimax, Inc., Taipei 11510, Taiwan (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2012.2201672

When creating facial and speech character animation, it is challenging to have a character model's lips synchronized. It is a labor-consuming process, and even requires millisecond-precise key-framing. Given a spoken script, the artist has to first match the lips' shapes with their supposed positions. The transitions from word-to-word or phoneme-to-phoneme, a.k.a. coarticulation, play a major role in speech animation and need to be adjusted carefully [7]. Coarticulation is the phenomenon that a phoneme can influence the mouth shapes of the previous and next phonemes. In other words, the mouth shape depends on not only the current phoneme but also its context, including at least the previous and next phonemes. As opposed to simple articulated animation that can be key-framed with linear techniques, coarticulation is nonlinear and difficult to model.

In this paper, we propose a framework to synthesize lip synchronization (lip-sync) character speech animation in real time. For each phoneme, one or multiple dominated animeme models (DAMs) are first learned via clustering from a training set of speech-to-animation control signals (e.g., the character controls used in Maya or cross-mapped mocap lip-motions). A DAM is the product of a latent dominance function and an intrinsic animeme function, where the former controls coarticulation and the latter models the mouth shape in the subphoneme accurately. The two entangled functions are learned and decomposed through an EM-style solver.

In the synthesis phase, given a novel speech sequence, the DAMs are used to synthesize the corresponding speech-to-animation control signal to generate the lip-sync character speech animation automatically, so it can be integrated into the existing animation production pipeline easily. Moreover, since our method can synthesize acceptable and robust lip-sync animation in real time, it can be used in many applications for which prior techniques are too slow, such as lip-sync animation prototyping, multilingual animation reproduction, avatar speech, mass animation production, and so on.

To summarize, the contributions of this paper are as follows.

1) A framework is proposed to synthesize lip-sync character speech animation in real time.

2) Instead of generating hard-to-adjust vertex deformations like other approaches, a high-level control signal of 3-D character models is synthesized. Hence, our synthesis process can be more easily integrated into the existing animation production pipeline.

3) We present the DAM, which fits coarticulation better by modeling the animation control signal in subphoneme precision with the product of a latent dominance function and an intrinsic animeme function.

4) Multiple DAMs are used to handle large intra-animeme variations.

II. Related Work

Face modeling and facial or speech animation are broad topics; [6] and [7] provided good surveys. In this section, we discuss face modeling and the specific modeling of lips separately.

A. Facial Animation and Modeling

Most facial animation and modeling methods can be categorized into parameterized or blend-shape, physically-based, data-driven, and machine-learning approaches. For parameterized or blend-shape modeling, faces are parameterized into controls; the synthesis is done manually or automatically via control adjustment. Previous work on linear blend-shape [8]–[10], face capturing or manipulation (FaceIK) [11], and face cloning or cross-mapping [12]–[16] provided a fundamental guideline for many extensions. However, their underlying mathematical frameworks indeed have some limitations, e.g., the faces outside the span of examples or parameters cannot be realistically synthesized, and these techniques require an excessive number of examples. Other methods reduce the interference between the blend-shapes [17] or enhance the capabilities of cross-mapping to animate the face models [18].

Physically-based methods [19], [20] simulate the muscles on the face, and the underlying interaction forms the subtle motion on the skin. The advantage of the physically-based methods over the parameterized or blend-shape ones is extensibility; the faces can be animated more realistically, and the framework allows interaction with objects. However, the muscle simulation is very expensive, and hence reduces the applicability to an interactive controller.

Data-driven methods [21] construct a database from a very large training dataset of faces. The synthesis of novel facial animation is generated by searching the database and minimizing the discontinuity between successive frames. Given the starting and ending example frames, the connecting path in the database forms newly synthesized facial animation. However, they have to deal with missing training data or repetitive occurrence of the same records.

Machine-learning techniques base their capabilities on the learned statistical parameters from the training samples. Previous methods [22]–[25] employed various mathematical models and can generate new faces from the learned statistics while respecting the given sparse observations of the new data.

In our system, we adopt the blend-shape facial basis based on the facial action coding system [8] to form the speech-to-animation controls to drive the 3-D face models easily. By merging the advantages of the data-driven and machine-learning techniques, we construct a lip-shape motion control database to drive speech activities and moreover generate new lip-sync motions. Unlike other previous methods that directly use training data to synthesize results, our approach can synthesize natural lip shapes that did not appear in the training data set.

B. Lip-Sync Speech Animation

Many speech animation methods derive from the facial animation and modeling techniques. The analysis of the phonemes under the context of speech-to-face correspondence, a.k.a. viseme, is the subject of many successful works. Many previous methods addressed this issue with spline generation, path-finding, or signal concatenation.

Parameterized or blend-shape techniques [26]–[28] for speech animation are the most popular methods because of their simplicity. Sifakis et al. [29] presented a physically-based approach to simulate the speech controls based on [20] for muscle activation. This method can interact with objects while simulating, but there is still the problem of simulation cost. Data-driven approaches [21], [30] form a graph for searching the given sentences. Like similar approaches, they used various techniques, e.g., dynamic programming, to optimize the searching process. Nevertheless, they still suffer from missing data or duplicate occurrences. Machine-learning methods [31]–[35] learn the statistics for the phoneme-to-animation correspondence, which is called an animeme, in order to connect animation up to speech directly and reduce these searching efforts.

Lofqvist [36] and Cohen and Massaro [37] provided a key insight into decomposing the speech animation signal into target values (mouth shapes) and latent dominance functions to model the implicit coarticulation. In subsequent work, the dominance functions are sometimes reduced to a diphone or triphone model [33] for simplicity. However, the original framework shows some examples (e.g., the time-locked or look-ahead model) that are difficult to explain by the simpler diphone or triphone model. Their methods are later extended by Cosi et al. [38] with resistance functions and shape functions, which is the basic concept of the animeme.

Some recent methods [29], [34], [35] used the concept of animeme, a shape function, to model the subviseme signal to increase the accuracy of phoneme fitting. Kim and Ko [34] extended [31] by modeling the viseme within a smaller subphoneme range with a data-driven approach. Coarticulation is modeled via a smooth function in their regularization with the parameters found empirically. However, it has to resolve conflicting and insufficient records in the training set. Sifakis et al. [29] modeled the muscle-control-signal animeme (they call it physeme) for each phoneme, and concatenate these animemes for words. They found that each phoneme has various similar animemes with slight variations due to coarticulation, which is modeled with linear cross-fade weighting in a diphone or triphone fashion.

Kshirsagar et al. [39] presented a different approach to model coarticulation by using the visyllable. Each syllable contains at least one vowel and one or more consonants. It requires about 900 demi-visyllables for the system in their experiments, and therefore the approach needs a huge database. Wampler et al. [35] extended the multilinear face model [25] to derive new lip shapes for a single face model. Coarticulation is modeled by minimizing the lips' positions and forces exerted. However, it is usually unnecessary to sample the face tensor space to produce a single speech segment. Moreover, the face tensor space also inherits the curse of dimensionality, which is also a difficult topic for facial capturing.

We learned from many successful previous methods and improved on their deficiencies. The analysis in the subviseme, or so-called animeme, space has significant improvements over the viseme analysis. In addition, we also solve for the hidden dominance functions, and extend coarticulation beyond the simpler diphone or triphone model. Moreover, the synthesis process is much simpler and faster because the models used for generating the results are trained in an offline prepass.

III. DAMs

To animate a character (face) model from a given script (phonemes), it is necessary to form the relationship between the phonemes and the animation control signal C(t); this relationship is called an animeme, i.e., the animation representation of a phoneme. However, due to coarticulation, it is hard to model the animeme with a simple function, so we model the animation control signal C(t) as the product of two functions: the animeme function and its dominance function. The animeme function controls the intrinsic mouth shapes when a phoneme is used alone without other phonemes. When putting words together, it is necessary to concatenate several phonemes, and the dominance functions of the animemes control their individual influence and falloff, and hence coarticulation. Mathematically, one DAM is modeled as follows:

C(t) = D(t) A(t), \quad t \in (-\infty, \infty)

where the animeme function A(t) is modeled with a high-degree polynomial function to simulate the relationship between phonemes and lip shapes, and the dominance function D(t) is modeled via a modified exponential function, which is used to simulate coarticulation.

Some previous literature [33], [36] described the dominance function as a bell-shape function. That means, although our lip shape is mainly affected by the current phoneme, the lip shape is also affected by the neighboring phonemes. Inspired by [37], if the time is within the activation of the phoneme (i.e., t ∈ [0, 1]), then the animeme has full influence. Exponential falloff is applied when the time is outside the activation period of the phoneme:

D(t) =
\begin{cases}
1, & t \in [0, 1] \\
\exp\left(\dfrac{-t^2}{\sigma^2 + \varepsilon}\right), & t < 0 \\
\exp\left(\dfrac{-(t-1)^2}{\sigma^2 + \varepsilon}\right), & t > 1
\end{cases}
\qquad (1)

where σ is the phoneme-specific parameter affecting the range of influence, and ε is a small constant to prevent dividing by zero.

Putting multiple phonemes together to get the full sequence of the animation control signal, we simply concatenate these DAMs with the summation of their normalized values

C^{*}(t) = \sum_{j=1}^{J} C_j(t_j) = \sum_{j} D_j(t_j) A_j(t_j) \qquad (2)

where j = 1, 2, ..., J indicates the jth phoneme in the given phoneme sequence, and t_j = (t - s_j)/d_j is the normalized local time for each phoneme activation, where s_j is the starting time stamp of the jth phoneme and d_j is its duration. Generally, in the dominance function of an animeme, the falloff controls its influence beyond its phoneme activation period. Strong coarticulation has slow falloff and vice versa. Note that the phonemes farther away from the current phoneme may have very little contribution to it, so the influence of the DAMs far from it is relatively small.
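To make the above construction concrete, the following Python sketch evaluates the dominance function of (1), a polynomial animeme function, and the concatenation of (2). It is a minimal illustration under assumed placeholder timings, coefficients, and falloff values, not the implementation used in this paper.

import numpy as np

def dominance(t, sigma, eps=1e-6):
    # Dominance function of (1): full influence inside [0, 1], exponential falloff outside.
    t = np.asarray(t, dtype=float)
    d = np.ones_like(t)
    d = np.where(t < 0, np.exp(-t**2 / (sigma**2 + eps)), d)
    d = np.where(t > 1, np.exp(-(t - 1)**2 / (sigma**2 + eps)), d)
    return d

def animeme(t, coeffs):
    # Animeme function A(t): a low-degree polynomial in the normalized local time.
    # coeffs = [a0, a1, ..., aM]; np.polyval expects the highest degree first.
    return np.polyval(coeffs[::-1], t)

def synthesize(times, phonemes):
    # Concatenate DAMs with (2); each phoneme carries its starting time s,
    # duration d, falloff sigma, and polynomial coefficients a.
    c = np.zeros_like(times, dtype=float)
    for p in phonemes:
        t_local = (times - p["s"]) / p["d"]          # normalized local time t_j
        c += dominance(t_local, p["sigma"]) * animeme(t_local, p["a"])
    return c

# Toy example: two overlapping phonemes driving one animation control.
times = np.linspace(0.0, 1.0, 200)
phonemes = [
    {"s": 0.0, "d": 0.4, "sigma": 0.3, "a": [0.2, 1.0, -0.8, 0.0, 0.0]},
    {"s": 0.4, "d": 0.6, "sigma": 0.2, "a": [0.5, -0.3, 0.1, 0.0, 0.0]},
]
control_signal = synthesize(times, phonemes)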

One major observation beyond the above description is the intra-animeme variation. In fact, the lip shapes of some phonemes vary strongly. By performing unsupervised clustering [40], we found that some phonemes can have multiple DAMs, which we call modes; the choice of which mode to use depends on the speech context. This finding coincidentally agrees with many successful data-driven methods.

In the subsequent sections, we will use the DAMs and give a system that learns and synthesizes speech animation sequences. To learn the parameters for modeling animemes and their dominance functions, multiple modes of each phoneme are first found by affinity propagation [40]. Then, an EM-style solver is performed to learn the DAM parameters for each mode, specifically the polynomial coefficients for animeme functions and the falloff controls [σ in (1)] for dominance functions. Once the parameters are learned, we can synthesize the animation control signal given a novel speech sequence and its corresponding texts. The given texts provide the guide to choose an individual DAM for each phoneme, and the chosen DAMs are then concatenated with (2).

IV. Overview

Fig. 1 shows our system flowchart. The system has two phases: training (left) and synthesis (right). In the training phase, the system takes as input the captured lip motions or the animation control signal of the character (face) model directly. The animation control signal is usually used to drive the motions of a character model in the modeling tools, like Maya. If we choose the lip-tracking result from a speech video or 3-D lip motions captured by a mocap facility, the data in the vertex domain will be first cross-mapped to the control signal domain (discussed in the Appendix). If there exists acceptable lip-sync character animation, the capturing and cross-mapping processes can be omitted and the speech-to-animation control signal from the existing artist-sculpted or captured speech animation can be used directly.

Fig. 1. System flowchart.

Then, the speech and its corresponding texts are aligned with SPHINX-II [41] to obtain the aligned scripts (phoneme sequence), which contain phonemes with their starting time stamps and durations in the speech. The aligned scripts and animation control signal C(t) are used as training examples to construct the DAMs (Section V) for future novel speech animation synthesis.

In the synthesis phase, we take as input a novel speech and its corresponding texts, and use SPHINX-II again to obtain the aligned scripts. From the scripts, the DAMs are concatenated to generate the animation control signal C∗ (Section VI). Finally, the animation control signal C∗ is used to animate the character (face) model in Maya or similar modeling tools to generate the lip-sync character speech animation.

The core components in the system are the learning module for constructing and modeling the DAMs and the synthesis module for generating the animation control signal C∗, which will be explained in Sections V and VI, respectively.

V. Learning DAMs

A. Learning Modes for Phonemes

According to the aligned scripts (phoneme sequence), every phoneme can have many corresponding animation control signals. Based on these training examples, we can construct the phoneme's DAM(s). However, we found that it is difficult to decouple the animeme function and its dominance function gracefully if we construct a single DAM for each phoneme due to large intra-animeme variations. Instead, multiple DAMs, or modes, for each phoneme are used. The choice of modes in a speech sequence depends on the speech context.

The training animation control signal for each phoneme is first fitted and reconstructed with a cubic spline interpolation, while the duration of the phoneme is parameterized to t ∈ [0, 1]. Then, an unsupervised clustering algorithm, affinity propagation [40], is used to cluster the training control signal into some modes; the number of clusters is determined automatically.
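A rough sketch of this step is given below (not the code used in this paper): each phoneme's control-signal segments are resampled on a common normalized time axis with a cubic spline and then clustered with affinity propagation; the training segments here are random placeholders.

import numpy as np
from scipy.interpolate import CubicSpline
from sklearn.cluster import AffinityPropagation

def resample_segment(t, c, n_samples=32):
    # Fit a cubic spline to one phoneme's control signal and resample it
    # over the duration parameterized to [0, 1].
    t = (t - t[0]) / (t[-1] - t[0])
    return CubicSpline(t, c)(np.linspace(0.0, 1.0, n_samples))

def cluster_modes(segments):
    # Cluster the resampled segments of one phoneme into modes; affinity
    # propagation [40] chooses the number of clusters automatically.
    X = np.vstack([resample_segment(t, c) for t, c in segments])
    ap = AffinityPropagation(random_state=0).fit(X)
    return ap.labels_, ap.cluster_centers_indices_

# Placeholder training segments (time stamps, control values) for one phoneme.
rng = np.random.default_rng(0)
segments = [(np.sort(rng.uniform(0.0, 0.3, 20)), rng.standard_normal(20))
            for _ in range(12)]
labels, exemplar_indices = cluster_modes(segments)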

Note that the idea of modes is not new; data-driven approaches synthesize animation by searching animation clips within the database. These kinds of methods have to deal with repetitive clips. Which clips are used depends on smooth transitions and user-specified constraints, which is similar to our choice of modes. In the synthesis phase (Section VI), we will discuss the mode selection in more detail.

B. Estimating Animeme Function

Assuming each mode of each phoneme appears in the sequence exactly once and denoting the jth dominance function D_j(i) at time i as a fixed value D_j^i, the estimation of the polynomial function A_j(t) can be reduced to finding the polynomial coefficients a_j^0, a_j^1, ..., a_j^M. Then, (2) can be rewritten as follows:

C(i) = \sum_{j=1}^{J} D_j^i \left[ \sum_{m=0}^{M} a_j^m (t_j^i)^m \right] \qquad (3)

where t_j^i = (i - s_j)/d_j is the normalized local time stamp from the activation of the jth phoneme.

Since we want to find the coefficients a_j^0, a_j^1, ..., a_j^M for each phoneme j (M = 4 in our implementation), in a regression manner, we can set the partial derivative of the regression error R with respect to the mth coefficient a_j^m for the jth phoneme to zero. The least-squares fitting for regression is

f_i = C(i) - \sum_{j=1}^{J} D_j^i \left[ \sum_{m=0}^{M} a_j^m (t_j^i)^m \right]

R = F^T F = \sum_{i=0}^{n} \left( C(i) - \sum_{j=1}^{J} D_j^i \left[ \sum_{m=0}^{M} a_j^m (t_j^i)^m \right] \right)^2 \qquad (4)

where F is the column-concatenated vector formed for each element f_i. Since the unknowns a_j^m are linear in F, the problem is essentially a linear least-squares fitting. By setting all partial derivatives to zero and arranging (4), we can obtain the following matrix representation:

D = \begin{bmatrix}
D_1^1 (t_1^1)^0 & \cdots & D_1^1 (t_1^1)^M & \cdots & D_J^1 (t_J^1)^0 & \cdots & D_J^1 (t_J^1)^M \\
D_1^2 (t_1^2)^0 & \cdots & D_1^2 (t_1^2)^M & \cdots & D_J^2 (t_J^2)^0 & \cdots & D_J^2 (t_J^2)^M \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
D_1^n (t_1^n)^0 & \cdots & D_1^n (t_1^n)^M & \cdots & D_J^n (t_J^n)^0 & \cdots & D_J^n (t_J^n)^M
\end{bmatrix}

A = \begin{bmatrix} a_1^0 & \cdots & a_1^M & \cdots & a_J^0 & \cdots & a_J^M \end{bmatrix}^T

C = \begin{bmatrix} C^0 & C^1 & C^2 & \cdots & C^n \end{bmatrix}^T

where D is the dominance matrix, A is the coefficient vector we want to solve, and C is the observed values at each time i, so the minimum error to the regression fitting can be written in the standard normal equation with the following matrix form:

(D^T D) A = D^T C \qquad (5)

where D is an n × ((M + 1) × J) matrix, C is an n vector, and A is an (M + 1) × J vector to be solved.
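As an illustration of this step (a sketch, not the paper's code), the dominance-weighted polynomial design matrix can be assembled and the normal equation (5) solved with a standard least-squares routine; the timings, dominance values, and observations below are random placeholders.

import numpy as np

def build_design_matrix(times, phonemes, dom_values, M=4):
    # Row i, block j holds D_j^i * (t_j^i)^m for m = 0..M, as in (3)-(5);
    # dom_values[i, j] is the dominance of phoneme j at frame i, held fixed here.
    n, J = dom_values.shape
    D = np.zeros((n, (M + 1) * J))
    for j, p in enumerate(phonemes):
        t_local = (times - p["s"]) / p["d"]           # normalized local time
        for m in range(M + 1):
            D[:, j * (M + 1) + m] = dom_values[:, j] * t_local**m
    return D

def solve_animeme_coefficients(D, C):
    # Solving (D^T D) A = D^T C is equivalent to a linear least-squares fit;
    # lstsq is used here for numerical robustness.
    A, *_ = np.linalg.lstsq(D, C, rcond=None)
    return A                                          # [a_1^0..a_1^M, ..., a_J^0..a_J^M]

# Placeholder data: three phonemes, one control, 300 frames.
n, J, M = 300, 3, 4
times = np.linspace(0.0, 1.5, n)
phonemes = [{"s": 0.0, "d": 0.5}, {"s": 0.5, "d": 0.5}, {"s": 1.0, "d": 0.5}]
dom_values = np.random.rand(n, J)
C = np.random.randn(n)
A = solve_animeme_coefficients(build_design_matrix(times, phonemes, dom_values, M), C)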

If we remove the assumption that each mode of each phoneme appears exactly once, multiple occurrences of each mode of a phoneme have to be fitted to the same value. Hence, we can rearrange the multiple occurring terms and make it easier to solve. For example, if phoneme 1 (with only one mode) appears twice as the first and third phonemes in the phoneme sequence, then (3) becomes

C(i) = D_{1_1}^i A_{1_1}(t_{1_1}^i) + D_2^i A_2(t_2^i) + D_{1_2}^i A_{1_2}(t_{1_2}^i) + \cdots
     = \left[ D_{1_1}^i + D_{1_2}^i \right] a_1^0 + \left[ D_{1_1}^i (t_{1_1}^i) + D_{1_2}^i (t_{1_2}^i) \right] a_1^1 + \cdots + D_2^i a_2^0 + D_2^i a_2^1 (t_2^i) + D_2^i a_2^2 (t_2^i)^2 + \cdots \qquad (6)

where 1_1 and 1_2 indicate the first and second times that phoneme 1 appeared. Note that the polynomial coefficients a_j^m of the animeme function A_j(t) are the same and independent of the occurrences.

By the above rearrangement, we can remove the original assumption that each mode of each phoneme can appear exactly once, and rewrite the original term in (3) as a summation over the occurrences H_j of the same mode of phoneme j as follows:

D_j^i (t_j^i)^m \;\Rightarrow\; \sum_{h \in H_j} D_{j_h}^i (t_{j_h}^i)^m \qquad (7)

where j_h denotes the hth occurrence of the mode of phoneme j.

C. Estimating Dominance Function

In the previous section, to estimate the animeme function A_j(t) of the jth phoneme, we assumed that its dominance function D_j(t) is known and fixed. In this section, we will describe how to estimate the dominance function D_j(t) over the regression, given that the animeme value A_j(i) at time i is known and fixed, denoted as A_j^i. Back to the definition of the dominance function formulated in (1), for phoneme j, its influence control is affected by σ_j, which is unknown now.

Here, we want to minimize the regression (4) again as we did in the previous section. However, since the parameter σ_j for regression is nonlinear, we need a more sophisticated solver. The standard Gauss–Newton iterative solver is used to approach the minimum of the regression error R. As we defined the residual error in the previous section, the Gauss–Newton algorithm linearizes the residual error as follows:

f_i = C(i) - \sum_{j=1}^{J} D_j(t_j^i) A_j^i

F(\sigma_j + \delta) \approx F(\sigma_j) + J\delta \qquad (8)

where t_j^i = (i - s_j)/d_j is the normalized local time, F is formed by f_i but takes as input the influence control σ_j for the jth phoneme, δ is the updating step for the gradient direction of the Gauss–Newton solver, and J is the Jacobian matrix. Each iteration of the Gauss–Newton algorithm solves a problem linearized from (4), and after removing the terms that do not depend on δ, we get the following:

J^T J \, \delta = -J^T F

\sigma_j^{k+1} = \sigma_j^k + \delta \qquad (9)

The Gauss–Newton algorithm repeatedly optimizes the regression error by updating δ to σ_j^k at the kth iteration, and achieves linear convergence.
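The following sketch illustrates this M-step for a single phoneme with a plain Gauss–Newton loop and a finite-difference Jacobian; it is an illustrative stand-in rather than the solver used in this paper, and the toy data assume the animeme values are already fixed.

import numpy as np

def dominance(t, sigma, eps=1e-6):
    # D(t) from (1), evaluated elementwise.
    d = np.ones_like(t)
    d = np.where(t < 0, np.exp(-t**2 / (sigma**2 + eps)), d)
    d = np.where(t > 1, np.exp(-(t - 1)**2 / (sigma**2 + eps)), d)
    return d

def gauss_newton_sigma(sigma0, t_local, A_fixed, C_obs, iters=20, h=1e-5):
    # Residuals f_i = C(i) - D(t_i; sigma) A_i with A_i fixed (single-phoneme case);
    # each iteration solves J^T J delta = -J^T F and updates sigma, as in (8)-(9).
    sigma = float(sigma0)
    for _ in range(iters):
        F = C_obs - dominance(t_local, sigma) * A_fixed
        J = -(dominance(t_local, sigma + h) - dominance(t_local, sigma)) / h * A_fixed
        delta = -(J @ F) / (J @ J + 1e-12)
        sigma += delta
        if abs(delta) < 1e-8:
            break
    return sigma

# Toy data generated with a known falloff (0.25), then recovered from sigma0 = 1.
t_local = np.linspace(-0.5, 1.5, 200)
A_fixed = np.ones_like(t_local)
C_obs = dominance(t_local, 0.25) * A_fixed
sigma_est = gauss_newton_sigma(1.0, t_local, A_fixed, C_obs)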

D. Learning With Iterative Optimization

In the previous two sections, we showed how to minimize the regression error by estimating the animeme function A_j(t) and its hidden dominance function D_j(t). Since the entire formulation is not linear and cannot be solved intuitively, we employed an EM-style strategy that iterates between the estimation of the animeme function A_j(t) and the optimization for the dominance function D_j(t).

The E-step involves estimating the polynomial coefficients a_j^m for each animeme function A_j(t) by solving a linear regression using the standard normal equation.

The M-step minimizes the regression error to estimate the influence controls σ_j by improving the nonlinear dominance function D_j(t).

First, when solving for the E-step, the initial influence control parameters σ_j involved in D_j(t) are set to 1. At the M-step, where the Gauss–Newton algorithm linearizes the function by iteratively updating the influence controls σ_j, all parameters of the polynomial coefficients a_j^m are carried over from the first half of the iteration. The EM-style strategy keeps iterating between the E-step and M-step until no more improvement on the regression error can be made. Convergence of optimizing D_j(t) is fast, but the effect of estimating A_j(t) has more perturbation on σ_j. The number of iterations for convergence varies for different DAMs and is directly proportional to the quantity of the clustered control signals for each DAM, but the process is an offline computation in the training phase, separate from synthesis.
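A compact, self-contained toy version of the alternating scheme for a single phoneme and control is sketched below; it is only illustrative, and a coarse 1-D search stands in for the Gauss–Newton M-step of Section V-C.

import numpy as np

def em_fit(t, c, M=1, iters=30, tol=1e-10):
    # Alternate between the E-step (linear fit of the polynomial coefficients with
    # the dominance fixed) and the M-step (update of sigma with the coefficients fixed)
    # until the regression error stops improving.
    def dom(t, sigma, eps=1e-6):
        d = np.ones_like(t)
        d = np.where(t < 0, np.exp(-t**2 / (sigma**2 + eps)), d)
        return np.where(t > 1, np.exp(-(t - 1)**2 / (sigma**2 + eps)), d)

    def error(a, sigma):
        poly = np.vstack([t**m for m in range(M + 1)]).T @ a
        return np.sum((c - dom(t, sigma) * poly)**2)

    sigma, prev = 1.0, np.inf
    for _ in range(iters):
        # E-step: linear least squares for a_0..a_M.
        basis = dom(t, sigma)[:, None] * np.vstack([t**m for m in range(M + 1)]).T
        a, *_ = np.linalg.lstsq(basis, c, rcond=None)
        # M-step: coarse 1-D search over sigma (stand-in for Gauss-Newton).
        grid = np.linspace(0.05, 2.0, 80)
        sigma = grid[np.argmin([error(a, s) for s in grid])]
        cur = error(a, sigma)
        if prev - cur < tol:
            break
        prev = cur
    return a, sigma

# Toy data from a known animeme (0.3 + 0.5 t) and falloff (0.2), then refit.
t = np.linspace(-0.5, 1.5, 200)
d_true = np.where(t < 0, np.exp(-t**2 / 0.04),
                  np.where(t > 1, np.exp(-(t - 1)**2 / 0.04), 1.0))
c = d_true * (0.3 + 0.5 * t)
a, sigma = em_fit(t, c)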

VI. Synthesizing With DAMs

In the synthesis phase, we want to generate the output control signal according to the input phoneme sequence. Since some phonemes may have multiple modes, we have to decide which mode should be used for each phoneme. Constructing the output animation control signal requires selecting the most suitable mode for each phoneme, and then directly using (2) to concatenate the DAMs in the sequence.

Given a phoneme sequence j = 1, 2, ..., J and possible modes DAM_j^g (g = 1, ..., G_j, where G_j is the number of modes) for each phoneme j, the animemes can form an animeme graph as shown in Fig. 2. The selection of suitable modes for the phoneme sequence can be treated as a graph search problem, and the A* algorithm is used in our implementation. Since we want to find a compromise between the likelihood of the modes and the smoothness in the animation, the cost of each node in the animeme graph is set as follows:

E = w_c E_c + w_s E_s \qquad (10)

where E_c is a data term, which represents the likelihood of the mode DAM_j^g in the training set linked with its previous and next phonemes, E_s is the smoothness term computing the C^2 smoothness on the joint frame of every DAM_j^g (g = 1, ..., G_j) of the current phoneme j and every DAM_{j-1}^g (g = 1, ..., G_{j-1}) of its previous phoneme j - 1, and w_c and w_s are the weights of the error terms. We used w_c = 1000 and w_s = 1 for all the results in this paper.

Fig. 2. Animeme-graph example for synthesizing "graph." There are multiple DAMs (modes) for one phoneme (with the same color). The suitable sequence (denoted by solid circles and lines) is selected by the A* algorithm.

TABLE I
Models Used in This Paper

Model         Vertex #   Face #   Control #
Afro woman      5234      5075        7
Boy             6775      6736        7
Child           6991      6954       16
Old hero        8883      8738        8
Court lady      1306      1307        7
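The selection step can be illustrated with the following sketch (placeholder costs; dynamic programming over the animeme graph is used here as a simple stand-in for the A* search described above).

import numpy as np

def select_modes(data_cost, smooth_cost, wc=1000.0, ws=1.0):
    # Pick one mode per phoneme minimizing the accumulated cost wc*E_c + ws*E_s of (10);
    # data_cost[j][g] is E_c of mode g for phoneme j, and smooth_cost[j][g_prev][g]
    # is E_s between consecutive modes of phonemes j-1 and j.
    J = len(data_cost)
    best = [wc * np.asarray(data_cost[0], dtype=float)]
    back = [None]
    for j in range(1, J):
        prev = best[-1][:, None]
        total = prev + ws * np.asarray(smooth_cost[j]) + wc * np.asarray(data_cost[j])[None, :]
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0))
    # Trace back the cheapest path through the graph.
    path = [int(np.argmin(best[-1]))]
    for j in range(J - 1, 0, -1):
        path.append(int(back[j][path[-1]]))
    return path[::-1]

# Toy example: three phonemes with 2, 3, and 2 candidate modes.
data_cost = [[0.1, 0.4], [0.3, 0.2, 0.5], [0.2, 0.1]]
smooth_cost = [None,
               np.random.rand(2, 3),     # E_s between modes of phonemes 1 and 2
               np.random.rand(3, 2)]     # E_s between modes of phonemes 2 and 3
selected = select_modes(data_cost, smooth_cost)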

VII. Results

The training set involves 80 sentences and about 10 min of speech context with unbiased content. In the training phase, constructing the DAMs costs about 50–60 min per control on a desktop PC with an Intel Core2 Quad Q9400 2.66 GHz CPU and 4 GB memory. For synthesizing a lip-sync speech animation, the animation control signal formed by our DAMs is generated in real time (i.e., 0.8 ms per phoneme on average). Table I shows the numbers of vertices, faces, and controls of the models used in this paper.

Fig. 3 shows a comparison of the training data and the synthesized results of our DAM, the Cohen–Massaro model [37], and the multi-dimensional morphable model (MMM) [31], on the Afro woman model saying "popular." Fig. 4 shows a part of the signal fitting for these results. The average L2-norms for DAM, the Cohen–Massaro model, and MMM are 0.4724, 0.6297, and 0.5023, respectively. This sequence represents continuous lip motion, and the flow is from left to right. According to the training data, the lips should be closed during the phoneme "P" and opened for other phonemes appropriately. At the last frame of the sequence, the mouth closes to prepare for the next word. Note that the Cohen–Massaro model is implemented using our DAM by setting M = 0 in (4), i.e., the polynomial form is reduced to only the constant term. The formulation of our dominance function (1) is very similar to their original form but with the flexible extension that the shapes of the phonemes can be varied. The reconstruction result of the Cohen–Massaro model is too smooth at some parts in the sequence such that consecutive phonemes are greatly influenced, i.e., they span too much. Therefore, the features of a few phonemes can be shown, but others are not as prominent as they should be. In contrast, our DAM spans more properly in a range with respect to the training data. The MMM formulates the fitting and synthesis as a regularization problem. It fits each phoneme as a multidimensional Gaussian distribution and forms the words or sentences as a path going through these phoneme regions by minimizing an energy function containing a target term and a smoothness term. The speech poses using MMM have good timing, but lack prominent features.

Fig. 3. Comparison of (a) training data and (b) synthesized results of DAM, and (c) Cohen–Massaro model and (d) MMM, on the Afro woman model saying "popular."

Fig. 4. Comparison of the signal fitted in Fig. 3 by DAM, Cohen–Massaro model, and MMM with the captured one. The y-axis shows one of the coordinates of a control.

Fig. 5 shows two results of the Afro woman model saying "apple" and "girl." As shown in the closeup view of the mouth, although the last phonemes of the two words are the same ("L"), the lip shapes are different due to coarticulation. Note that the lip shape for pronouncing the phoneme "P" shows the mouth closes well, although some other similar methods cannot due to the smoothness effect. One can also notice that while pronouncing the phoneme "ER," the tongue is rolled. In general, the shape change of the tongue is very hard to capture. However, since our method uses the animation control signal as the training data, once the target model is designed well for performing such features, our synthesis results can also keep these characteristics.


Fig. 5. Afro woman model saying (a) “apple” and (b) “girl.”

Fig. 6. Old hero model saying “old man.”

Figs. 6 and 7 show two other results with different models; the old hero model says "old man" and the boy model says "top of the world." Both have their typical styles. Comparing the lip shapes of the two models while pronouncing the phoneme "L," the two characters perform in different ways because of coarticulation and their individual characteristics. To keep the models' characteristics during synthesis, our method is character (model) dependent. Each character's DAMs should be trained by its own animation control signal. Models with similar controls can use the same DAMs. Of course, artists can also use another model's trained DAMs to make a prototype of a novel model, and then refine the model for training its own DAMs. This can speed up the training preparation time while keeping the quality of the training data.

VIII. Conclusion

In this paper, we proposed a new framework for synthesizing lip-sync character speech animation in real time with a given novel speech sequence and its corresponding texts. Our method produced natural transitions in time and generated the animation control parameters formed by our DAMs, which were constructed and modeled from the training data with subphoneme accuracy to capture coarticulation well. Through an EM-style optimization approach, the DAMs were decomposed to the polynomial-fitted animeme functions and their corresponding dominance functions according to the phonemes. Given a phoneme sequence, the DAMs were used to generate the animation control signal to animate the character (face) model in Maya or similar modeling tools in real time while still keeping the character's exaggerated characteristics. Moreover, the DAMs were constructed from the character controls instead of absolute lip shapes, so the method yields better training and synthesis results and is suitable to be integrated into the existing animation pipeline. By using the facial animation parameters defined in MPEG-4 for training and synthesis, our approach can be easily extended to support MPEG-4 facial animation [42].


Fig. 7. Boy model saying “top of the world.”

Even though the quality of the synthesized lip-sync character speech animation may not be perfect compared with that of animation created manually by an artist, the synthesized animation can still easily be fine-tuned, since the automatically generated animation control signal is lip-synchronized and can be used directly in Maya or similar animation tools. By extending the phoneme dictionary, our method can also be used to produce multilingual lip-sync speech animation easily. Furthermore, since our method can synthesize acceptable and robust lip-sync character animation in real time, it can be used for many applications for which prior methods are inadequate, such as lip-sync animation prototyping, multilingual animation reproduction, avatar speech, and mass animation production.

Our model still has some weaknesses. For example, it currently infers the dynamics of motion solely from the training data set. If the training data set does not contain speech similar to the synthesis target, the results may be inaccurate. For example, if the training data set contains only ordinary speech, it will be unsuitable for synthesizing a singing character, because the typical phoneme behavior when singing a song varies greatly from ordinary speech and imposes more challenges for dynamics modeling. A second weakness is that in our DAMs, we used a Gaussian-based form to model the dominance functions. A potential problem is that while singing a song, certain phonemes may extend indefinitely with dragging sounds. It is not only difficult for a speech recognizer to identify the ending time, but the Gaussian-based form also cannot accommodate such effects. One possible solution is to model the dominance functions with greater variability and nonsymmetric models.

APPENDIX

CROSS-MAPPING

Although the input of our system is an animation control signal, to ease the effort of adjusting the character (lip) model, we also provide a method to cross map the captured lip motion to the animation control signal. After the lip motion is captured, the key lip shapes L_k are identified first, which can be pointed out by the artist or by using an unsupervised clustering algorithm, e.g., [40]. The captured key lip shapes L_k are then used to fit the captured lip motion L^i for each frame i by using the nonnegative least-squares (NNLS) algorithm to obtain the blending coefficients α_k^i. This process can be expressed as the following constrained minimization:

\min \left\| L^i - \sum_{k=1}^{K} \alpha_k^i L_k \right\|^2, \quad \forall \alpha_k^i \ge 0

where K is the number of identified key lip shapes. The above clustering and fitting process for the captured lip motion needs to be performed only once. If the target character model has some well-defined bases, it is better to assign the key lip shapes to the bases manually, since the blending coefficients α_k^i can then be used as the control signal C^i directly without further processing.
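A sketch of this per-frame fit with an off-the-shelf NNLS solver is given below; the flattened vertex data are placeholders, and the routine is illustrative rather than the pipeline used in this paper.

import numpy as np
from scipy.optimize import nnls

def fit_blending_coefficients(key_shapes, frames):
    # Solve min || L^i - sum_k alpha_k^i L_k ||^2 with alpha_k^i >= 0 for every frame;
    # key_shapes is (K, 3V) and frames is (n, 3V) flattened lip vertices.
    A = key_shapes.T                                          # columns are the key lip shapes L_k
    return np.array([nnls(A, frame)[0] for frame in frames])  # (n, K) nonnegative weights

# Placeholder data: 5 key lip shapes over 40 lip vertices, 100 captured frames.
rng = np.random.default_rng(1)
key_shapes = rng.random((5, 120))
true_weights = rng.random((100, 5))
frames = true_weights @ key_shapes
alphas = fit_blending_coefficients(key_shapes, frames)        # recovers nonnegative weights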

To cross map the input captured lip motion to the target character model, the identified key lip shapes L_k are first used to guide the artist to adjust the vertices V on the lips of the target character model to imitate the key lip shapes L_k while keeping the character's characteristics. The number of adjusted vertices should be equal to or more than that of the character controls C (i.e., ‖V‖ ≥ ‖C‖) for solving the constrained minimization in the next paragraph. Then, the blending coefficients α_k^i are used to blend the adjusted lip vertices V_k for the key lip shapes L_k to obtain the lip vertices V^i for each frame i via

V^i = \sum_{k=1}^{K} \alpha_k^i V_k.

Instead of using the lip vertices V^i for training directly, for better training or synthesis results and animation pipeline integration, the training and synthesizing are performed on character controls. Assuming there are K animation controls C_k ∈ C, which can be used to drive the lip motions of the target character model, the animation control signal C_k^i for each frame i and control k can be obtained by fitting the lip vertices V^i as the following constrained minimization:

\min \left\| V^i - \sum_{k=1}^{K} V(C_k^i) \right\|^2

where V(·) denotes the transfer function from the control signal to lip vertices, and each animation control C_k^i ∈ C^i is constrained to [0, 1]. Again, it is solved by the NNLS algorithm.

References

[1] G. Abrantes and F. Pereira, "MPEG-4 facial animation technology: Survey, implementation, and results," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 2, pp. 290–305, Mar. 1999.

[2] L. Williams, "Performance-driven facial animation," in Proc. SIGGRAPH, 1990, pp. 235–242.

[3] B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin, "Making faces," in Proc. SIGGRAPH, 1998, pp. 55–66.

[4] W.-C. Ma, A. Jones, J.-Y. Chiang, T. Hawkins, S. Frederiksen, P. Peers, M. Vukovic, M. Ouhyoung, and P. Debevec, "Facial performance synthesis using deformation-driven polynomial displacement maps," ACM Trans. Graph., vol. 27, no. 5, p. 121, 2008.

[5] T. Weise, S. Bouaziz, H. Li, and M. Pauly, "Realtime performance-based facial animation," ACM Trans. Graph., vol. 30, no. 4, pp. 77:1–77:10, 2011.

[6] F. Pighin and J. P. Lewis, "Performance-driven facial animation: Introduction," in Proc. SIGGRAPH Course Notes, no. 30, 2006.

[7] F. I. Parke and K. Waters, Computer Facial Animation, 2nd ed. Natick, MA: AK Peters, 2008.

[8] P. Ekman and W. V. Friesen, Manual for the Facial Action Coding System. Palo Alto, CA: Consulting Psychologist Press, 1977.

[9] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H. Salesin, "Synthesizing realistic facial expressions from photographs," in Proc. SIGGRAPH, 1998, pp. 75–84.

[10] I. Buck, A. Finkelstein, C. Jacobs, A. Klein, D. H. Salesin, J. Seims, R. Szeliski, and K. Toyama, "Performance-driven hand-drawn animation," in Proc. NPAR, 2000, pp. 101–108.

[11] L. Zhang, N. Snavely, B. Curless, and S. M. Seitz, "Spacetime faces: High resolution capture for modeling and animation," ACM Trans. Graph., vol. 23, no. 3, pp. 548–558, 2004.

[12] J.-Y. Noh and U. Neumann, "Expression cloning," in Proc. SIGGRAPH, 2001, pp. 277–288.

[13] H. Pyun, Y. Kim, W. Chae, H. W. Kang, and S. Y. Shin, "An example-based approach for facial expression cloning," in Proc. SCA, 2003, pp. 167–176.

[14] J. Chai, J. Xiao, and J. Hodgins, "Vision-based control of 3-D facial animation," in Proc. SCA, 2003, pp. 193–206.

[15] K. Na and M. Jung, "Hierarchical retargetting of fine facial motions," Comput. Graph. Forum, vol. 23, no. 3, pp. 687–695, 2004.

[16] R. W. Sumner and J. Popovic, "Deformation transfer for triangle meshes," ACM Trans. Graph., vol. 23, no. 3, pp. 399–405, 2004.

[17] J. P. Lewis, J. Mooser, Z. Deng, and U. Neumann, "Reducing blendshape interference by selected motion attenuation," in Proc. I3D, 2005, pp. 25–29.

[18] Z. Deng, P.-Y. Chiang, P. Fox, and U. Neumann, "Animating blendshape faces by cross-mapping motion capture data," in Proc. I3D, 2006, pp. 43–48.

[19] B. Choe, H. Lee, and H.-S. Ko, "Performance-driven muscle-based facial animation," J. Visualization Comput. Animation, vol. 12, no. 2, pp. 67–79, May 2001.

[20] E. Sifakis, I. Neverov, and R. Fedkiw, "Automatic determination of facial muscle activations from sparse motion capture marker data," ACM Trans. Graph., vol. 24, no. 3, pp. 417–425, 2005.

[21] Z. Deng and U. Neumann, "eFASE: Expressive facial animation synthesis and editing with phoneme-isomap control," in Proc. SCA, 2006, pp. 251–259.

[22] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3-D faces," in Proc. SIGGRAPH, 1999, pp. 187–194.

[23] E. S. Chuang, H. Deshpande, and C. Bregler, "Facial expression space learning," in Proc. Pacific Graph., 2002, pp. 68–76.

[24] Y. Wang, X. Huang, C.-S. Lee, S. Zhang, Z. Li, D. Samaras, D. Metaxas, A. Elgammal, and P. Huang, "High resolution acquisition, learning and transfer of dynamic 3-D facial expressions," Comput. Graph. Forum, vol. 23, no. 3, pp. 677–686, 2004.

[25] D. Vlasic, M. Brand, H. Pfister, and J. Popovic, "Face transfer with multilinear models," ACM Trans. Graph., vol. 24, no. 3, pp. 426–433, 2005.

[26] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: Driving visual speech with audio," in Proc. SIGGRAPH, 1997, pp. 353–360.

[27] M. Brand, "Voice puppetry," in Proc. SIGGRAPH, 1999, pp. 21–28.

[28] E. Chuang and C. Bregler, "Mood swings: Expressive speech animation," ACM Trans. Graph., vol. 24, no. 2, pp. 331–347, 2005.

[29] E. Sifakis, A. Selle, A. Robinson-Mosher, and R. Fedkiw, "Simulating speech with a physics-based facial muscle model," in Proc. SCA, 2006, pp. 261–270.

[30] Y. Cao, P. Faloutsos, E. Kohler, and F. Pighin, "Real-time speech motion synthesis from recorded motions," in Proc. SCA, 2004, pp. 345–353.

[31] T. Ezzat, G. Geiger, and T. Poggio, "Trainable videorealistic speech animation," ACM Trans. Graph., vol. 21, no. 3, pp. 388–398, 2002.

[32] Y.-J. Chang and T. Ezzat, "Transferable videorealistic speech animation," in Proc. SCA, 2005, pp. 143–151.

[33] Z. Deng, U. Neumann, J. P. Lewis, T.-Y. Kim, M. Bulut, and S. Narayanan, "Expressive facial animation synthesis by learning speech coarticulation and expression spaces," IEEE Trans. Visualization Comput. Graph., vol. 12, no. 6, pp. 1523–1534, Nov.–Dec. 2006.

[34] I.-J. Kim and H.-S. Ko, "3D lip-synch generation with data-faithful machine learning," Comput. Graph. Forum, vol. 26, no. 3, pp. 295–301, 2007.

[35] K. Wampler, D. Sasaki, L. Zhang, and Z. Popovic, "Dynamic, expressive speech animation from a single mesh," in Proc. SCA, 2007, pp. 53–62.

[36] A. Lofqvist, "Speech as audible gestures," in Speech Production and Speech Modeling. Berlin, Germany: Springer, 1990, pp. 289–322.

[37] M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Proc. CA, 1993, pp. 139–156.

[38] P. Cosi, E. M. Caldognetto, G. Perin, and C. Zmarich, "Labial coarticulation modeling for realistic facial animation," in Proc. ICMI, 2002, pp. 505–510.

[39] S. Kshirsagar and N. Magnenat-Thalmann, "Visyllable based speech animation," Comput. Graph. Forum, vol. 22, no. 3, pp. 632–640, 2003.

[40] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[41] X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and R. Rosenfeld, "The SPHINX-II speech recognition system: An overview," Comput. Speech Language, vol. 7, no. 2, pp. 137–148, 1993.

[42] Y. Zhang, Q. Ji, Z. Zhu, and B. Yi, "Dynamic facial expression analysis and synthesis with MPEG-4 facial animation parameters," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 10, pp. 1383–1396, Oct. 2008.

Yu-Mei Chen received the B.S. and M.S. degrees in information management and computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2008 and 2010, respectively.

She is currently with National Taiwan University. Her current research interests include computer animation.

Fu-Chun Huang received the B.S. and M.S. degrees in information management from National Taiwan University, Taipei, Taiwan, in 2005 and 2007, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley.

His current research interests include computer graphics.


Shuen-Huei Guan received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2002 and 2004, respectively. He is currently pursuing the Ph.D. degree with the Graduate Institute of Networking and Multimedia, National Taiwan University.

He is currently a Research and Development Manager with Digimax, Inc., Taipei. His current research interests include computer graphics, stereoscopic vision, and computer animation.

Bing-Yu Chen (S'02–M'03–SM'12) received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1995 and 1997, respectively, and the Ph.D. degree in information science from the University of Tokyo, Tokyo, Japan, in 2003.

He is currently a Professor with National Taiwan University. His current research interests include computer graphics, image and video processing, and human–computer interaction.

Dr. Chen is a Senior Member of ACM and a member of Eurographics. He has been the Secretary of Taipei ACM SIGGRAPH since 2010.