
Bachelor of Science in Digital Game Development
June 2022

Performance benchmarks of lip-sync scripting in Maya using speech recognition
Gender bias and speech recognition

Adrian Björkholm

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Bachelor of Science in Digital Game Development. The thesis is equivalent to 10 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:
Author(s): Adrian Björkholm
E-mail: [email protected]

University advisor: Associate Professor Veronica Sundstedt
Department of Computer Science

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background. Automated lip sync is used in animation to create facial animations with minimal intervention from an animator. A lip-syncing script for Maya has been written in Python using the Vosk API to transcribe voice lines from audio files into instructions in Maya, automating the pipeline for speech animations. Previous studies have mentioned that some voice transcription and voice recognition APIs have a gender bias and do not read female voices as well as male voices. Does gender affect this lip-syncing script's performance in creating animations?

Objectives. Benchmark the performance of a lip-syncing script that uses voice transcription by looking for a gender bias in the voice transcription API, comparing male and female voices as input. If there is a gender bias, how much does it affect the produced animations?

Methods. The script's perceived performance is evaluated by conducting a user study through a questionnaire. The participants rate different animation attributes to build a picture of a potentially perceived gender bias in the script. The transcribed voice lines are also analyzed for an objective view of a possible gender bias.

Results. The transcriptions were almost perfect for both male and female vocal lines, with a single transcription error for one word in one of the male-voiced lines. The male- and female-voiced lines received very similar gradings when analyzing the questionnaire data. On average, the male voice lines received a slightly higher rating on most voice lines in the different criteria, but the score difference was minimal.

Conclusions. There is no gender bias in the lip-syncing script. The accuracy experiment showed a very similar accuracy rate between the male and female vocal lines; the female-voiced lines were slightly more accurate, with a difference of one word. The male voice lines received slightly higher perceived scores in the questionnaire, but this is attributed to factors other than a possible gender bias.

Keywords: Lip-syncing, speech recognition, Animation, Maya, Python

Contents

Abstract

1 Introduction
  1.1 Background
  1.2 Scope
    1.2.1 Research questions
    1.2.2 Objectives
    1.2.3 Research hypothesis: Gender bias
    1.2.4 Null hypothesis: Gender bias
  1.3 Sustainability aspects
  1.4 Ethical and societal aspects

2 Related Work
  2.1 Gender bias in voice recognition
  2.2 Solution without machine learning
  2.3 Machine learning-based approaches
  2.4 Decision to use inspiration from older research and apply it to the script

3 Method
  3.1 Research methods
  3.2 Implementation of script
    3.2.1 Voice recognition API
    3.2.2 Timeline calculation
    3.2.3 Phoneme recognition and viseme toggling
    3.2.4 Applying timeline data to the Maya scene
  3.3 Questionnaire user study
    3.3.1 Random order limitation
    3.3.2 Other limitations
    3.3.3 Configuration and preferences
  3.4 Accuracy in voice transcription experiment

4 Results and Analysis
  4.1 Questionnaire
    4.1.1 Questionnaire demographic
    4.1.2 Results and analysis of the questionnaire
  4.2 Accuracy experiment

5 Discussion
  5.1 Hypothesis vs. results
  5.2 Discussing score gap reason
  5.3 Participation feedback
    5.3.1 Poor blendshapes
    5.3.2 Longer sentences and co-articulation
  5.4 Script based accuracy

6 Conclusions and future work
  6.1 Concluded summary
  6.2 Future work suggestions based on RQ answers
  6.3 Components to improve
  6.4 Ideal testing environment

References

A Supplemental Information

Chapter 1

Introduction

The introduction chapter discusses the purpose of using a lip-syncing script, the thesis scope, the research questions, and what objectives need to be completed to answer the research questions. Two hypotheses about the results and a short discussion of ethics and sustainability aspects are also included.

1.1 Background

Animation in video games and other forms of media is time-consuming but a must for an immersive experience. It is sometimes expensive to animate even short sections of a game or a movie, so it is beneficial for the production to take shortcuts and use helper tools to save money and production time. Lip-syncing is a technique where the voice data of an audio file gets analyzed by, e.g., animation software, which animates the lip movements for the animator to save time. Lip-syncing by hand can be a tedious task that takes time and is expensive. Some animation software uses lip-syncing engines to save time by analyzing an audio file and converting the data to animation instructions: the audio file gets scanned for the timestamp of each pronounced phoneme. The result can help the animator save time. A phoneme is a unique sound/tone the voice makes. For example, the words "sitting" and "site" both use the letter "I", but they sound different from each other because they have different phonemes [14].

Automated lip-syncing in games and computer-generated imagery is not a new technique. Lip-syncing pipelines have been created before in which audio files are analyzed for phonemes and instructions are given for what viseme should be used [19]. A viseme is a mouth shape that is shared by more than one phoneme [20]. Many modern lip-syncing pipelines use machine learning to recognize what phoneme is said and to calculate appropriate visemes [15] [6]. A few studies about voice recognition APIs and transcribing audio files mentioned that the programs have more challenges with understanding female speakers than male speakers [13] [10].

1.2 Scope

This thesis aims to benchmark the performance of a lip-syncing script created for the modeling software Maya. The first step is to create a Python script connected to a speech transcription API that translates voice data in an audio file into instructions in Maya that make a face model's lips mimic the audio file.


Secondly, two experiments will be conducted. In the first, the script's correctly transcribed words in the audio file are counted to see how impactful the potential transcription errors are. The second experiment evaluates the produced animations' appeal, realism, and accuracy using a questionnaire.

1.2.1 Research questions

Previous work mentions that gender has also influenced accuracy in voice recognition [13] [10]. The research questions will determine a ground truth for the existence of a gender bias in the voice transcription component of the script. The research questions will also be used to see how visible the potential gender bias is by measuring the perceived level of appeal, realism, and accuracy in the created animations.

RQ1: How accurate is phoneme detection using a script-based approach when comparing identical male and female voice lines?

RQ2: What is the perceived appeal, realism, and accuracy of lip animations between male and female characters and across voice line complexity?

1.2.2 Objectives

• Create a lip-syncing script in Maya to detect phonemes in an audio file and map out when they appear in a timeline. Transfer the information to Maya's animation timeline and make a mesh toggle between blend shapes for each keyframe based on what phoneme was found at that keyframe.

• Write and record voice lines with a male and a female voice to create material to benchmark the script.

• Find a mesh representing a face to animate blend shapes representing different visemes.

• Animate the visemes.

• Run the lip-syncing script and export the produced animations to a video format.

• Experiment by counting the transcribed words and comparing the transcriptions to the written script. Evaluate whether potential transcription errors would cause a visibly incorrect animation.

• Create an experimental design where test participants evaluate the realism and appeal of the animations on a 1 to 5 scale in a questionnaire.

• Conduct a pilot form study with the questionnaire.

• Use feedback to improve the form study with the questionnaire.

• Conduct an actual form study with the questionnaire.


• Analyze the questionnaire data. Check whether the male and female versions of the same voice line have similar realism and appeal ratings, or whether the male or female animations score consistently higher, to determine if the script performs differently depending on the voice actor's gender.

• Analyze and check if the average appeal and realism score is high overall to determine how effective the script is.

1.2.3 Research hypothesis: Gender bias

When measuring the voice recognition component's accuracy between the male and female voiced lines, the female-voiced lines will have a lower number of correctly transcribed words, which will lead to less accurate phonetic conversion and create a more out-of-sync animation. Poor transcription will lead to lower grading of the questionnaire's female voice clips.

1.2.4 Null hypothesis: Gender bias

There will be no difference in the number of words detected by the lip-syncing script's voice recognition component. Therefore, there is no noticeable quality gap between the male and female versions of the animations, and the accuracy, realism, and appeal scores in the questionnaire are identical.

1.3 Sustainability aspects

The script automates tasks that are supposed to be done by hand. The script figures out what phoneme the voice makes for each keyframe and picks a correct viseme on the mesh for each phoneme in the timeline. The script does these tasks instead of the animator and saves time.

1.4 Ethical and societal aspects

The study uses volunteering participants to answer some questions in a questionnaire and follows an internal review checklist from the Ethical Advisory Board in southeast Sweden. Participants are not asked sensitive questions about themselves. The participants are all eighteen or older. Participation is voluntary, and the study does not contain any physical or medical procedures. There is no risk or intention of physical or psychological harm in the study. Participation is also anonymous and can therefore not be linked to any specific individuals.

Chapter 2

Related Work

This chapter mentions research that implies or claims the existence of gender bias in voice recognition. Afterward, it introduces research on a lip-syncing solution that inspired the lip-syncing script in this thesis and, at the end, a few more modern lip-syncing approaches.

2.1 Gender bias in voice recognition

The lip-syncing script used in this thesis uses a form of voice recognition as the input for the data that is turned into animations. In a study where a drawing software used voice as controller input, one participant could not be part of the test because the software could not support his accent. A pattern during the experiment also showed that female participants had a hard time being understood by the voice recognition component in the software [10]. In another study, a forensics transcription tool was created for transcribing data that could be evidence on phones, and multiple voice recognition tools were used for different situations. It was mentioned that Mozilla's DeepSpeech software still has a bias toward understanding native English-speaking males better than females [13]. Another study, which did not mention gender bias but another voice recognition problem, created a simple game that used voice input as a controller. However, the word used to call some features had to be replaced because the voice recognition software had problems understanding it [14]. Because multiple studies have mentioned struggles with voice input in voice recognition, it is essential to ensure that the script's voice recognition input is accurate for both male and female speakers.

2.2 Solution without machine learning

One way to create a lip-syncing program and scan for phonemes without transcribing text or using machine learning is to look for diphones. Diphones are audio clips of voice data, usually spanning the middle of two phonemes, and are used to find phonemes in the audio; afterward, the program is told which viseme is suitable at which moment. This lip-sync solution uses blend shapes that an animator creates beforehand, one for each viseme, and for every detected phoneme, instructions for showing a matching viseme in the animation at the same time frame are created. A blend shape is a mesh with modified vertices, usually controlled by a slider between 0 and 1. When the slider equals 0, the mesh has its default shape, and when the slider equals 1, it has a modified appearance.


This is an alternative method for animating geometry. It is also mentioned that if a lip-syncing script interpolates linearly between different blend shape visemes, the animation can look unnatural because the transitions do not model co-articulation. A method to tackle this problem is to use animation curves that make the animation speed less linear. Animation curves can limit the animation speed for specific movements and make co-articulation look more natural [19]. Natural co-articulation is something this script has not considered, which can cause a lower quality score on more advanced voice lines.
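To make the blend shape slider described above concrete, the following is a minimal Maya Python sketch. It assumes a neutral face mesh and one sculpted viseme target already exist in the scene; the mesh and node names are placeholders, and this is neither code from the cited solution nor from the thesis script.

    import maya.cmds as cmds

    # Create a blend shape node that deforms "face_neutral" toward the sculpted
    # target mesh "viseme_AA_target" (both names are placeholders).
    blend_node = cmds.blendShape("viseme_AA_target", "face_neutral",
                                 name="faceBlends")[0]

    # The target weight acts as the 0-to-1 slider described above.
    cmds.setAttr(blend_node + ".viseme_AA_target", 0.0)  # default shape
    cmds.setAttr(blend_node + ".viseme_AA_target", 1.0)  # fully deformed viseme

Keyframing that same weight attribute over time is what turns the slider into an animation, which is how the script described in Chapter 3 drives the visemes.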

2.3 Machine learning-based approaches

One machine learning-based lip-sync study uses a recurrent neural network with long short-term memory and feeds the neural network with footage of, in this case, Barack Obama talking. The neural network tracks Obama's lip movements in the footage and converts the lip positions to vertex data. This solution is more similar to deepfake video production than to creating avatars. The animation gets applied to an image plane with a face that resembles Obama. When an audio file gets imported into the solution, a video gets rendered that tries to mimic the audio file based on how Obama's lips moved when he said similar things. This solution could use voice clips of other people and make the Obama footage mimic the speaking person's voice [15]. Another study uses a similar approach by feeding a neural network audio and video of a man talking. The neural network cross-references the spoken audio with the visual visemes in the video of the man speaking. After inputting a targeted audio dialogue, the program can output what visemes should be applied as blend shapes on an avatar. This solution creates not only lip animation but facial movement too [16]. Applying facial movement via machine learning has also been used to create even more advanced animations [11]. The data being fed was of a person talking in different emotional states, such as happy, sad, and angry, profiling the different emotions into speaking styles by analyzing the talk's tone and word phrasing [11].

2.4 Decision to use inspiration from older research and apply it to the script

Most modern lip-syncing research follows a machine learning-based approach, but building a program with machine learning and facial tracking is very advanced technology for a 10-week bachelor thesis. A solution without machine learning and facial tracking is a more suitable process, which makes the older non-machine-learning-based approach better in this scenario.

Chapter 3

Method

The method chapter starts by explaining the choice of methods used to answer the research questions and why they are considered more practical than the alternatives. Afterward, it explains how the script was created and why specific designs were chosen. The chapter ends with a deeper explanation of how the methods are applied and mentions their limitations.

3.1 Research methods

To answer RQ1, an accuracy experiment is conducted by analyzing the transcribed data. Analyzing and comparing against the written voice line script is probably the most efficient method to see how accurate the voice transcription component is. Furthermore, a deeper look is taken at how the phoneme instructions change if the transcription fails somewhere.

RQ2 gets answered by conducting a user study through an online questionnaire. Using a questionnaire means that the participants are not limited by their physical location; they can participate as long as they have an internet connection. One disadvantage is that their experimentation environment is not fully controlled. Depending on internet speed and computer or phone quality, their perceived view of the animations could vary.

Both research questions and their methods are very co-dependent in creating reliable data. The method for RQ1 is the most critical component for proving whether there is a gender bias in the program, because once the data has been transcribed, there is no gender bias that can cause worse performance outside of the transcription. If a gender bias were discovered with RQ1's method, RQ2's method could show how much impact the gender bias has on the produced animations. However, even if there were a difference according to RQ2's method in the perceived quality between the male and female voiced lines, RQ1's method would be used to prove whether a gender bias is the cause of a supposed quality gap between the male and female voiced lines.

RQ1 could also have been answered by writing a critical article review comparing research on gender biases in different voice recognition APIs. This would make the research question's scope broader by summarising different voice recognition components. However, not benchmarking the transcription accuracy would weaken gender bias claims because perception is subjective.


An alternative method for answering RQ2 could be to conduct interviews with participants and ask their opinion about the animations. This method could give a much more in-depth understanding of the quality of the animation. However, it would be more time-consuming to gather participants, leading to lower participation in the study and potentially a smaller spread of views on the animations.

3.2 Implementation of script

The following subsections describe how the script was created.

3.2.1 Voice recognition API

The script is meant to read audio files and transcribe them into instructions for the modeling software Maya 2022 that toggle between blend shapes representing different visemes, mimicking how the voice talks. Two suitable methods could be transcribing audio files directly to phonemes, or transcribing the audio to text and afterward converting the text to phoneme data. The documentation for the tool "pocketsphinx", which can be used for phoneme recognition, mentions that the transcription results could be disappointing and that voice recognition is heavily reliant on context from word recognition [7]. Dmytro Nikolaiev (Dimid) has posted a solution [9] that shows how to receive a transcribed list of all the spoken words in an audio file. The words contain timestamps of how far into the audio file each word starts and ends, using the voice recognition package "vosk", which supports Python version 3.7.7 [3], the default Python version Maya 2022 runs. Therefore, using "vosk" to transcribe audio to text seems the most problem-free method to collect the voice data. A voice recognition model that contains instructions for reading a specific language needs to be loaded, and this script uses "vosk-model-en-us-0.22" from [4].
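As a rough sketch of how such word-level timestamps can be obtained with the Vosk package, following the general idea of the solution in [9]: the audio file name is a placeholder, the audio is assumed to be mono PCM WAV, and this is not the thesis script's exact code.

    import json
    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("vosk-model-en-us-0.22")      # folder with the language model
    audio = wave.open("voice_line.wav", "rb")   # placeholder file name

    recognizer = KaldiRecognizer(model, audio.getframerate())
    recognizer.SetWords(True)                   # ask for per-word timestamps

    words = []
    while True:
        data = audio.readframes(4000)
        if len(data) == 0:
            break
        if recognizer.AcceptWaveform(data):
            words.extend(json.loads(recognizer.Result()).get("result", []))
    words.extend(json.loads(recognizer.FinalResult()).get("result", []))

    # Each entry looks like {"word": "hello", "start": 0.25, "end": 0.9, "conf": ...}
    print(words)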

3.2.2 Timeline calculation

A list gets created with the length of the imported audio file in seconds multiplied by the animation's frame rate, or keyframes per second. The list is created to store all keyframe instructions somewhere before transferring the data to the animation timeline in Maya. When the keyframe timeline has just been created, it is a list filled with chars containing a '.' for each keyframe. The char '.' means "do nothing" when the script reads the list, and the list will look like Table 3.1, which represents a second-long audio file with silence if the animation's frame rate is 24 keyframes per second. When adding a word to the timeline, a few variables are needed to calculate how the timeline gets shaped: keyframes per second (kps) and the number of chars in the word, or "char count" (cc) for short. With the word's start time (st) and end time (et) in the audio file, the duration (d) can be calculated as et - st = d. When the word duration has been calculated, the program calculates the average number of seconds needed to say each character, also called seconds per char (spc). When the duration of the word and the gap between chars have been calculated, it is time to calculate where each character in the word will be placed on the timeline.


. . . . . . . . . . . . . . . . . . . . . . . .

Table 3.1: Illustration of the animation timeline without phoneme data. Each character represents pronounced phoneme data at each frame. The dots mean that no new sound was pronounced at that time frame.

. . . . . . H . . E . . L . . L . . O . . . . .

Table 3.2: List timeline containing instructions for saying the word "hello". Each letter character represents the pronounced phoneme at that keyframe.

If the word "hello" were placed on the timeline, each char in the string is numbered from 0 to the last letter, as in Table 3.3. The char number is called the char index (ci). The placement of one char on the timeline is calculated as (st + ci * spc) * kps ≈ keyframe. Table 3.4 lists all the acronyms used in the math and their full names, and Table 3.5 shows the keyframe calculations directly without words. The value needs to be rounded because the timeline contains integers, but the keyframe placement is more accurate depending on the frame rate, or kps. For an audio clip with a length of one second where a voice only says hello, with the word hello starting 0.25 seconds and ending 0.9 seconds after the audio clip starts, the keyframe timeline should look like Table 3.2 if the frame rate is 24 kps. A second list object in Python is created from the list of transcribed words to keep track of the moments when the character is supposed to keep its mouth closed. The list logs the time between when a word ends and when the next word starts in the transcription. For the final word, a timestamp is created between the end of the word and the end of the timeline to ensure the character has its mouth closed when the dialogue is over, and the same thing is done between the start of the timeline and the first word. This list keeps track of all the moments of silence in the animation.
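A minimal Python sketch of this placement step is shown below. It assumes the transcription produced a list of word dictionaries with start and end times in seconds; the function and variable names are illustrative rather than the thesis script's actual code, and the printed result reproduces the "hello" example in Table 3.2.

    def build_timeline(words, kps=24, clip_length_s=1.0):
        """Place one character of each transcribed word per keyframe slot.

        words: list of dicts such as {"word": "HELLO", "start": 0.25, "end": 0.9}.
        Returns a list of chars where '.' means "do nothing" on that keyframe.
        """
        timeline = ['.'] * round(clip_length_s * kps)
        for w in words:
            st, et, text = w["start"], w["end"], w["word"]
            d = et - st              # duration of the word
            spc = d / len(text)      # seconds per char
            for ci, ch in enumerate(text):
                keyframe = round((st + ci * spc) * kps)
                if keyframe < len(timeline):
                    timeline[keyframe] = ch
        return timeline

    # One-second clip at 24 kps where "hello" is spoken between 0.25 s and 0.9 s
    print("".join(build_timeline([{"word": "HELLO", "start": 0.25, "end": 0.9}])))
    # prints: ......H..E..L..L..O.....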

3.2.3 Phoneme recognition and viseme toggling

Because the script manages the text phonetics through the IPA alphabet, "hello" would instead look like "h@lo". Phonetics are used to prevent the problem where the "I" in "Sitting" and "Site" sounds different but is the same character when reading symbols from the timeline, because they are different phonemes. Therefore, the Python package "eng to IPA" [17] is used to convert the voice transcriptions to phonetics. The script needs to know how many visemes the animation will be able to toggle between and what phonemes will trigger what viseme. This lip-sync script follows a table from the Oculus documentation [12], which splits all the phonemes between twelve unique visemes.

index   0   1   2   3   4
char    H   E   L   L   O

Table 3.3: A string containing the text "Hello". The indexes explain what character is in each slot of the string.


Keyframes per second    kps
char count              cc
start time              st
end time                et
duration                d
seconds per char        spc
char index              ci

Table 3.4: Explanations of all the acronyms in full text. These variables are used when calculating time slots for each keyframe.

et - st = d
d / cc = spc
(st + ci * spc) * kps ≈ keyframe

Table 3.5: The mathematical formulas used to calculate where each pronounced phoneme appears in the timeline.

To save time in setting up the project, the produced animations in the thesis use a mesh that has already been created and is free to use. The face mesh in this project is created by Wael Tsar [18], Figure 3.1, and the teeth mesh is created by Alexander Antipov [5], Figure 3.2. Both meshes are under the Creative Commons 4.0 license [8]. The face and teeth meshes do not have any blend shapes created for expressing visemes, so these were created in this study based on the previously mentioned viseme table [12]. When binding the phonetic symbols to the different visemes, a table containing all the symbols with their HTML codes was used to list every possible phonetic symbol the program could produce [1]. The list also contained, in a few cases, what alphabetic letter from standard text each phonetic symbol is equal to. When it was not clear what phoneme a symbol was equal to, an additional table that pronounced every symbol was used [2].
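To illustrate the conversion from transcribed words to viseme instructions, the sketch below uses the "eng_to_ipa" package [17] and a small lookup dictionary. The mapping shown is only an illustrative excerpt, not the full Oculus viseme table [12] or the thesis script's actual mapping; symbols without an entry are treated as undefined, and stress marks and punctuation are stripped because the timeline reader cannot use them.

    import eng_to_ipa as ipa

    # Illustrative, partial IPA-symbol-to-viseme mapping (the real table in the
    # thesis follows the Oculus documentation and covers twelve visemes).
    PHONEME_TO_VISEME = {
        "p": "PP", "b": "PP", "m": "PP",
        "f": "FF", "v": "FF",
        "s": "SS", "z": "SS",
        "l": "nn", "h": "kk",
        "ə": "E", "o": "oh", "ʊ": "ou",
    }

    def word_to_visemes(word):
        """Convert an English word to IPA symbols and look up a viseme for each."""
        phonetic = ipa.convert(word)
        # Strip stress marks, apostrophes, and commas the timeline reader cannot use
        cleaned = [ch for ch in phonetic if ch not in "ˈˌ'’,*. "]
        return [(ch, PHONEME_TO_VISEME.get(ch, "undefined")) for ch in cleaned]

    print(word_to_visemes("hello"))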

3.2.4 Applying timeline data to the Maya scene

When the script starts, it uses the extension pymel, the communication bridge between Maya and Python, to list all the blend shape nodes in the scene and add them to a list. The user has access to a drop-down menu for each viseme to select which blend shape corresponds to which viseme, and also selects a directory path to the audio file the script will transcribe and lip-sync, Figure 3.3. The program goes through the entire timeline list, and whenever the indexed char is something other than '.', all blend shapes get a keyframe that sets their slider value to 0 to prevent two blend shapes from combining on a keyframe. Afterward, the program checks precisely what phoneme the symbol is equal to, finds the viseme that represents it, and gives that blend shape's slider a keyframe with the value 1 at that timeline index. When all the dialogue movement has been added, it is time for some polishing of the animation. The previously mentioned list of timestamped silent moments is checked to see whether their duration is longer than a specific length.


Figure 3.1: Face mesh used for creating animations.


Figure 3.2: Teeth mesh used for creating animations.

The length is configured by the script user and, for the animations produced in this thesis, is set to 5/24 seconds. If the silence is 5/24 seconds or longer, keyframes are added that make the character keep its mouth closed between the words, to prevent the mouth from staying open or making slow lip movements between words that might look abnormal.
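A minimal sketch of this keying step is shown below, assuming the timeline list from Section 3.2.2 and a mapping from phonetic chars to blend shape weight attributes already exist. The names are illustrative, and the sketch is simplified compared to the thesis script; for example, the mouth-closing polish for silent stretches is omitted.

    import pymel.core as pm

    def apply_timeline(timeline, viseme_for_char):
        """Key blend shape weights in the Maya scene from the timeline list.

        timeline: list of chars where '.' means "no new phoneme on this keyframe".
        viseme_for_char: dict mapping a phonetic char to a blendShape weight
                         attribute, e.g. {"h": pm.PyNode("faceBlends").viseme_kk}.
        """
        all_targets = list(viseme_for_char.values())
        for frame, ch in enumerate(timeline):
            if ch == '.':
                continue
            target = viseme_for_char.get(ch)
            if target is None:
                continue  # undefined symbol, e.g. apostrophes from the IPA step
            # Zero every viseme on this frame so two blend shapes never combine...
            for other in all_targets:
                pm.setKeyframe(other, time=frame, value=0)
            # ...then key the matching viseme fully on.
            pm.setKeyframe(target, time=frame, value=1)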

3.3 Questionnaire user study

The user study conducted to evaluate the script's performance in creating accurate, appealing, and realistic animations was done through Google Forms. The voice lines are split into three advancement levels: "Basic words" (bw), lines 1-7, "Basic sentences" (bs), lines 8-11, and "Advanced sentences" (as), lines 12-15 and 17. These three advancement groups help evaluate the script's lip-syncing at different speaking levels. Bw consists of recorded voice lines containing just one word, bs consists of simple lines saying a series of words that are not intended to emulate a real dialogue, and as consists of famous quotes from people and books used to emulate a more realistic dialogue. The unique voice lines can be seen in Table 3.7. The participants see a male- and a female-voiced animation of every voice line and rate the accuracy, realism, and appeal on a scale from one to five. To avoid scenarios where the number 4, for example, means a different thing to the participants and the researcher(s), instructions were created to explain what the values 1, 3, and 5 are equal to; see Table 3.6 for the rating instructions. Voice line 16 was rejected to have an even number of voice lines in the experiments.


Figure 3.3: The lip-syncing script's UI.


3.3.1 Random order limitation

Every video that the participants evaluated had its own section/page instead of having every video on the same page, to reduce loading time and lag/stutter. Giving each video an individual section also prevents the participant from re-voting on older videos they have already evaluated, because the animations are not meant to be compared with each other directly. The order of the animations was intended to be randomized to avoid voting based on a pattern. Google Forms does not support randomized sections, and if the content is randomized inside a section, there is no control ensuring that the proper rating input is next to the correct video, because each question and video is a unique block with a randomized placement. Because of this limitation, a randomized order that is the same for everyone was created. The clips are played in the order shown in Table 3.8 in the questionnaire. "M" stands for male and "F" for female, and the number is linked to what the voice line is saying. The number next to # is the playback order, and the following number represents the voice line number.

3.3.2 Other limitations

A bug was discovered in the script: the English-to-IPA package added apostrophes and commas to the transcribed words, which caused the script's animation timeline list to give instructions for some keyframes where the animation could not apply a new viseme because the char was undefined. On some computer screens with high resolution, the videos could appear small for the participant, and Google Forms does not support full-screen for embedded videos. To improve the participants' viewing quality, instructions were added on how to make the video larger in the browser on a desktop computer/laptop: "Tip: If the animated videos on the screen are too small, you can zoom in by pressing the Ctrl key and scrolling the mouse wheel to enlarge the video." To keep up with ethical standards, the user study only allows participants who are 18 or older, which is mentioned before the study starts. To participate in the user study, the participants need to mark a checkbox that says 18 or older. However, because the study is online and anonymous, there is no physical control to verify that the participants are 18 years old. The user study was only shared in places on the web targeted at an 18-or-older audience, e.g., Discord servers associated with Blekinge Institute of Technology, student activities, and public Discord servers about working with game development. The form was also shared on subreddits about game development.

3.3.3 Configuration and preferences

The videos used for testing had a screen resolution of 1920 x 1080 pixels and were exported in the QuickTime Player-supported video format ".mov", encoded as "H.261" with a quality of 100. The videos were not ray-trace rendered but directly exported from Maya with the playblast feature. The animations are phased at 24 keyframes per second, the silence between words that makes the character close its mouth must be 5/24 seconds long, and it takes 3/24 seconds for the character to close and open its mouth again when it starts talking again.


Accuracy rating
1: The mouth does not match the recorded voice
3: The mouth opens and closes to the recorded voice
5: The mouth's lips, tongue, and teeth match the recorded voice well

Realism rating
1: The mouth movements don't look human
3: The mouth movements remind me of human movements
5: The mouth movements look humanlike

Appeal rating
1: The overall animations are poor looking
3: The overall animations are neither poor nor look good
5: The overall animations look good

Table 3.6: The instructions given to the participants on how to grade the animations.

1: Alpha (Basic word)
2: Beta (Basic word)
3: Gamma (Basic word)
4: Breakfast (Basic word)
5: Lunch (Basic word)
6: Supper (Basic word)
7: Gluttony (Basic word)
8: Independence (Basic sentence)
9: Alpha, Beta, Gamma (Basic sentence)
10: Breakfast, Lunch, Supper (Basic sentence)
11: Hello my name is Alex (Basic sentence)
12: Whoever is happy will make others happy too (Advanced sentence)
13: I would rather die of passion than of boredom (Advanced sentence)
14: You miss one hundred percent of the shots you don't take (Advanced sentence)
15: Life is really simple, but we insist on making it complicated (Advanced sentence)
16: What counts can't always be counted; what can be counted doesn't always count (REJECTED!)
17: Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime (Advanced sentence)

Table 3.7: List of all the voice lines used in the animations for the user study and the accuracy test.

#1: f14, #2: m5, #3: m1, #4: f7, #5: f12, #6: f11, #7: m17, #8: m13, #9: m2, #10: m4, #11: f5, #12: f17, #13: f1, #14: f4, #15: m11, #16: m10, #17: m3, #18: m12, #19: m6, #20: f6, #21: f15, #22: m9, #23: m7, #24: f2, #25: m8, #26: f10, #27: f13, #28: f9, #29: m14, #30: m15, #31: f3, #32: f8

Table 3.8: Order of the presented clips in the user study.


3.4 Accuracy in voice transcription experiment

The questionnaire asks the participants how accurate the lip-sync is on the animations, whether the lip-sync accuracy appears low on the voice clips, and whether there is a perceived gender bias in the quality between the male- and female-voiced animations. The transcribed data needs to be analyzed and measured to see if the voice transcription component produces data of lower quality for the script to work with depending on who the speaker is. First, all the voice lines get transcribed into readable text. The number of correctly transcribed words is compared to the written script for each voice line, Table 3.7, to see which words the transcription component failed to understand. If a voice line turns out to be wrongly transcribed, the produced text and the original written line are converted to phonetics to see how different the phonetics are from their counterparts and whether the phonetics create visemes that do not appear to say the intended words. An additional test is conducted on the text by converting all transcribed voice lines to phonetics and afterward checking whether the exact amount and order of undefined characters is the same between the male- and female-voiced lines.
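The word-level comparison can be expressed in a few lines of Python. The sketch below assumes a simple position-by-position comparison, which the thesis does not specify exactly, and uses the one transcription error reported later in Section 4.2 (the male version of voice line 15) as its example.

    def word_accuracy(reference, transcription):
        """Count how many transcribed words match the written voice line."""
        ref = reference.lower().replace(",", "").replace(".", "").split()
        hyp = transcription.lower().split()
        correct = sum(1 for r, h in zip(ref, hyp) if r == h)
        return correct, len(ref), round(100.0 * correct / len(ref), 2)

    print(word_accuracy(
        "Life is really simple, but we insist on making it complicated",
        "life is really simple but we insist on making a complicated"))
    # prints: (10, 11, 90.91)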

Chapter 4

Results and Analysis

Here, the results of the conducted experiments are presented. Pie charts present the participants' demographic pool, and tables show the participants' perceived scores and the transcription performance. After analyzing the data, no gender bias appears to exist.

Figure 4.1: Participants' gender.

4.1 Questionnaire

The following subsections showcase the results from the user study conducted to map out the perceived gender bias and to answer the research question about perceived accuracy, realism, and appeal.

Figure 4.2: Participants' age.


Figure 4.3: Participants' video game habits.

Figure 4.4: Participants' consumption of 3D animated content.

bw: Basic word (lines 1 to 8)
bs: Basic sentence (lines 9 to 11)
as: Advanced sentence (lines 12 to 17, except 16)
atl: All the lines (1 to 17, except 16)

Table 4.1: Clarification of all the voice line advancement level acronyms.


        Accuracy  Realism  Appeal             Accuracy  Realism  Appeal
F1      3.595     3.486    3.270      M1      3.432     2.865    2.919
F2      3.838     3.703    3.432      M2      4.243     3.919    3.838
F3      4.243     4.081    4.000      M3      4.568     4.378    4.162
F4      2.622     2.568    2.324      M4      3.135     2.892    2.892
F5      3.135     2.946    2.946      M5      3.108     2.865    2.568
F6      4.378     4.270    4.081      M6      4.216     4.108    3.919
F7      3.108     2.946    2.838      M7      3.459     3.270    3.216
F8      2.811     2.784    2.757      M8      3.378     3.189    3.000
F9      4.000     3.811    3.676      M9      4.081     3.784    3.838
F10     2.973     2.703    2.838      M10     3.189     3.054    2.946
F11     3.459     3.216    3.378      M11     3.838     3.595    3.432
F12     3.622     3.027    3.189      M12     3.703     3.216    3.270
F13     3.757     3.135    3.324      M13     4.000     3.622    3.541
F14     2.946     2.108    2.297      M14     3.514     3.243    3.162
F15     3.135     2.622    2.568      M15     3.351     3.000    2.919
F17     3.189     2.757    2.784      M17     3.297     2.622    2.811
atl     3.426     3.135    3.106      atl     3.657     3.351    3.277

Table 4.2: Questionnaire results. Every unique animation's grade as an average of all the participants' votes in each category.

F       Accuracy  Realism  Appeal      M        Accuracy  Realism  Appeal
bw      3.466     3.348    3.206       1-8      3.693     3.436    3.314
bs      3.477     3.243    3.297       9-11     3.703     3.477    3.405
as      3.330     2.730    2.832       12-17    3.573     3.141    3.141
atl     3.426     3.135    3.106       1-17     3.657     3.351    3.277

Table 4.3: Questionnaire results. Instead of presenting the average score of each line, this table shows the average score of each advancement level in each category.

        Accuracy  Realism  Appeal
bw      3.579     3.392    3.260
bs      3.590     3.360    3.351
as      3.451     2.935    2.986
atl     3.541     3.243    3.192

Table 4.4: Average scores for all the advancement levels; both male- and female-voiced lines are included in this table.


Figure 4.5: The difference between male- and female-voiced lines in all the categories. Red means that the female-voiced line had the higher rating, and blue means that the male-voiced line received the higher score in the category.

Figure 4.6: The difference between male- and female-voiced lines based on their average score per complexity level. Red means that the female-voiced lines had the higher rating, and blue means that the male-voiced lines received the higher score in the category.


       Correct word count   Total word count   %
F1     1                    1                  100
M1     1                    1                  100
F2     1                    1                  100
M2     1                    1                  100
F3     1                    1                  100
M3     1                    1                  100
F4     1                    1                  100
M4     1                    1                  100
F5     1                    1                  100
M5     1                    1                  100
F6     1                    1                  100
M6     1                    1                  100
F7     1                    1                  100
M7     1                    1                  100
F8     1                    1                  100
M8     1                    1                  100
F9     3                    3                  100
M9     3                    3                  100
F10    3                    3                  100
M10    3                    3                  100
F11    5                    5                  100
M11    5                    5                  100
F12    8                    8                  100
M12    8                    8                  100
F13    9                    9                  100
M13    9                    9                  100
F14    11                   11                 100
M14    11                   11                 100
F15    11                   11                 100
M15    10                   11                 90.91
F17    24                   24                 100
M17    24                   24                 100

Table 4.5: Voice transcription accuracy.


Figure 4.7: Standard deviation based on the male and female voice lines' average scores in each category (chart columns: F/M Accuracy, F/M Realism, F/M Appeal).

#    Transcriptions, male voice
1    alpha
2    beta
3    gamma
4    breakfast
5    lunch
6    supper
7    gluttony
8    independence
9    alpha beta gamma
10   breakfast lunch supper
11   hello my name is alex
12   whoever is happy will make others happy too
13   i would rather die of passion than of boredom
14   you miss one hundred percent of the shots you don't take
15   life is really simple but we insist on making a complicated
17   give a man a fish and you feed him for a day teach a man to fish and you feed him for a lifetime

Table 4.6: Male voice transcriptions.


Figure 4.8: Transcribed voice lines converted to phonetics.


Figure 4.9: Unreadable characters in the voice lines for the timeline reader.


#    Transcriptions, female voice
1    alpha
2    beta
3    gamma
4    breakfast
5    lunch
6    supper
7    gluttony
8    independence
9    alpha beta gamma
10   breakfast lunch supper
11   hello my name is alex
12   whoever is happy will make others happy too
13   i would rather die of passion than of boredom
14   you miss one hundred percent of the shots you don't take
15   life is really simple but we insist on making it complicated
17   give a man a fish and you feed him for a day teach a man to fish and you feed him for a lifetime

Table 4.7: Female voice transcriptions.

4.1.1 Questionnaire demographic

The questionnaire received 37 participants before the test time ended. The participants were aged between 18 and 44, as shown in Figure 4.2, and more than 90 percent of all participants played video games weekly if not more often, as shown in Figure 4.3. More than 85 percent of the participants consumed 3D animated content monthly or more frequently, as shown in Figure 4.4. 24 participants identified as male, 12 participants identified as female, and one person identified as non-binary/another gender identity, as shown in Figure 4.1.

4.1.2 Results and analysis of the questionnaire

All data was collected and analyzed, and an average score was calculated for every category for each clip, as seen in Table 4.2. The average score of the different advancement levels was also calculated by making an average score for each advancement level and for all the combined lines. These calculations were done both separately for the male and female data and combined for both genders, and can be seen in Table 4.3 and Table 4.4. A difference calculation was made between the male and female versions of every clip and of the different advancement levels. A table showing the difference in average score between the male and female versions of every clip can be seen in Figure 4.5, and the difference between the male and female versions at the different advancement levels can be seen in Figure 4.6. If a square is blue, the male version of the clip had the higher average score in the category, and if the square is red, the female version of the clip had the higher score. The number inside the box shows how much higher the grade was than for the opposite gender.


When looking at the per-line differences in Figure 4.5, it appears that the male voice lines have a higher perceived score on the majority of the voice lines and in all of the categories, with a small number of exceptions, as in F1, F5, F6, F9 (Realism), and F17 (Realism). In most cases, the gap is between less than 0.1 and up to 0.4, with rare occasions having a higher gap. When looking at Figure 4.6, where the difference between the advancement levels is shown, it appears that the diversity of score gaps between the different voice lines evens out. The accuracy gap appears to be only about 0.2 at every advancement level and might not be a noticeable difference; therefore, the perceived accuracy between males and females is very similar. The realism and appeal seem to show a higher variance at different levels. This goes against the hypothesis mentioned before: the difference in accuracy between male- and female-voiced lines stays roughly the same, but the gap in realism and appeal increases with the advancement level, whereas the expected outcome was that the accuracy level would directly affect the level of realism and appeal. The difference is still consistently lower than 0.5 and should therefore not be rounded up to an entire grade point; because the difference stays between 0.1 and 0.4 in all the categories regardless of advancement level, the perceived levels of realism and appeal should still be seen as very similar. Figure 4.7 is a chart showing the standard deviation of all the voice line scores in the different categories. The spread and average values are close to each other, as seen in the visualization. Even though the perceived difference appears barely noticeable between the male- and female-voiced lines, ideally only about half of the male-voiced lines should have had a higher score. Why so many male-voiced lines had a higher score should be discussed.

Because the perceived difference between male- and female-voiced lines was slight, Table 4.4 can be used to describe the performance of the script in creating animations based on audio files regardless of the voice actor's gender. The accuracy of the lip-syncing script always appears to be around 3.5 with a margin of 0.1. The realism varies between 2.9 and 3.4, and the appeal between 3.0 and 3.4. The quality does decline when the advancement level of the dialogue increases. With these scores, it can be determined that the accuracy is high enough for the animated character to close and open the mouth on the correct occasions, but the visemes may appear wrong on some occasions. The realism is high enough that the participants feel the character's movements remind them of human movements, but they are not entirely human. The appeal level of the animations is not good, but the animations are not perceived as bad either, which is positive. These conclusions were made by comparing the scores with the grading instructions in Table 3.6.

4.2 Accuracy experiment

When going through the audio file transcriptions, every voice line had 100 % accuracy except for the male version of voice line 15, as can be seen in Table 4.5. The male voice had the word "it" transcribed as "a" instead, as can be seen in Table 4.6. The female voice transcriptions can be seen in Table 4.7, and the voice line script can be seen in Table 3.7. One extra character was added in the transcribed male voice line when converting the text to phonetics on line 15.


The phonetics in that situation would still create an almost identical animation because the same or nearly identical visemes would be applied; see Figure 4.8. The accuracy between male and female voices is almost identical in this lip-syncing script with the voice transcription model used in this test. It was noticed that the amount, order, and identity of the undefined characters were identical between the male and female vocal lines. There was one phonetic symbol that the script could not read in the timeline to decide what viseme should be presented. It was also noted that the phonetic-converted voice lines produced apostrophes and commas that the program could not do anything with, which can be seen in Figure 4.9. The female voice transcription was better than the male voice transcription, but only by a one-word difference. This goes against the gender bias hypothesis.

Chapter 5

Discussion

The following chapter discusses why there is no gender bias even though the male voice lines received a slightly higher average perceived rating, by comparing this information with the actual accuracy from the voice transcription component and discussing why this could have occurred. The chapter also covers feedback from participants on the animations and what improvements could increase the score.

5.1 Hypothesis vs. results

When comparing the male and female voice line grades, it appears that the voice transcription between the male and female voices was almost identical, and the transcription even performed better on the female voice lines, as shown in Table 4.5. This goes against the gender bias hypothesis. The male voices did have a slightly higher grading on the majority of the animations, as in the gender bias hypothesis. However, the gap was almost always lower than 0.5, with a few exceptions. Therefore, it was not too noticeable and is caused by factors other than a potential gender bias in the voice transcription component. The gender bias hypothesis claims that if there were a grading gap where the female voices had a lower perceived score in the questionnaire, it would be because the voice transcription performed worse with female voices. However, the accuracy experiment proved this not to be the case. Therefore, the null hypothesis should be labeled as the truth. There is, however, a pattern where the male voices get the highest score even though the gap is minimal, which should be discussed, as shown in Table 4.3 and Figure 4.5.

5.2 Discussing score gap reason

The questionnaire was created in Google Forms, whose randomization options had limitations that would affect the study negatively. Google Forms only supports randomization of the content inside a section, not randomization of the section order. This would mean that all videos would be in the same section and loaded on the page simultaneously, causing performance issues on some participants' computers depending on their specs. When randomizing the content of a section with the form's layout, there would be a high risk that a video and its related questions would not be next to each other. Because of this, a predetermined random order was created with a number generator.


#    First   Highest rating
1    m       m
2    m       m
3    m       m
4    m       m
5    m       f
6    m       f
7    f       m
8    m       m
9    m       Accuracy (M), Realism (F), Appeal (M)
10   m       m
11   f       m
12   f       m
13   m       m
14   f       m
15   f       m
17   m       Accuracy (M), Realism (F), Appeal (M)

Table 5.1: Which gender was shown first compared to which gender got the highest rating on the voice line.


After the questionnaire study, it was noted that almost exactly two-thirds of the voice lines showed their male version first to all participants, and two-thirds of the questions, across all categories, had a higher score on the male voice lines than on the female voice lines. This means that the participants could have given the female voice lines a lower score because they had already heard the same voice lines before. The score gap toward the males might possibly have been lower if the shuffled order of the clips had been different. Most of the time, the gap was lower than 0.5, and only between 0.1 and 0.3 in the majority of cases, and therefore only slightly different, but this would explain why so many of the male-voiced lines still had a slightly higher score; see Table 5.1, which presents data from Table 3.8 and Figure 4.5. In the categories where the female voices had the highest score, a male version of the voice line had always been played before. Two-thirds of all participants were also male, but the gender bias research question was never mentioned directly in the questionnaire, see Figure 4.1. Another reason the female voice lines received a lower score could be that the facial mesh used was perceived as more masculine than feminine, which could lower the realism and appeal scores, as shown in Figure 3.1.

5.3 Participation feedback

When all the animations had been evaluated in the questionnaire, a free-text section followed. This part is intended to give the participants an outlet for overall thoughts or feedback outside the grading system. Here are a few handpicked answers:

"Male voices felt more accurate. Longer sentences lead to less accuracy."

"Short words worked really well while longer phrases got messier."

"Good animations! I thought that the mouth moved a bit to fast sometimes."

"The teeth of the upper jaw tends to jump up and down."

"The lip movements mostly does look good. But the lack of jaw movement makesit hard to judge the realism sometimes. I’m sure that’ll be added later though."

"The teeth move with the lips, particularly the top row, which is unrealistic andunappealing. The lack of animation of the jaw and nasolabial folds is also a littlejarring and takes away from the realism of the animation. The overall speech-to-lipmovement is quite good and almost always matches the voice, however."

"The upper row of teeth shouldn’t be moving, for a more realistic result"

5.3.1 Poor blendshapes

This feedback hints that the aesthetic end of the animation might need a rework. Some participants seemed to experience that some of the viseme blend shapes were too unrealistic and that the upper jaw was moving when it should not.


The jaw under the mouth should probably also receive more movement. If the blend shapes were improved, the scores in all categories could have increased.

5.3.2 Longer sentences and co-articulation

The written feedback claims that the animation often looks very good in slow and simple sentences and one-word clips. However, when the more advanced and faster sentences are spoken, the animation's perceived quality drops a lot, partly in the speed of the mouth movement. Two possible solutions could be tried to fix this. The first is to increase the frame rate slightly, if the problem is that keyframes at 1/24-second intervals are not accurate enough; but if the issue still occurs at 48 keyframes per second, there is likely another problem instead. The other, much more likely explanation is that this script does not consider co-articulation, and one way to address that is to apply animation curves that make the mouth movement speed less linear. This is also mentioned in [19].
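As a rough sketch of that idea, the keyed viseme weights could be given non-linear tangents with pymel after the keyframes have been set. The attribute names are placeholders, and this is not part of the current script; it only illustrates how animation curves could ease the mouth in and out of each shape.

    import pymel.core as pm

    def ease_viseme_keys(weight_attributes):
        """Give every keyed viseme weight flat tangents so the mouth eases in and
        out of each shape instead of interpolating linearly between keyframes."""
        for attr in weight_attributes:
            # keyTangent edits the animation curve that drives the keyed attribute
            pm.keyTangent(attr, inTangentType="flat", outTangentType="flat")

    # Hypothetical usage on two placeholder blend shape targets
    ease_viseme_keys(["faceBlends.viseme_AA", "faceBlends.viseme_PP"])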

5.4 Script based accuracy

Table 4.5 lists the number of words each written voice line contains and compares it to the number of words that got transcribed correctly. The female voice had no errors in the transcription, but the male voice had one word transcribed wrongly, making just that one line about 9 % less accurate. There is no noticeable difference between the male and female transcriptions, and they both worked very well. This script used Vosk's latest and largest English transcription model available, with a size of 1.8 GB of machine-learned voice transcription data. There is also a more lightweight model of only 40 MB designed for cellphone apps, which could cause a lower accuracy level [4].

Chapter 6

Conclusions and future work

This chapter summarises the thesis to explain what has been achieved and how, and ends with discussions on what could be done in future work.

6.1 Concluded summary

A script has been created, written in Python, that reads audio files and converts the audio data into instructions that make a face mesh mimic the audio file's dialogue. The chosen input method for the script was to transcribe the dialogue to text where each word has a start and end timestamp. The transcribed voice line gets rewritten to phonetics to make each sound in the word match a viseme. Afterward, the timing of every phonetic symbol is calculated based on the word timestamps to estimate when the phoneme can be heard in the animation, and it is placed in a timeline. Each phonetic symbol in the timeline is an instruction telling the script what viseme should be shown by toggling the corresponding blend shape at that time. Two experiments were conducted to evaluate the script's performance in creating animations and to detect a potential gender bias.

In one experiment, participants graded the script's produced animations through a questionnaire to calculate a perceived score in accuracy, appeal, and realism. This experiment was conducted to see if there was a perceived quality gap between the male and female voice animations and to compare the quality of the animations depending on how advanced the voice line was.

The other experiment tried to determine the voice transcription accuracy by comparing the transcribed text with the voice actor script to see if there were words the transcription component failed to understand. This test also checked whether the conversion from words to phonetics produced any characters that the script could not translate to a viseme. When the data had been produced, the male and female versions of all the voice lines were compared to check whether both received the correct word transcription. If there was a transcription error, the phonetically converted voice line was examined to see whether the erroneous phonetics would still produce visemes similar enough for the error to go unnoticed. The reason for conducting this experiment was to see if the voice recognition had a gender bias. Previous research has mentioned that some voice recognition and transcription tools have had problems understanding some words spoken by females [10] [13]. If the perceived overall quality were lower for the female-voiced animations, this test would show whether a gender bias in the voice transcription component is the reason.



After running the benchmarking experiments on the lip-syncing script, there seems to be barely any sign of a clear gender bias in the script's voice transcription/recognition that would cause the female voices to be perceived as worse. It was also concluded that the script's perceived score was decent but not perfect. The script could be improved, and receive a higher rating, by adding animation curves between the blend shape toggling and by improving the graphical quality of the blend shapes. The null hypothesis could therefore not be rejected.

6.2 Future work suggestions based on RQ answers

The script's transcription accuracy was similar for the male and female voices. However, there is a slight but noticeable gap between the animations' perceived accuracy, appeal, and realism, so some other factor appears to cause the quality gap between the two voice actors. An experiment comparing different speaking styles, e.g., phrasing and accent, could therefore be conducted. As noted earlier, this script does not include a dedicated solution for co-articulation, as other solutions have done in the past [19].

This script chose to transcribe to words and afterwards convert the words to phonetics, because direct phoneme recognition is claimed to give disappointing results [7]. The current method predicts when each phoneme is pronounced by calculating a phonemes-per-second rate for every word being said. If every phoneme could be timestamped directly, instead of only every word, the co-articulation problem might improve automatically; this could be tested in a future experiment.

6.3 Components to improve

Users found some of the visemes unrealistic because the upper jaw moved together with the lower jaw. Adding movement to the cheekbones, jaw, and chin could improve this. It was also mentioned that more facial movement should be added instead of just the lips.

The current script's phonetics conversion package only supports converting English words to phonetics [17]. If a package or tool that supports more languages were used, the script's ability to create lip-sync animations in other languages would only be limited by the number of language models Vosk has available. The tooling for creating a Vosk transcription model is open source, which means anyone can train a model [4].

6.4 Ideal testing environment

If the user study were redone with unlimited resources, the following changes would likely improve the result. The male voice actor had a little experience in voice acting, while the female voice actress had none; voice actors with the same amount of experience would probably make the rating between the genders fairer. The voice actors also used different microphones, and it would be ideal if they used the same microphone so that all the voice lines were recorded with identical configurations; the male voice lines were sometimes slightly louder, probably because of the different microphones. An even 50/50 split between male and female participants would also be ideal, to see whether the small perceived advantage for the male voice lines, which scored higher on two-thirds of the lines, would remain.

References

[1] I. P. Alphabet, "Ipa symbols with html codes," 2022, (accessed: 04.30.2022). [Online]. Available: https://www.internationalphoneticalphabet.org/ipa-charts/ipa-symbols-with-html-codes/

[2] ——, "Ipa chart with sounds," 2022, (accessed: 04.30.2022). [Online]. Available: https://www.internationalphoneticalphabet.org/ipa-sounds/ipa-chart-with-sounds/

[3] Alphacephei, "Install vosk," 2022, (accessed: 04.27.2022). [Online]. Available: https://alphacephei.com/vosk/install

[4] ——, "Model list," 2022. [Online]. Available: https://alphacephei.com/vosk/models

[5] A. Antipov, "Human teeth," 2022, (accessed: 04.29.2022). [Online]. Available: https://sketchfab.com/3d-models/human-teeth-c4c569f0e08948e2a572007a7a5726f2

[6] Y. Chai, Y. Weng, L. Wang, and K. Zhou, "Speech-driven facial animation with spectral gathering and temporal attention," Frontiers of Computer Science, vol. 16, no. 3, 2022. [Online]. Available: https://doi.org/10.1007/s11704-020-0133-7

[7] cmusphinx, "Phoneme recognition (caveat emptor)," 2022, (accessed: 04.27.2022). [Online]. Available: https://cmusphinx.github.io/wiki/phonemerecognition/

[8] creativecommons, "Attribution 4.0 international (cc by 4.0)," 2022, (accessed: 04.29.2022). [Online]. Available: https://creativecommons.org/licenses/by/4.0/

[9] D. N. (Dimid), "Speech recognition with timestamps," 2022, (accessed: 04.27.2022). [Online]. Available: https://towardsdatascience.com/speech-recognition-with-timestamps-934ede4234b2

[10] J. v. d. Kamp and V. Sundstedt, "Gaze and voice controlled drawing." ACM, 2011. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:bth-7550

[11] J. Liu, M. You, C. Chen, and M. Song, "Real-time speech-driven animation of expressive talking faces," International Journal of General Systems, vol. 40, no. 4, pp. 439–455, 2011, cited by: 4. [Online]. Available: http://dx.doi.org/10.1080/03081079.2010.544896

[12] Meta, "Viseme references," 2022, (accessed: 04.29.2022). [Online]. Available: https://developer.oculus.com/documentation/unity/audio-ovrlipsync-viseme-reference/


[13] M. Negrão and P. Domingues, "Speechtotext: An open-source software for automatic detection and transcription of voice recordings in digital forensics," Forensic Science International: Digital Investigation, vol. 38, 2021, cited by: 1. [Online]. Available: https://doi.org/10.1016/j.fsidi.2021.301223

[14] J. O'Donovan, J. Ward, S. Hodgins, and V. Sundstedt, "Rabbit run: Gaze and voice based game interaction," in The 9th Irish Eurographics Workshop, Y. Morvan and V. Sundstedt, Eds., 2009.

[15] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, "Synthesizing obama: Learning lip sync from audio," ACM Transactions on Graphics, vol. 36, no. 4, 2017, cited by: 359. [Online]. Available: http://dx.doi.org/10.1145/3072959.3073640

[16] G. Tian, Y. Yuan, and Y. Liu, "Audio2face: Generating speech/face animation from single audio with attention-based bidirectional lstm networks," in Proceedings - 2019 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2019, 2019, pp. 366–371, cited by: 12. [Online]. Available: http://dx.doi.org/10.1109/ICMEW.2019.00069

[17] M. M. C. Timvancann, "Project description (eng-to-ipa 0.0.2)," 2022, (accessed: 04.27.2022). [Online]. Available: https://pypi.org/project/eng-to-ipa/

[18] W. Tsar, "Face mesh - wael tsar," 2022, (accessed: 04.29.2022). [Online]. Available: https://sketchfab.com/3d-models/human-teeth-c4c569f0e08948e2a572007a7a5726f2

[19] Y. Xu, A. W. Feng, S. Marsella, and A. Shapiro, "A practical and configurable lip sync method for games," in Proceedings - Motion in Games 2013, MIG 2013, 2013, pp. 109–118, cited by: 24. [Online]. Available: https://doi.org/10.1145/2522628.2522904

[20] Y. Zhou, Z. Xu, C. Landreth, E. Kalogerakis, S. Maji, and K. Singh, "Visemenet: Audio-driven animator-centric speech animation," ACM Trans. Graph., vol. 37, no. 4, jul 2018. [Online]. Available: https://doi.org/10.1145/3197517.3201292

Appendix A

Supplemental Information



Figure A.1: Questionnaire intro page.


Figure A.2: Demographics: Age.

Figure A.3: Demographics: Gender.

Figure A.4: Demographics: Video game habits.


Figure A.5: CGI/3D graphics consumption habits.


Figure A.6: Animation evaluation page.


Figure A.7: Input for participants to mention their overall thoughts.
