Synthesis of Child Speech With HMM Adaptation and Voice Conversion

Oliver Watts, Junichi Yamagishi, Member, IEEE, Simon King, Senior Member, IEEE, and Kay Berkling, Senior Member, IEEE

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 5, JULY 2010

Adviser: Dr. Yeou-Jiunn Chen
Presenter: Ming-Da Lee

Outline

Introduction
Child speech data
The systems
Evaluation
Conclusion
Reference

Introduction

The synthesis of child speech presents special difficulties for data-driven speech synthesis systems.
  The type of child speech corpus typically available
Two types:
  Unit selection synthesis
  Statistical parametric approaches


Unit selection synthesis
  To produce waveforms for arbitrary novel utterances.
  To reuse existing sections of waveform from a database.
  If the database is imperfect:
    A direct impact on the quality of the speech synthesis
    Speaker inconsistency, background noise, and poor phonetic coverage.


Statistical parametric approaches to speech synthesis
  Hidden Markov model (HMM)-based speech synthesis

Base HMMs
  To be trained on cleanly recorded data
  Rich in phonetic contexts
  High-quality speech
The adaptation data is noisy and sparse


Adaptation techniques
  Data-driven synthesizer of child speech
This work, with fuller analysis
  Applies HMM adaptation techniques and techniques from voice conversion to adapt an existing synthesizer to a child speaker.

Child speech data


Type-Token Ratios (TTR)
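
As a rough illustration of the measure named above, the type-token ratio is the number of distinct word types divided by the total number of word tokens. The sketch below assumes a simple lowercase, letters-and-apostrophes tokenization; the slide does not state how the paper tokenized its corpora, and `type_token_ratio` and the sample sentence are hypothetical.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word types divided by total word tokens."""
    # Assumption: a token is a lowercase run of letters or apostrophes.
    # The tokenization used for the actual corpora is not given here.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no tokens")
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 11 tokens, 8 distinct types.
sample = "the cat sat on the mat and the dog sat too"
print(round(type_token_ratio(sample), 3))  # 0.727
```

A lower ratio means more word repetition. Because the ratio depends on text length, comparisons are usually made over samples of equal size.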



The systems

Speaker-Dependent Systems (A, C, E)

Speaker-Adaptive Systems (B, D, F): CMU-ARCTIC

Systems M, N, and O were all designed to be compared with system L.

Systems Q, R, and S were all designed to be compared with system P.

Evaluation

We used sentences from the corpus for this part of the test. The listeners were 48 paid native speakers of English between the ages of 18 and 25.





Results of pairwise Wilcoxon signed rank tests between systems; a black square shows a significant difference between systems with α = 0.01 (with Bonferroni correction).
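
The pairwise testing procedure in this caption can be sketched as follows. This is a minimal pure-Python illustration, not the authors' analysis code: it assumes paired listener ratings per system, uses the normal approximation to the Wilcoxon signed-rank statistic without tie or continuity corrections, and divides α by the number of system pairs for the Bonferroni correction. The function names and the ratings are hypothetical.

```python
import math
from itertools import combinations


def wilcoxon_signed_rank_p(x, y):
    """Two-sided p-value for paired samples (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction (assumption)
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def pairwise_significance(scores, alpha=0.01):
    """Map each system pair to True if it differs at the Bonferroni-corrected level."""
    pairs = list(combinations(sorted(scores), 2))
    if not pairs:
        return {}
    threshold = alpha / len(pairs)  # Bonferroni: split alpha across all comparisons
    return {
        (a, b): wilcoxon_signed_rank_p(scores[a], scores[b]) < threshold
        for a, b in pairs
    }


# Hypothetical paired ratings from 20 listeners for three systems.
ratings = {"A": [5] * 20, "B": [1] * 20, "C": [5] * 20}
for pair, significant in pairwise_significance(ratings).items():
    print(pair, "significant" if significant else "not significant")
```

Each `True` entry corresponds to a black square in the figure. For small samples or heavily tied ratings, an exact test from a statistics library would be a safer choice than this approximation.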


Results of XAB test for speaker individuality; comparisons among systems F, I, J, and K. Vertical lines show 95% confidence intervals (with Bonferroni correction).


Results of XAB test for speaker individuality; comparisons among systems L–S. Vertical lines show 95% confidence intervals (with Bonferroni correction).
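
One way to obtain Bonferroni-corrected confidence intervals like those in the XAB figures is sketched below. The slides do not say which interval method the paper used, so this assumes a normal-approximation (Wald) interval on the preference proportion, with the error rate divided by the number of comparisons. The function name `bonferroni_ci` and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def bonferroni_ci(successes, trials, n_comparisons, level=0.95):
    """Wald interval for a proportion, widened for multiple comparisons."""
    if trials <= 0 or n_comparisons <= 0:
        raise ValueError("trials and n_comparisons must be positive")
    # Bonferroni: share the overall error rate across all comparisons.
    alpha = (1 - level) / n_comparisons
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: system X preferred in 30 of 48 XAB trials,
# reported alongside 6 comparisons in the same figure.
low, high = bonferroni_ci(30, 48, n_comparisons=6)
print(f"preference = {30 / 48:.3f}, interval = [{low:.3f}, {high:.3f}]")
```

If the interval for a comparison excludes 0.5, listeners showed a significant preference for one of the two systems at the corrected level.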

Conclusion

When the adaptation data was restricted to 15 min, there was no significant preference for either HMM adaptation or voice conversion methods.

HMM adaptation was preferred in every case when the full target speaker corpus was used.
  This is because relatively large amounts of data enable extensive use of the decision tree, which incorporates high-level linguistic and prosodic information in speaker adaptation.

Thank you

Reference

Junichi Yamagishi, Member, IEEE, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Member, IEEE, Keiichi Tokuda, Member, IEEE, Simon King, Senior Member, IEEE, and Steve Renals, Member, IEEE, "Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis," IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 6, AUGUST 2009
