Synthesis of Child Speech With HMM Adaptation and Voice Conversion

Oliver Watts, Junichi Yamagishi, Member, IEEE, Simon King, Senior Member, IEEE, and Kay Berkling, Senior Member, IEEE

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 5, JULY 2010

Adviser: Dr. Yeou-Jiunn Chen

Presenter: Ming-Da Lee



Page 2

Outline

- Introduction
- Child speech data
- The systems
- Evaluation
- Conclusion
- Reference

Page 3

Introduction

- The synthesis of child speech presents special difficulties for data-driven speech synthesis systems.
- The difficulties stem from the type of child speech corpus typically available.
- Two types of data-driven synthesis:
  - Unit selection synthesis
  - Statistical parametric approaches

Page 4

Introduction

Unit selection synthesis

- Produces waveforms for arbitrary novel utterances.
- Does so by reusing existing sections of waveform from a database.
- If the database is imperfect, this has a direct impact on the quality of the synthesized speech.
- Typical imperfections: speaker inconsistency, background noise, and poor phonetic coverage.
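The reuse of database waveform sections can be sketched as a lowest-cost path search. This is a toy illustration only, not the system described in the paper: the `Unit` tuple, the two cost functions, and the pitch values are all hypothetical, and a real unit selection system uses far richer target and join costs.

```python
# Hypothetical toy example: names, costs, and data are illustrative only.
from typing import List, Tuple

Unit = Tuple[str, float]  # (phone label, pitch in Hz) of one database unit


def target_cost(wanted_phone: str, unit: Unit) -> float:
    """0 if the unit carries the wanted phone label, a large penalty otherwise."""
    return 0.0 if unit[0] == wanted_phone else 100.0


def join_cost(left: Unit, right: Unit) -> float:
    """Penalise pitch discontinuities at the concatenation point."""
    return abs(left[1] - right[1]) / 10.0


def select_units(targets: List[str], database: List[Unit]) -> List[Unit]:
    """Viterbi search for the unit sequence with the lowest total cost."""
    # best[i][j] = (cost of the best path ending in database[j] at position i,
    #               index of the previous unit on that path)
    best = [[(target_cost(targets[0], u), -1) for u in database]]
    for phone in targets[1:]:
        row = []
        for u in database:
            cost, prev = min(
                (best[-1][k][0] + join_cost(database[k], u), k)
                for k in range(len(database))
            )
            row.append((cost + target_cost(phone, u), prev))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(database)), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(database[j])
        j = best[i][j][1]
    return path[::-1]


database = [("k", 200.0), ("ae", 210.0), ("ae", 300.0), ("t", 205.0)]
chosen = select_units(["k", "ae", "t"], database)
print(chosen)
```

In this sketch, poor phonetic coverage shows up directly: asking for a phone that is absent from `database` forces the search to pick a mismatched unit and pay the large target penalty.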

Page 5

Introduction

- Statistical parametric approaches to speech synthesis
  - Hidden Markov model (HMM)-based speech synthesis

Page 6

Introduction

Base HMMs

- Trained on cleanly recorded data
  - Rich in phonetic contexts
  - High-quality speech
- The adaptation data is noisy and sparse

Page 7

Introduction

- Adaptation techniques
  - Data-driven synthesizer of child speech
- This work, with fuller analysis
  - HMM adaptation techniques and techniques from voice conversion are used to adapt an existing synthesizer to a child speaker.

Page 8

Child speech data

Page 9

Child speech data

Type-Token Ratios (TTR)
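As a reminder of what the measure computes, here is a minimal sketch of a type-token ratio: the number of distinct items (types) divided by the total number of items (tokens). The function name and the word-level example are assumptions for illustration; the slide does not say which units (words, phones, or context-dependent units) the ratios were computed over.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by the total number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Hypothetical example: 5 distinct words among 6 tokens.
words = "the cat sat on the mat".split()
print(type_token_ratio(words))
```

A ratio near 1 means almost every token is a new type, while a low ratio means the same items recur often.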

Page 10

Child speech data

Page 11

Child speech data

Page 12

The systems

Page 13

The systems

- Speaker-dependent systems (A, C, E)
- Speaker-adaptive systems (B, D, F): CMU-ARCTIC
- Systems M, N, and O were all designed to be compared with system L.
- Systems Q, R, and S were all designed to be compared with system P.

Page 14

Evaluation

- We used sentences from the corpus for this part of the test.
- 48 paid listeners took part, all native speakers of English between the ages of 18 and 25.

Page 15

Evaluation

Page 16

Evaluation

Page 17

Evaluation

Page 18

Evaluation

Results of pairwise Wilcoxon signed rank tests between systems; a black square shows a significant difference between systems with α = 0.01 (with Bonferroni correction).
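The Bonferroni correction mentioned in the caption can be sketched as follows: the overall α is divided by the number of pairwise comparisons, and a pair counts as significantly different only if its p-value falls below that stricter threshold. The set of six system labels and the helper names are assumptions for illustration; the p-values themselves would come from a signed-rank test routine, which is not reimplemented here.

```python
from itertools import combinations


def bonferroni_threshold(systems, alpha=0.01):
    """Return all unordered system pairs and the per-comparison threshold."""
    pairs = list(combinations(systems, 2))
    return pairs, alpha / len(pairs)


def is_significant(p_value, threshold):
    """A pair differs significantly only if p is below the corrected threshold."""
    return p_value < threshold


# Hypothetical set of six systems: 6 * 5 / 2 = 15 pairwise comparisons.
pairs, threshold = bonferroni_threshold(["A", "B", "C", "D", "E", "F"])
print(len(pairs), threshold)
```

The more systems are compared, the smaller the per-pair threshold becomes, which guards against spurious differences when many tests are run at once.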

Page 19

Evaluation

Results of XAB test for speaker individuality; comparisons among systems F, I, J, and K. Vertical lines show 95% confidence intervals (with Bonferroni correction).

Page 20

Evaluation

Results of XAB test for speaker individuality; comparisons among systems L–S. Vertical lines show 95% confidence intervals (with Bonferroni correction).
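One way to read the vertical lines in the XAB figures is as Bonferroni-adjusted confidence intervals on a preference proportion. The sketch below assumes a normal-approximation interval, which may differ from the method actually used in the paper; the function name, the counts, and the number of comparisons are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def preference_interval(wins, trials, n_comparisons=1, level=0.95):
    """Normal-approximation interval for a preference proportion.

    The error rate (1 - level) is split across n_comparisons (Bonferroni),
    so each individual interval becomes wider.
    """
    alpha = (1.0 - level) / n_comparisons
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    p = wins / trials
    half_width = z * sqrt(p * (1.0 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts: system A judged closer to X in 60 of 100 trials.
plain = preference_interval(60, 100)
corrected = preference_interval(60, 100, n_comparisons=3)
print(plain, corrected)
```

If the corrected interval for a pair still includes 0.5, listeners showed no significant preference between the two systems in that comparison.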

Page 21

Conclusion

- When the adaptation data was restricted to 15 min, there was no significant preference for either HMM adaptation or voice conversion methods.

Page 22

- Using the full target speaker corpus, HMM adaptation was preferred in every case.
  - This is because relatively large amounts of data enable extensive use of the decision tree.
  - The decision tree incorporates high-level linguistic and prosodic information in speaker adaptation.
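The link between data quantity and use of the decision tree can be illustrated with a toy sketch. It assumes that a tree node receives its own adaptation transform only when enough adaptation data reaches it, and otherwise shares its parent's transform. The `Node` class, the tree questions, the frame counts, and the threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    question: str   # linguistic or prosodic question defining the node
    frames: float   # amount of adaptation data reaching the node
    children: List["Node"] = field(default_factory=list)


def nodes_with_own_transform(node, threshold, is_root=True):
    """List the nodes that have enough data to estimate their own transform.

    The root always gets a global transform; any other node below the
    threshold falls back to its parent's transform and is not descended into.
    """
    chosen = []
    if is_root or node.frames >= threshold:
        chosen.append(node.question)
        for child in node.children:
            chosen.extend(nodes_with_own_transform(child, threshold, False))
    return chosen


def toy_tree(total_frames):
    """Hypothetical tree; the fractions of data per node are made up."""
    return Node("all sounds", total_frames, [
        Node("vowel?", 0.6 * total_frames, [
            Node("stressed vowel?", 0.25 * total_frames),
            Node("unstressed vowel?", 0.35 * total_frames),
        ]),
        Node("consonant?", 0.4 * total_frames),
    ])


sparse = nodes_with_own_transform(toy_tree(1500), threshold=1000)
plentiful = nodes_with_own_transform(toy_tree(20000), threshold=1000)
print(sparse)
print(plentiful)
```

With little data only the global transform is estimated, whereas with more data the finer, linguistically informed nodes each receive their own transform.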

Page 23

Thank you

Page 24

Reference

Junichi Yamagishi, Member, IEEE, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Member, IEEE, Keiichi Tokuda, Member, IEEE, Simon King, Senior Member, IEEE, and Steve Renals, Member, IEEE, "Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis," IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 6, AUGUST 2009