Modeling the creaky excitation for parametric speech

Modeling the creaky excitationfor parametric speech synthesis.

1Thomas Drugman, 2John Kane, 2Christer Gobl

September 11th, 2012Interspeech

Portland, Oregon, USA

1University of Mons, Belgium2Trinity College Dublin, Ireland

1 / 27

Creaky voice - examples

TTS corpora examples

American Male

Finnish female

Finnish Male

Conversational speech examples

Japanese female

American female

American Male

2 / 27

Creaky voice in speech

Phonetic contrast

e.g., Jalapa Mazatec

Phrase/sentence/turn boundaries

Commonly in American English, Finnish etc.

Interactive speech

Turn-takingHesitationsExpression of affective statesStylistic device

3 / 27

Creaky voice - acoustic characteristics

4 / 27

Creaky voice - acoustic characteristics

5 / 27

Problem statement

Unique acoustic characteristics of creak poorly modelled instandard vocoders

Silen et al. (2009) - improved robustness of f0 and voicingdecision

Our Aim: Provide a method for modelling the creakyexcitation to improve the timbre of creak in parametricsynthesis.

6 / 27

Speech data

American male (BDL) and Finnish male (MV)

100 sentences containing creak

7 / 27

Manual annotation

A rough quality with the sensation of repeating impulses- Ishi et al. (2008)

8 / 27

Glottal closure instants (GCIs)

Newly developed SE-VQ algorithm - Kane & Gobl, In Press

0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7−0.4−0.2

00.20.40.6

Am

plitu

de

Speech waveform

SEDREAMS − GCI

0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7−0.5

0

0.5

Am

plitu

de

Resonator output

0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7

0

0.5

1

Time (seconds)

Am

plitu

de

DEGG

DEGG − GCI

9 / 27

The deterministic plusstochastic model (DSM)

The Deterministic plus Stochastic Modelof the Residual Signal and its Applications

-Drugman & Dutoit (2012), IEEE TASLP

10 / 27

DSM - residual excitation

Univ ersité de Mons

Residual excitation

4

11 / 27

DSM - Residual frames


The Deterministic plus Stochastic

Model

7

MGC Analys is

Inverse Filtering

GCI Estimation

PS Window ing

Speech Database

GCI positions

Dataset of PS residual frames

Residual signals

12 / 27

DSM - Deterministic modelling


The Deterministic Component

8

Pitch Norm alizatio n

F0* Dataset of

PS residual frames

Energy Norm alizatio n

Dataset for the Deterministic

Modeling

13 / 27

DSM - vocoder


The DSM vocoder

5

Deterministic component of the excitation

Stochastic component of the excitation

Filter

14 / 27

Extended DSM for creaky voice

15 / 27

DSM (creak) - Fundamental period/opening phase

100 150 200 250 300 350 4000

50

100

150

200

250

300

350

400

Fundamental period (samples)

Op

enin

g p

erio

d (

sam

ple

s)

200 250 300 350 400 450 500 5500

50

100

150

200

250

300

350

400

450

500

Fundamental period (samples)

Op

enin

g p

erio

d (

sam

ple

s)

16 / 27

DSM (creak) - Excitation modelling

Separate residual datasets for opening phase (secondary peak=> GCI) and closed phase (GCI => secondary peak)

Principal component analysis of each dataset separately,excitation model combining first eigenvectors for deterministiccomponent.

Energy envelope also derived for the two datasets separately.

17 / 27

DSM (creak) - Data-driven excitation signal

0 100 200 300 400−0.1

0

0.1

0.2

0.3

0.4

Time (samples)

Am

plitu

de

0 100 200 300 4000

0.05

0.1

0.15

0.2

0.25

Time (samples)

Am

plitu

de

18 / 27

DSM (creak) - Vocoder

19 / 27

Evaluation

20 / 27

Experimental setup

Subjective evaluation with 22 participants.

Copy-synthesis of short utterances by the American andFinnish speaker using the standard DSM vocoder and theproposed method.

ABX testOriginal utterance (X) and the two copy synthesis versions (A& B). Select most like original

Comparative Mean Opinion Score (CMOS) testCopy synthesis by both vocoders - signal preference on gradual7 point CMOS scale.

21 / 27

Results - ABX

22 / 27

Results - Comparative Mean Opinion Score (CMOS)

23 / 27

Results - Samples

American Male

1 Original standard HTS vocoder DSM vocoder DSM-creak



Finnish Male




24 / 27

Ongoing/future research directions

Automate creak segmentation (see our poster at specialsession - glottal source processing!)

Prediction of creaky regions from contextual features (e.g.,phoneme, word stress, position in sentence, prosodic contextetc.)

Transformation of speakers voice characteristics.

25 / 27

Acknowledgements

This work was supported by the Science Foundation Ireland,Grant 07 / CE / I 1142 (Centre for Next GenerationLocalisation, www.cngl.ie) and Grant 09 / IN.1 / I 2631(FASTNET).

26 / 27

www.cngl.ie

Thank you!

27 / 27

Documents

Modeling the creaky excitation for parametric speech