Upload
others
View
17
Download
0
Embed Size (px)
Citation preview
Modeling the creaky excitationfor parametric speech synthesis.
1Thomas Drugman, 2John Kane, 2Christer Gobl
September 11th, 2012Interspeech
Portland, Oregon, USA
1University of Mons, Belgium2Trinity College Dublin, Ireland
1 / 27
Creaky voice - examples
TTS corpora examples
American Male
Finnish female
Finnish Male
Conversational speech examples
Japanese female
American female
American Male
2 / 27
Creaky voice in speech
Phonetic contrast
e.g., Jalapa Mazatec
Phrase/sentence/turn boundaries
Commonly in American English, Finnish etc.
Interactive speech
Turn-takingHesitationsExpression of affective statesStylistic device
3 / 27
Creaky voice - acoustic characteristics
4 / 27
Creaky voice - acoustic characteristics
5 / 27
Problem statement
Unique acoustic characteristics of creak poorly modelled instandard vocoders
Silen et al. (2009) - improved robustness of f0 and voicingdecision
Our Aim: Provide a method for modelling the creakyexcitation to improve the timbre of creak in parametricsynthesis.
6 / 27
Speech data
American male (BDL) and Finnish male (MV)
100 sentences containing creak
7 / 27
Manual annotation
A rough quality with the sensation of repeating impulses- Ishi et al. (2008)
8 / 27
Glottal closure instants (GCIs)
Newly developed SE-VQ algorithm - Kane & Gobl, In Press
0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7−0.4−0.2
00.20.40.6
Am
plitu
de
Speech waveform
SEDREAMS − GCI
0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7−0.5
0
0.5
Am
plitu
de
Resonator output
0.56 0.58 0.6 0.62 0.64 0.66 0.68 0.7
0
0.5
1
Time (seconds)
Am
plitu
de
DEGG
DEGG − GCI
9 / 27
The deterministic plusstochastic model (DSM)
The Deterministic plus Stochastic Modelof the Residual Signal and its Applications
-Drugman & Dutoit (2012), IEEE TASLP
10 / 27
DSM - residual excitation
Univ ersité de Mons
Residual excitation
4
11 / 27
DSM - Residual frames
Univ ersité de Mons
The Deterministic plus Stochastic
Model
7
MGC Analys is
Inverse Filtering
GCI Estimation
PS Window ing
Speech Database
GCI positions
Dataset of PS residual frames
Residual signals
12 / 27
DSM - Deterministic modelling
Univ ersité de Mons
The Deterministic Component
8
Pitch Norm alizatio n
F0* Dataset of
PS residual frames
Energy Norm alizatio n
Dataset for the Deterministic
Modeling
13 / 27
DSM - vocoder
Univ ersité de Mons
The DSM vocoder
5
Deterministic component of the excitation
Stochastic component of the excitation
Filter
14 / 27
Extended DSM for creaky voice
15 / 27
DSM (creak) - Fundamental period/opening phase
100 150 200 250 300 350 4000
50
100
150
200
250
300
350
400
Fundamental period (samples)
Op
enin
g p
erio
d (
sam
ple
s)
200 250 300 350 400 450 500 5500
50
100
150
200
250
300
350
400
450
500
Fundamental period (samples)
Op
enin
g p
erio
d (
sam
ple
s)
16 / 27
DSM (creak) - Excitation modelling
Separate residual datasets for opening phase (secondary peak=> GCI) and closed phase (GCI => secondary peak)
Principal component analysis of each dataset separately,excitation model combining first eigenvectors for deterministiccomponent.
Energy envelope also derived for the two datasets separately.
17 / 27
DSM (creak) - Data-driven excitation signal
0 100 200 300 400−0.1
0
0.1
0.2
0.3
0.4
Time (samples)
Am
plitu
de
0 100 200 300 4000
0.05
0.1
0.15
0.2
0.25
Time (samples)
Am
plitu
de
18 / 27
DSM (creak) - Vocoder
19 / 27
Evaluation
20 / 27
Experimental setup
Subjective evaluation with 22 participants.
Copy-synthesis of short utterances by the American andFinnish speaker using the standard DSM vocoder and theproposed method.
ABX testOriginal utterance (X) and the two copy synthesis versions (A& B). Select most like original
Comparative Mean Opinion Score (CMOS) testCopy synthesis by both vocoders - signal preference on gradual7 point CMOS scale.
21 / 27
Results - ABX
22 / 27
Results - Comparative Mean Opinion Score (CMOS)
23 / 27
Results - Samples
American Male
1 Original standard HTS vocoder DSM vocoder DSM-creak
2 Original standard HTS vocoder DSM vocoder DSM-creak
3 Original standard HTS vocoder DSM vocoder DSM-creak
Finnish Male
1 Original standard HTS vocoder DSM vocoder DSM-creak
2 Original standard HTS vocoder DSM vocoder DSM-creak
3 Original standard HTS vocoder DSM vocoder DSM-creak
24 / 27
Ongoing/future research directions
Automate creak segmentation (see our poster at specialsession - glottal source processing!)
Prediction of creaky regions from contextual features (e.g.,phoneme, word stress, position in sentence, prosodic contextetc.)
Transformation of speakers voice characteristics.
25 / 27
Acknowledgements
This work was supported by the Science Foundation Ireland,Grant 07 / CE / I 1142 (Centre for Next GenerationLocalisation, www.cngl.ie) and Grant 09 / IN.1 / I 2631(FASTNET).
26 / 27
Thank you!
27 / 27