Toward a high-quality singing synthesizer with vocal texture control
Hui-Ling Lu
Center for Computer Research in Music and Acoustics (CCRMA)
Stanford University, Stanford, CA 94305, USA
Score-to-Singing system
[Block diagram: Score, Lyrics and Singing style → Rule system → Phoneme, F0, Sound level, Duration, Vibrato → Sound synthesis (using a Parametric database) → Singing voice]
• Rule system: lyrics-to-phoneme; musical rules
• Sound synthesis: acoustic rendering; co-articulation rules
General sound synthesis approaches
Approaches: physical modeling, source-filter model, spectral modeling
Physical modeling
• Pros: flexible/intuitive control; expressive; co-articulation easy
• Cons: analysis/re-synthesis difficult; invasive measurements
Spectral modeling
• Pros: analysis/re-synthesis easy
• Cons: less expressive; co-articulation difficult
Contributions
A pseudo-physical model for singing voice synthesis which
• is an approximate physical model.
• can generate high-quality non-nasal singing voice.
• has analysis/re-synthesis ability.
• is computationally affordable.
• provides flexible control of vocal textures.
An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
Outline
• Human voice production system
• Synthesis model
• Analysis procedure
• Vocal texture parametric model
• Vocal texture control demo
• Contributions and future directions
The human voice production system
[Figure: the vocal apparatus, labelled with lungs, muscle force, vocal folds, pharyngeal cavity, velum, tongue hump, oral cavity, nasal cavity, oral sound output and nasal sound output]
Oscillation pattern of the vocal folds
[Figure: one oscillation cycle of the vocal folds, labelled with open phase, close phase, opening period and closing period]
• The oscillation results from the balancing of the subglottal pressure, the Bernoulli pressure and the elastic restoring force of the vocal folds.
• Prephonatory position: the initial configuration of the vocal folds before the beginning of oscillation.
Variation of vocal textures
[Figure: three panels, Pressed, Normal and Breathy; each panel shows two waveform plots over a horizontal axis from 0 to 1000, one with a vertical range of 0 to 1 and one with a vertical range of about -0.1 to 0.1]
Simplified human voice production model
[Block diagram: Glottal source with Aspiration noise → Vocal tract filter → Radiation $1 - z^{-1}$]
• Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration.
• The source-tract interaction is neglected, since the glottal impedance is very high most of the time.
Source-filter type synthesis model
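[Block diagram, top: Glottal source with Aspiration noise → Vocal tract filter → Radiation $1 - z^{-1}$]
[Block diagram, bottom: Derivative glottal wave plus $(1 - z^{-1})$-filtered Aspiration noise form the Glottal excitation → Vocal tract filter → Voice output]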
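The step from the first diagram to the second relies on the fact that linear time-invariant filters commute, so the radiation term $1 - z^{-1}$ can be moved from the output to the source. The sketch below checks this numerically in plain Python. The pulse samples and the two filter coefficients are made-up illustration values, not taken from the slides, and the all-pole denominator is written as $1 - \sum_k a_k z^{-k}$ to match the convention used later in the deck.

```python
def difference(x):
    """Apply the radiation / differentiation filter 1 - z^-1 (zero initial state)."""
    return [x[n] - (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def all_pole(x, a):
    """Apply 1 / (1 - sum_k a[k-1] z^-k) with zero initial state."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y


# Hypothetical toy glottal pulse and a stable 2-pole "vocal tract" (assumed values).
source = [0.0, 0.2, 0.5, 0.9, 1.0, 0.6, 0.1, 0.0, 0.0, 0.0]
coeffs = [1.2, -0.5]

radiation_last = difference(all_pole(source, coeffs))    # source -> tract -> radiation
derivative_first = all_pole(difference(source), coeffs)  # derivative source -> tract

max_gap = max(abs(p - q) for p, q in zip(radiation_last, derivative_first))
print(max_gap)  # differs only by floating-point rounding
```

Both orderings give the same output up to rounding, which is why the synthesis model can work directly with the derivative glottal wave.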
Overview of the proposed synthesis model
[Block diagram: Transformed Liljencrants-Fant model → Derivative glottal wave; Noise residual model → High-passed aspiration noise; their sum is the Glottal excitation → All-pole filter → Voice output]
[Figure: derivative glottal wave from the LF model for pressed, normal and breathy phonation; amplitude (-0.05 to 0.05) against time index (0 to 1400)]
Transformed Liljencrants-Fant (LF) model
• The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, Rd (the wave-shape control parameter).
• The transformed LF model is an extension of the LF model. It provides a control interface that makes it easy to change the wave shape of the derivative glottal wave.
Synthesis: Rd → Mapping → Direct synthesis timing parameters → LF model → Derivative glottal wave
Analysis: Estimated derivative glottal wave → LF fitting → Direct synthesis timing parameters → Mapping⁻¹ → Rd (wave-shape control parameter)
Noise residual model
[Block diagram: Gaussian noise generator → Amplitude modulation (parameters An, GCI, L) plus Noise floor Bn → Noise residual]
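One way to read the diagram is as Gaussian noise multiplied by a per-period envelope: a constant noise floor Bn plus a window of height An placed L samples after the glottal closure instant (GCI). The sketch below follows that reading. The Hann window shape, the `duty` parameter and all numeric values are assumptions made for illustration; the slide only names An, Bn, GCI and L.

```python
import math
import random


def modulation_envelope(period, gci, lag, duty, a_n, b_n):
    """Envelope for one glottal period: floor b_n plus a window of height a_n.

    The window starts `lag` samples after the GCI and spans duty * period
    samples. The Hann shape is an assumption, not taken from the slides.
    """
    width = max(1, int(round(duty * period)))
    start = (gci + lag) % period
    env = [b_n] * period
    for j in range(width):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * (j + 0.5) / width)
        env[(start + j) % period] += a_n * w
    return env


def noise_residual(period, gci, lag, duty, a_n, b_n, rng):
    """Amplitude-modulated Gaussian noise for one glottal period."""
    env = modulation_envelope(period, gci, lag, duty, a_n, b_n)
    return [e * rng.gauss(0.0, 1.0) for e in env]


# Hypothetical settings: 100-sample period, GCI at sample 10, lag of 5 samples.
envelope = modulation_envelope(100, 10, 5, 0.3, 2.0, 1.0)
residual = noise_residual(100, 10, 5, 0.3, 2.0, 1.0, random.Random(0))
```

Keeping the envelope separate from the noise generator makes the deterministic part (the modulation contour) easy to inspect on its own.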
Vocal tract filter
• An all-pole filter.
• The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes.
• Sound waves are assumed to propagate as plane waves along the axis of the vocal tract.
[Figure: tube areas A1, A2, ..., AN between the glottis and the lip end (area Alip); signal-flow labels Ug, -1, Ulip, 1-kN, -kN]
Vocal tract filter
Kelly-Lochbaum junction:
$$k_m = \frac{A_m - A_{m+1}}{A_m + A_{m+1}}$$
where $k_m$ is the scattering coefficient between tube sections of area $A_m$ and $A_{m+1}$.
[Signal-flow diagram: travelling waves $U_m^+$, $U_m^-$, $U_{m+1}^+$, $U_{m+1}^-$; branch gains $1 + k_m$, $1 - k_m$, $k_m$, $-k_m$]
• If the sampling period is $T = 2\tau$, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter.
• The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.
$\tau$: the propagation time for a sound wave to travel one acoustic tube.
N: the number of acoustic tubes excluding the glottis and the lip end.
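The conversion from autoregressive coefficients to scattering (reflection) coefficients can be sketched with the standard step-down recursion, which runs the Levinson-Durbin recursion backwards. This is a minimal illustration, not the author's code. It assumes the denominator is written as $A(z) = 1 + \sum_i \alpha_i z^{-i}$; for a denominator written as $1 - \sum_i a_i z^{-i}$, use $\alpha_i = -a_i$. The sign relating these reflection coefficients to the $k_m$ of the junction diagram depends on the wave-variable convention and is not settled here.

```python
def step_up(ks):
    """Reflection coefficients -> polynomial coefficients alpha_1..alpha_N."""
    a = []
    for k in ks:
        m = len(a)
        a = [a[i] + k * a[m - 1 - i] for i in range(m)] + [k]
    return a


def step_down(alphas):
    """Polynomial coefficients alpha_1..alpha_N -> reflection coefficients."""
    a = list(alphas)
    ks = []
    while a:
        k = a[-1]  # the highest-order coefficient is the reflection coefficient
        if abs(k) >= 1.0:
            raise ValueError("filter is not stable: |k| >= 1")
        m = len(a) - 1
        a = [(a[i] - k * a[m - 1 - i]) / (1.0 - k * k) for i in range(m)]
        ks.append(k)
    return ks[::-1]


# Round trip on hypothetical reflection coefficients.
ks_in = [0.5, -0.3, 0.2]
alphas = step_up(ks_in)      # approximately [0.29, -0.23, 0.2]
ks_out = step_down(alphas)   # recovers ks_in up to rounding
```

The stability check falls out of the recursion for free: every reflection coefficient of a stable all-pole filter has magnitude below one.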
Overall synthesis model implementation
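[Block diagram: inputs Degree of breathiness, Glottal excitation strength Ee and Fundamental frequency F0; the Vocal texture model sets Rd for the Transformed LF model (driven by Ee, F0) and controls the Noise residual model; the two outputs are summed and lead to the Output voice; remaining labels: "(No noise input)" and "0.8"]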
Analysis procedure
[Flowchart: Desired voice recording → Source-filter de-convolution → Inverse filtered glottal excitation → De-noising by Wavelet Packet Analysis, which separates the High-passed aspiration noise from the estimated derivative glottal wave → Fitting the estimated derivative glottal wave via LF model → LF model coefficients]
Source-filter de-convolution
• Synthesis model for analysis
[Block diagram: Basic voicing waveform (a, b, OQ) → Low-pass filter $1/(1 - \mu z^{-1})$ → KLGLOTT88 (KL) derivative glottal wave → Nth-order all-pole vocal tract filter]
[Equivalent block diagram: Basic voicing waveform (a, b, OQ), $g(n)$ → (N+1)th-order all-pole filter $1/(1 - \sum_{i=1}^{N+1} a_i z^{-i})$ → $y(n)$]
Source-filter deconvolution estimation flowchart
Phase I
• Input: voice signal after removing the low-frequency drift
• GCI detection
• Loop for each period (one glottal period signal):
  • Loop over different OQ values: vocal tract filter and glottal source estimation via SUMT
  • Select and store the 5 best estimates
Phase II
• Loop for each period: enforce continuity constraints via Dynamic Programming
• Smooth the vocal tract area by time averaging and linear interpolation
• Output: estimated model parameter sequence
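Convex optimization formulation
[Block diagram: Basic voicing waveform (a, b, OQ), $g(n)$ → (N+1)th-order all-pole filter $1/(1 - \sum_{i=1}^{N+1} a_i z^{-i})$ → $y(n)$; inverse filter: $y(n)$ → $1 - \sum_{i=1}^{N+1} \hat{a}_i z^{-i}$ → $\hat{g}(n)$]
$$
\begin{aligned}
g(i) - \hat{g}(i) &= 2a\,(i/F_s) - 3b\,(i/F_s)^2 - \bigl(y(i) - \hat{a}_1 y(i-1) - \dots - \hat{a}_{N+1}\, y(i-N-1)\bigr) \\
&= \bigl[\, y(i-1) \;\; y(i-2) \;\; \dots \;\; y(i-N-1) \;\; 2i/F_s \;\; -3(i/F_s)^2 \,\bigr]\, X - y(i) \\
&= A_i^T X - y(i)
\end{aligned}
$$
• Estimate $X = [\hat{a}_1 \dots \hat{a}_{N+1} \;\; a \;\; b]^T$ by minimizing the error between the basic voicing waveform and the estimated one.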
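Convex optimization formulation
• Error for one glottal cycle in vector form:
$$
\begin{bmatrix} g(1) - \hat{g}(1) \\ \vdots \\ g(m) - \hat{g}(m) \end{bmatrix}
= \begin{bmatrix} A_1^T \\ \vdots \\ A_m^T \end{bmatrix} X - \begin{bmatrix} y(1) \\ \vdots \\ y(m) \end{bmatrix}
= AX - Y
$$
• L2 norm is used.
A convex optimization problem:
Minimize $\|AX - Y\|$
Subject to $a > 0$, $b > 0$, $a = b \cdot OQ \cdot T_0$
The above problem can be solved by SUMT (sequential unconstrained minimization technique).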
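For a fixed OQ, the equality constraint $a = b \cdot OQ \cdot T_0$ can be substituted into the model, which leaves an ordinary least-squares problem in the filter coefficients and b. The sketch below does exactly that on synthetic data. It is not SUMT: it ignores the positivity constraints during the solve and leaves them to be checked afterwards. The function names, the one-pole test filter and every numeric value are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[r]] for r, row in enumerate(matrix)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x


def fit_cycle(y, order, oq, t0, fs):
    """Least-squares fit of [a_hat_1..a_hat_order, b] over the open phase, fixed OQ."""
    m = int(round(oq * t0 * fs))  # open-phase length in samples
    rows, targets = [], []
    for i in range(m):
        t = i / fs
        past = [y[i - k] if i - k >= 0 else 0.0 for k in range(1, order + 1)]
        # With a = b*OQ*T0, g(i) = b * (2*OQ*T0*t - 3*t^2).
        rows.append(past + [2.0 * oq * t0 * t - 3.0 * t * t])
        targets.append(y[i])
    ncol = order + 1
    # Column scaling keeps the normal equations well conditioned.
    scale = [max(abs(r[c]) for r in rows) or 1.0 for c in range(ncol)]
    rows = [[r[c] / scale[c] for c in range(ncol)] for r in rows]
    gram = [[sum(r[p] * r[q] for r in rows) for q in range(ncol)] for p in range(ncol)]
    proj = [sum(r[p] * tv for r, tv in zip(rows, targets)) for p in range(ncol)]
    x = [xs / s for xs, s in zip(solve_linear(gram, proj), scale)]
    b = x[-1]
    return x[:-1], b * oq * t0, b


# Synthetic open phase: basic voicing waveform through a one-pole filter.
fs, t0, oq = 16000.0, 0.01, 0.6
b_true, pole = 1.0e6, 0.9
a_true = b_true * oq * t0
y = []
for i in range(int(round(oq * t0 * fs))):
    t = i / fs
    g = 2.0 * a_true * t - 3.0 * b_true * t * t
    y.append(g + (pole * y[-1] if y else 0.0))

a_hat, a_est, b_est = fit_cycle(y, 1, oq, t0, fs)
```

In the flowchart's outer loop this fit would be repeated for each candidate OQ, keeping the candidates with the smallest residual; a negative `b_est` signals that the unconstrained shortcut is not valid for that candidate.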
Effective analysis/re-synthesis
Baritone examples:
• Normal phonation: original, KLGLOTT88
• Pressed phonation: original, KLGLOTT88
[Block diagram: Basic voicing waveform (a, b, OQ) → Low-pass filter → KLGLOTT88 (KL) derivative glottal wave → Nth-order all-pole vocal tract filter]
De-noising by Wavelet Packet Analysis
• A noisy data record: X = f + W
De-noising by best basis thresholding:
• Transform the noisy data to another basis via Wavelet Packet Analysis: XB = fB + WB
• Threshold out the smaller coefficients of XB, assuming that f can be compactly represented in the new basis by a few large coefficients.
• Select the wavelet filter by an energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
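The energy compactness criterion and the thresholding step can be sketched in a few lines of plain Python. The wavelet packet transform itself is omitted here: the functions take an already-computed list of coefficients. Hard thresholding is an assumption, since the slide does not say whether hard or soft thresholding is used, and the function names are hypothetical.

```python
def energy_compactness(coeffs, fraction=0.9):
    """Return 1 / (number of largest coefficients holding `fraction` of the energy)."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return 0.0
    acc = 0.0
    for count, e in enumerate(energies, start=1):
        acc += e
        if acc >= fraction * total:
            return 1.0 / count
    return 1.0 / len(energies)


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


print(energy_compactness([3.0, 0.0, 0.0, 4.0]))      # 0.5: two coefficients are needed
print(hard_threshold([3.0, -0.2, 0.1, -4.0], 1.0))   # [3.0, 0.0, 0.0, -4.0]
```

A larger compactness score means fewer coefficients carry the signal, so among candidate wavelet filters the one with the highest score would be preferred.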
Effective analysis/re-synthesis
Baritone examples:
• Normal phonation: original, LF
• Pressed phonation: original, LF
Vocal texture control
• The parametric vocal texture control model determines the parameterization of the glottal excitation that achieves the desired vocal texture.
• Reduce the control complexity by exploring the correlations between the model parameters.
[Block diagram: Desired vocal texture and Glottal excitation strength Ee → (?) → Rd (wave-shape control parameter); in non-breathy mode Rd drives the Transformed LF model; in breathy mode Rd drives the Transformed LF model and the Noise residual model is also used]
Pressed and normal modes
Wave-shape control parameter Rd and normalized glottal excitation strength Ee are highly correlated.
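Vocal texture control (non-breathy mode)
$$R_d = a + b\, e^{c \tilde{E}_e}$$
where $\tilde{E}_e$ is the normalized glottal excitation strength.
[Block diagram: Degree of pressedness → interpolation between $(a_{press}, b_{press}, c_{press})$ and $(a_{normal}, b_{normal}, c_{normal})$ → $(a, b, c)$; Glottal excitation strength Ee and $(a, b, c)$ → $R_d = a + b\, e^{c \tilde{E}_e}$ → Rd (wave-shape control parameter) → Transformed LF model → Glottal excitation]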
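The non-breathy control path can be sketched as follows, assuming the slide's relation reads $R_d = a + b\,e^{c \tilde{E}_e}$ and that the "interpolation" between the pressed and normal coefficient sets is linear. Both are assumptions. The two coefficient triples are placeholder numbers chosen only to make the example run; they are not fitted values from the talk.

```python
import math

# Placeholder (a, b, c) triples; real values would come from analysis of recordings.
NORMAL = (1.0, 0.5, -2.0)
PRESSED = (0.3, 0.2, -3.0)


def interpolate_coeffs(pressedness, normal=NORMAL, pressed=PRESSED):
    """Linearly blend the coefficient sets; 0 is normal, 1 is fully pressed."""
    p = min(1.0, max(0.0, pressedness))
    return tuple((1.0 - p) * n + p * q for n, q in zip(normal, pressed))


def wave_shape_parameter(ee_norm, pressedness):
    """Rd from the normalized excitation strength and the degree of pressedness."""
    a, b, c = interpolate_coeffs(pressedness)
    return a + b * math.exp(c * ee_norm)


rd_normal = wave_shape_parameter(0.0, 0.0)   # a + b for the normal set: 1.5
rd_pressed = wave_shape_parameter(0.0, 1.0)  # a + b for the pressed set: 0.5
```

With this structure a single "degree of pressedness" knob and the excitation strength are the only user-facing controls, which matches the stated goal of reducing control complexity.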
Vocal texture control (breathy mode)
• NHR is an indicator for the degree of breathiness.
• The contour of the noise strength is adjusted by NHR.
• NHR per glottal cycle = high-passed noise energy / glottal excitation strength Ee
$$R_d = \frac{\ln\bigl((\mathrm{NHR} - a)/b\bigr)}{c}$$
$$A_n = 2.4138\, B_n + 0.213$$
[Block diagram: Desired vocal texture → NHR → Rd → Transformed LF model; NHR also sets the Noise residual model (window lag, duty cycle, gain, Bn = 1, An as above); the two outputs are summed to give the Glottal excitation]
Overall synthesis model implementation
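[Block diagram: inputs Degree of breathiness, Glottal excitation strength Ee and Fundamental frequency F0; the Vocal texture model sets Rd for the Transformed LF model (driven by Ee, F0) and controls the Noise residual model; their sum is the Glottal excitation, which leads to the Output voice; remaining label: "0.8"]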
Contributions
A pseudo-physical model for singing voice synthesis which
• is an approximate physical model.
• can generate high-quality non-nasal singing voice.
• has analysis/re-synthesis ability.
• is computationally affordable.
• provides flexible control of vocal textures.
An automatic analysis procedure for analysis/re-synthesis
A parametric model for vocal texture control
Future research
• Build a complete score-to-singing system using the proposed synthesis model. Its associated analysis procedure will be used to construct the parametric database.
• Investigate the potential use of the source-filter deconvolution algorithm for low-bit-rate, high-quality speech coding.
• Explore the application of the analysis procedure to sound transformation of vocal textures.