Toward a high-quality singing synthesizer with vocal texture control

Hui-Ling Lu

Center for Computer Research in Music and Acoustics (CCRMA)

Stanford University, Stanford, CA 94305, USA

Score-to-Singing system

• Inputs: score, lyrics, singing style

• Rule system (lyrics-to-phoneme rules, musical rules) → phoneme, F0, sound level, duration, vibrato

• Sound synthesis (acoustic rendering, co-articulation rules), using a parametric database → singing voice

General sound synthesis approaches

• Physical modeling
  Pros: flexible/intuitive control; expressive; co-articulation easy
  Cons: analysis/re-synthesis difficult; invasive measurements

• Spectral modeling
  Pros: analysis/re-synthesis easy
  Cons: less expressive; co-articulation difficult

• Source-filter model

Contributions

A pseudo-physical model for singing voice synthesis which

• is an approximate physical model.

• can generate high-quality non-nasal singing voice.

• has analysis/re-synthesis ability.

• is computationally affordable.

• provides flexible control of vocal textures.

An automatic analysis procedure for analysis/re-synthesis

A parametric model for vocal texture control

Outline

• Human voice production system

• Synthesis model

• Analysis procedure

• Vocal texture parametric model

• Vocal texture control demo

• Contributions and future directions

The human voice production system

[Figure: schematic of the voice production system. Labels: lungs, muscle force, vocal folds, pharyngeal cavity, tongue hump, velum, oral cavity, nasal cavity, oral sound output, nasal sound output.]

Oscillation pattern of the vocal folds

[Figure: one oscillation cycle of the vocal folds. Labels: open phase, closed phase, opening period, closing period.]

• The oscillation results from the balance of the subglottal pressure, the Bernoulli pressure, and the elastic restoring force of the vocal folds.

• Prephonatory position: the initial configuration of the vocal folds before oscillation begins.

Variation of vocal textures

[Figure: waveform plots over time index 0 to 1000 for three vocal textures: pressed, normal, and breathy. Each texture has an upper panel (range 0 to 1) and a lower panel (range about -0.1 to 0.1).]

Simplified human voice production model

[Block diagram: glottal source and aspiration noise → vocal tract filter → radiation $(1 - z^{-1})$]

• Source-tract interaction: the glottal waveform in general depends on the vocal tract configuration.

• The source-tract interaction is neglected, since the glottal impedance is very high most of the time.

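Source-filter type synthesis model

[Block diagram, first form: glottal source and aspiration noise → vocal tract filter → radiation $(1 - z^{-1})$]

[Block diagram, equivalent form: derivative glottal wave plus aspiration noise filtered by $(1 - z^{-1})$ = glottal excitation → vocal tract filter → voice output]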
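The vocal tract filter and the radiation term $(1 - z^{-1})$ are both linear and time-invariant, so their order can be swapped. That is what allows the derivative glottal wave to act as the excitation. The following is a minimal sketch of that equivalence; the pulse shape and the second-order filter coefficients are hypothetical values chosen only for illustration.

```python
def all_pole(x, a):
    """All-pole filter y[n] = x[n] + sum_k a[k-1] * y[n-k], zero initial state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y


def radiation(x):
    """Radiation term 1 - z^-1 (first difference), zero initial state."""
    return [xn - (x[n - 1] if n > 0 else 0.0) for n, xn in enumerate(x)]


# Hypothetical glottal-flow-like pulse train (triangular pulses, period 40 samples).
flow = [max(0.0, 1.0 - abs((n % 40) - 12) / 12.0) for n in range(200)]
# Hypothetical stable second-order "vocal tract" filter.
tract = [1.3, -0.8]

voice_first_form = radiation(all_pole(flow, tract))   # flow -> tract -> radiation
voice_second_form = all_pole(radiation(flow), tract)  # derivative flow -> tract

max_diff = max(abs(p - q) for p, q in zip(voice_first_form, voice_second_form))
print(max_diff)
```

Both orderings produce the same output up to floating-point rounding, because both filters start from a zero state.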

Overview of the proposed synthesis model

[Block diagram: derivative glottal wave (transformed Liljencrants-Fant model) + high-passed aspiration noise (noise residual model) = glottal excitation → all-pole filter → voice output]

[Figure: derivative glottal wave from the LF model for pressed, normal, and breathy phonation. Axes: amplitude (-0.05 to 0.05) versus time index (0 to 1400).]

Transformed Liljencrants-Fant (LF) model

• The transformed LF model controls the wave shape of the derivative glottal wave via a single parameter, Rd (the wave-shape control parameter).

• The transformed LF model is an extension of the LF model. It provides a control interface that makes it easy to change the wave shape of the derivative glottal wave.

Synthesis: Rd (wave-shape control parameter) → mapping → direct synthesis timing parameters → LF model → derivative glottal wave

Analysis: estimated derivative glottal wave → LF fitting → direct synthesis timing parameters → inverse mapping → Rd (wave-shape control parameter)

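Liljencrants-Fant (LF) model

$$
g(t) =
\begin{cases}
E_0 \, e^{\alpha t} \sin(\omega_g t), & 0 \le t \le T_e \\[4pt]
-\dfrac{E_e}{\varepsilon T_a} \left[ e^{-\varepsilon (t - T_e)} - e^{-\varepsilon (T_c - T_e)} \right], & T_e \le t \le T_c \le T_0
\end{cases}
$$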
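The two-segment LF waveform can be evaluated directly once its parameters are known. The sketch below assumes the standard LF form: an exponentially growing sinusoid up to $T_e$, followed by an exponential return phase up to $T_c$. It is a simplification: $\alpha$ is treated as a free input rather than solved from the zero-net-flow condition, $E_0$ is chosen so that $g(T_e) = -E_e$, and $\varepsilon$ is found by fixed-point iteration on $\varepsilon T_a = 1 - e^{-\varepsilon (T_c - T_e)}$. All numeric values are hypothetical.

```python
import math


def solve_epsilon(Ta, Te, Tc, iterations=50):
    """Fixed-point iteration for eps * Ta = 1 - exp(-eps * (Tc - Te))."""
    eps = 1.0 / Ta
    for _ in range(iterations):
        eps = (1.0 - math.exp(-eps * (Tc - Te))) / Ta
    return eps


def lf_derivative(t, Ee, alpha, omega_g, Te, Ta, Tc):
    """Derivative glottal wave g(t) for one period (times in seconds)."""
    if 0.0 <= t <= Te:
        # E0 chosen so the first segment reaches -Ee at t = Te.
        E0 = -Ee / (math.exp(alpha * Te) * math.sin(omega_g * Te))
        return E0 * math.exp(alpha * t) * math.sin(omega_g * t)
    if Te < t <= Tc:
        eps = solve_epsilon(Ta, Te, Tc)
        return -(Ee / (eps * Ta)) * (
            math.exp(-eps * (t - Te)) - math.exp(-eps * (Tc - Te))
        )
    return 0.0  # closed phase after Tc


# Hypothetical parameters for a 125 Hz period (T0 = 8 ms).
T0 = 0.008
params = dict(Ee=1.0, alpha=300.0, omega_g=math.pi / 0.0036,
              Te=0.0044, Ta=0.0003, Tc=T0)
wave = [lf_derivative(n * T0 / 400, **params) for n in range(400)]
```

With these choices the two segments meet at $-E_e$ at $t = T_e$, and the return phase decays to zero at $T_c$.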


Noise residual model

[Block diagram: Gaussian noise generator → amplitude modulation (controls: An, GCI, L) + noise floor Bn → noise residual]

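Vocal tract filter

• An all-pole filter.

• The vocal tract is assumed to be a series of concatenated uniform lossless cylindrical acoustic tubes.

• Sound waves are assumed to propagate as plane waves along the axis of the vocal tract.

[Figure: tube sections with areas A1, A2, ..., AN, Alip running from the glottis to the lip end. The glottal flow Ug enters at the glottis, which has reflection gain -1; the lip end has gains 1 - kN and -kN and outputs Ulip.]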

Vocal tract filter

Kelly-Lochbaum junction between sections with areas $A_m$ and $A_{m+1}$:

Scattering coefficient: $k_m = \dfrac{A_m - A_{m+1}}{A_m + A_{m+1}}$

[Figure: junction with forward waves $U_m^+$, $U_{m+1}^+$ and backward waves $U_m^-$, $U_{m+1}^-$; branch gains $1 - k_m$, $1 + k_m$, $-k_m$, and $k_m$.]

• If the sampling period is $T = 2\tau$, the transfer function of the vocal tract acoustic tubes can be shown to be an Nth-order all-pole filter.

• The autoregressive coefficients of the vocal tract filter can be converted to scattering coefficients by Durbin's method.

$\tau$: the propagation time for a sound wave to travel one acoustic tube. N: the number of acoustic tubes, excluding the glottis and the lip end.
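The relation between tube areas and scattering coefficients is simple to compute. The sketch below assumes the convention $k_m = (A_m - A_{m+1}) / (A_m + A_{m+1})$, which matches the junction gains $1 - k_m$ and $-k_m$ shown for the lip end. It also shows how the condition $T = 2\tau$ ties the number of tubes to the sampling rate. The tract length (0.175 m) and speed of sound (350 m/s) are illustrative assumptions, and the function names are hypothetical.

```python
def scattering_coefficients(areas):
    """k_m = (A_m - A_{m+1}) / (A_m + A_{m+1}) for each adjacent pair of tubes."""
    return [(am - an) / (am + an) for am, an in zip(areas, areas[1:])]


def areas_from_scattering(first_area, ks):
    """Invert the relation: A_{m+1} = A_m * (1 - k_m) / (1 + k_m)."""
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


def tube_count(length_m, fs_hz, c=350.0):
    """With T = 2*tau and tau = length / (N * c), N = 2 * length * fs / c."""
    return round(2.0 * length_m * fs_hz / c)


areas = [2.0, 2.0, 1.0, 3.0]           # hypothetical cross-sections (cm^2)
ks = scattering_coefficients(areas)     # equal areas give k = 0
rebuilt = areas_from_scattering(areas[0], ks)
n_tubes = tube_count(0.175, 10000.0)    # tubes needed at a 10 kHz sampling rate
```

Only area ratios matter for the scattering coefficients, so the absolute scale of the area function has to be fixed separately (here by the first area).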

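Overall synthesis model implementation

[Block diagram: the inputs are the fundamental frequency F0, the glottal excitation strength Ee, and the degree of breathiness. The vocal texture model supplies Rd to the transformed LF model, which also takes Ee and F0, and drives the noise residual model. The outputs of the transformed LF model and the noise residual model are summed and lead to the output voice. Other labels in the diagram: "(No noise input)" and a gain of 0.8.]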

Analysis procedure

[Flow diagram: desired voice recording → source-filter de-convolution → inverse filtered glottal excitation → de-noising by Wavelet Packet Analysis → fitting the estimated derivative glottal wave via the LF model → LF model coefficients]


Source-filter de-convolution

• Synthesis model for analysis

[Block diagram: basic voicing waveform (a, b, OQ) → low-pass filter $1/(1 - \mu z^{-1})$ → Nth-order all-pole vocal tract filter. The low-pass filter and the vocal tract filter combine into a single (N+1)th-order all-pole filter:]

$g(n)$ → $\dfrac{1}{1 - \sum_{i=1}^{N+1} a_i z^{-i}}$ → $y(n)$

KLGLOTT88 (KL) derivative glottal wave

$$
g(n) =
\begin{cases}
2a \, n/F_s - 3b \, (n/F_s)^2, & 0 \le n \le T_0 \cdot OQ \cdot F_s \\[4pt]
0, & T_0 \cdot OQ \cdot F_s < n \le T_0 \cdot F_s
\end{cases}
$$

$$
a = \frac{27 \, AV}{4 \, OQ^2 \, T_0}, \qquad b = \frac{27 \, AV}{4 \, OQ^3 \, T_0^2}
$$
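The KLGLOTT88 derivative glottal wave is a quadratic in time during the open phase and zero during the closed phase. The sketch below assumes the standard KLGLOTT88 definitions $a = 27\,AV / (4\,OQ^2\,T_0)$ and $b = 27\,AV / (4\,OQ^3\,T_0^2)$, with AV the amplitude of voicing, OQ the open quotient, and $T_0$ the period. The numeric values are hypothetical.

```python
def kl_coefficients(AV, OQ, T0):
    """Polynomial coefficients of the KLGLOTT88 basic voicing waveform."""
    a = 27.0 * AV / (4.0 * OQ ** 2 * T0)
    b = 27.0 * AV / (4.0 * OQ ** 3 * T0 ** 2)
    return a, b


def kl_derivative_glottal_wave(AV, OQ, T0, Fs):
    """One period of g(n): 2a(n/Fs) - 3b(n/Fs)^2 while open, 0 while closed."""
    a, b = kl_coefficients(AV, OQ, T0)
    n_period = int(round(T0 * Fs))
    wave = []
    for n in range(n_period):
        t = n / Fs
        wave.append(2.0 * a * t - 3.0 * b * t * t if t <= OQ * T0 else 0.0)
    return wave


AV, OQ, T0, Fs = 1.0, 0.6, 0.01, 10000.0   # hypothetical: 100 Hz at 10 kHz
a, b = kl_coefficients(AV, OQ, T0)
wave = kl_derivative_glottal_wave(AV, OQ, T0, Fs)

# The glottal flow a*t^2 - b*t^3 returns to zero at the end of the open
# phase, which is equivalent to the linear relation a = b * OQ * T0.
t_close = OQ * T0
flow_at_closure = a * t_close ** 2 - b * t_close ** 3
```

The relation $a = b \cdot OQ \cdot T_0$ is linear in $(a, b)$ for a fixed OQ, so the waveform shape is governed by a single free amplitude once OQ is chosen.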

Source-filter deconvolution estimation flowchart

• Input: voice signal after removing the low-frequency drift

• GCI detection

• Phase I, loop for each period:
    • Take one glottal period of the signal.
    • Loop over different OQ values: estimate the vocal tract filter and the glottal source via SUMT.
    • Select and store the 5 best estimates.

• Phase II, loop for each period:
    • Enforce continuity constraints via dynamic programming.

• Smooth the vocal tract area by time averaging and linear interpolation.

• Output: estimated model parameter sequence
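Phase II picks one estimate per period from the stored candidates so that the sequence stays continuous. The following is a minimal dynamic-programming sketch of that idea, not the procedure used in this work: it assumes each period has a list of scalar candidate values with fitting errors, and that the continuity penalty is a caller-supplied distance between consecutive choices. All names and numbers are hypothetical.

```python
def select_smooth_path(candidates, fit_costs, distance):
    """Choose one candidate index per period, minimizing the total of
    fitting costs plus distances between consecutive choices."""
    best = [list(fit_costs[0])]   # best[t][j]: cheapest path ending at (t, j)
    back = []                     # back[t-1][j]: predecessor index of (t, j)
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for j, cand in enumerate(candidates[t]):
            options = [best[t - 1][i] + distance(prev, cand)
                       for i, prev in enumerate(candidates[t - 1])]
            i_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[i_best] + fit_costs[t][j])
            row_back.append(i_best)
        best.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final state.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for row_back in reversed(back):
        j = row_back[j]
        path.append(j)
    path.reverse()
    return path


# Hypothetical example: two candidates per period for three periods.
candidates = [[0.50, 0.90], [0.52, 0.10], [0.55, 0.95]]
fit_costs = [[0.1, 0.0], [0.1, 0.0], [0.1, 0.0]]
path = select_smooth_path(candidates, fit_costs, lambda p, q: abs(p - q))
```

In this example the second candidate always fits slightly better on its own, but it jumps between periods, so the smooth sequence of first candidates wins.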

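Convex optimization formulation

[Block diagram, synthesis: $g(n)$ → $\dfrac{1}{1 - \sum_{i=1}^{N+1} a_i z^{-i}}$ → $y(n)$]

[Block diagram, inverse filter: $y(n)$ → $1 - \sum_{i=1}^{N+1} \hat{a}_i z^{-i}$ → $\hat{g}(n)$]

• Estimate $X = [\hat{a}_1 \ \ldots \ \hat{a}_{N+1} \ \ a \ \ b]^T$ by minimizing the error between the basic voicing waveform and the estimated one. For a sample $i$ in the open phase:

$$
\begin{aligned}
g(i) - \hat{g}(i) &= 2a \, i/F_s - 3b \, (i/F_s)^2 - \bigl( y(i) - \hat{a}_1 y(i-1) - \ldots - \hat{a}_{N+1} y(i-N-1) \bigr) \\
&= \bigl[ \, y(i-1) \ \ y(i-2) \ \ \ldots \ \ y(i-N-1) \ \ \ 2i/F_s \ \ -3(i/F_s)^2 \, \bigr] X - y(i) \\
&= A_i^T X - y(i)
\end{aligned}
$$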

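Convex optimization formulation

• Error for one glottal cycle in vector form:

$$
\begin{bmatrix} g(1) - \hat{g}(1) \\ \vdots \\ g(m) - \hat{g}(m) \end{bmatrix}
=
\begin{bmatrix} A_1^T \\ \vdots \\ A_m^T \end{bmatrix} X
-
\begin{bmatrix} y(1) \\ \vdots \\ y(m) \end{bmatrix}
= AX - Y
$$

• L2 norm is used

A convex optimization problem

Minimize $\| AX - Y \|_2$

Subject to $a > 0, \quad b > 0, \quad a = T_0 \cdot OQ \cdot b$

The above problem can be solved by SUMT (sequential unconstrained minimization technique).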
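The error $AX - Y$ is linear in the unknowns, which is what makes the problem convex for a fixed OQ. The sketch below only builds the rows $A_i$ and evaluates the residual on synthetic data; it does not implement SUMT. It assumes the all-pole convention $y(i) = g(i) + \sum_k a_k y(i-k)$ and a KLGLOTT88 source. The helper names and all numeric values are hypothetical.

```python
def kl_source(i, a, b, OQ, T0, Fs):
    """KLGLOTT88 derivative glottal wave sample (zero in the closed phase)."""
    t = i / Fs
    return 2.0 * a * t - 3.0 * b * t * t if t <= OQ * T0 else 0.0


def design_row(y, i, order, OQ, T0, Fs):
    """A_i = [y(i-1) ... y(i-order), 2i/Fs, -3(i/Fs)^2], source columns
    zeroed in the closed phase where g(i) = 0."""
    t = i / Fs
    past = [y[i - k] if i - k >= 0 else 0.0 for k in range(1, order + 1)]
    source = [2.0 * t, -3.0 * t * t] if t <= OQ * T0 else [0.0, 0.0]
    return past + source


def residual(y, X, order, OQ, T0, Fs):
    """Entries of A X - Y for one glottal cycle."""
    return [sum(r * x for r, x in zip(design_row(y, i, order, OQ, T0, Fs), X))
            - y[i] for i in range(len(y))]


# Synthetic cycle: hypothetical 2nd-order filter driven by a KL source.
AV, OQ, T0, Fs = 1.0, 0.6, 0.01, 10000.0
a = 27.0 * AV / (4.0 * OQ ** 2 * T0)
b = 27.0 * AV / (4.0 * OQ ** 3 * T0 ** 2)
ar = [1.3, -0.8]
y = []
for i in range(int(round(T0 * Fs))):
    acc = kl_source(i, a, b, OQ, T0, Fs)
    for k, ak in enumerate(ar, start=1):
        if i - k >= 0:
            acc += ak * y[i - k]
    y.append(acc)

err_true = max(abs(e) for e in residual(y, ar + [a, b], 2, OQ, T0, Fs))
err_wrong = max(abs(e) for e in residual(y, [1.2, -0.8, a, b], 2, OQ, T0, Fs))
```

With the true parameters the residual vanishes up to rounding, while a perturbed filter coefficient leaves a clearly non-zero error. A solver would search over $X$ under the constraints for each trial OQ.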

De-convolution result (synthetic data)

Effective analysis/re-synthesis

Baritone examples:

• Normal phonation: original, KLGLOTT88

• Pressed phonation: original, KLGLOTT88


De-noising by Wavelet Packet Analysis

• A noisy data record: $X = f + W$

De-noising by best basis thresholding:

• Transform the noisy data to another basis via Wavelet Packet Analysis: $X_B = f_B + W_B$

• Threshold out the smaller coefficients of $X_B$, assuming that $f$ can be compactly represented in the new basis by a few large coefficients.

• Select the wavelet filter by an energy compactness criterion: 1/(number of coefficients needed to accumulate 0.9 of the total energy).
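The thresholding idea and the energy compactness measure can be illustrated on a much simpler transform. The sketch below swaps in a plain orthonormal Haar wavelet transform in place of a best-basis wavelet packet decomposition, and uses a hard threshold chosen by hand. It assumes the signal length is a power of two. The function names and test signal are hypothetical.

```python
import math


def haar_forward(x):
    """Orthonormal multi-level Haar transform (length must be a power of two)."""
    out = list(x)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + dif
        n = half
    return out


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for s, d in zip(out[:n], out[n:2 * n]):
            merged += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        out[:2 * n] = merged
        n *= 2
    return out


def energy_compactness(coeffs, fraction=0.9):
    """1 / (number of largest coefficients needed to reach `fraction` of the energy)."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    target = fraction * sum(energies)
    acc = 0.0
    for count, e in enumerate(energies, start=1):
        acc += e
        if acc >= target:
            return 1.0 / count
    return 1.0 / len(energies)


def denoise(x, threshold):
    """Hard-threshold the transform coefficients and reconstruct."""
    kept = [c if abs(c) >= threshold else 0.0 for c in haar_forward(x)]
    return haar_inverse(kept)


clean = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
noisy = [4.1, 3.9, 4.1, 3.9, 0.1, -0.1, 0.1, -0.1]
compactness = energy_compactness(haar_forward(clean))
restored = denoise(noisy, threshold=0.5)
```

A higher compactness value means fewer coefficients carry most of the energy, which is the property used to choose among candidate wavelet filters.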

De-noising result (synthetic data)


Effective analysis/re-synthesis

Baritone examples:

• Normal phonation: original, LF

• Pressed phonation: original, LF

Vocal texture control

• The parametric vocal texture control model determines the parameterization of the glottal excitation needed to achieve the desired vocal texture.

• Control complexity is reduced by exploiting the correlations between the model parameters.

[Block diagram: the desired vocal texture and the glottal excitation strength Ee determine the wave-shape control parameter Rd for the transformed LF model and the settings of the noise residual model, in either non-breathy mode or breathy mode.]

Pressed and normal modes

• The wave-shape control parameter Rd and the normalized glottal excitation strength Ee are highly correlated.

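Vocal texture control (non-breathy mode)

$$
R_d = a + b \, e^{c \tilde{E}_e}
$$

where $\tilde{E}_e$ is the normalized glottal excitation strength.

[Block diagram: the degree of pressness interpolates between $(a_{press}, b_{press}, c_{press})$ and $(a_{normal}, b_{normal}, c_{normal})$ to give $(a, b, c)$. Together with the glottal excitation strength Ee, these determine the wave-shape control parameter Rd, which drives the transformed LF model to produce the glottal excitation.]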
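The non-breathy mapping can be sketched in a few lines, assuming the exponential form $R_d = a + b\,e^{c \tilde{E}_e}$ and a linear interpolation of the coefficient triplet between the pressed and normal settings. The interpolation rule and every coefficient value below are hypothetical placeholders, not fitted values from this work.

```python
import math

# Hypothetical coefficient triplets (a, b, c); real values would come from
# fitting analysed recordings of pressed and normal phonation.
PRESSED = (0.3, 2.0, -3.0)
NORMAL = (0.8, 2.5, -2.5)


def interpolate_coefficients(pressness):
    """Blend the triplets: pressness = 1 gives PRESSED, 0 gives NORMAL."""
    return tuple(p * pressness + n * (1.0 - pressness)
                 for p, n in zip(PRESSED, NORMAL))


def wave_shape_parameter(Ee_norm, pressness):
    """R_d = a + b * exp(c * Ee_norm) for the interpolated (a, b, c)."""
    a, b, c = interpolate_coefficients(pressness)
    return a + b * math.exp(c * Ee_norm)


rd_pressed = wave_shape_parameter(0.5, 1.0)
rd_normal = wave_shape_parameter(0.5, 0.0)
rd_between = wave_shape_parameter(0.5, 0.5)
```

With a mapping like this, the singer-facing controls reduce to the excitation strength and one degree-of-pressness value, and Rd follows from them.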

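Vocal texture control (breathy mode)

• NHR (noise-to-harmonics ratio) is an indicator of the degree of breathiness.

• The contour of the noise strength is adjusted by NHR.

$$
R_d = \frac{\ln\bigl( (NHR - a) / b \bigr)}{c}
$$

[Block diagram: desired vocal texture → NHR → Rd → transformed LF model (with Ee). NHR also sets the noise residual model controls (window lag, duty cycle, gain, with $B_n = 1$). The outputs of the two models are summed to form the glottal excitation.]

• NHR per glottal cycle is computed from the high-passed noise energy.

• $A_n = 2.4138 \, B_n + 0.213$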
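The breathy-mode controls can be sketched as follows. The relation $A_n = 2.4138\,B_n + 0.213$ is taken from the slide; everything else is an assumption made for illustration: the inverse mapping $R_d = \ln((NHR - a)/b)/c$ with hypothetical coefficients, a Hann-shaped modulation window placed `lag` samples after the GCI, an envelope formed as $B_n + A_n \cdot$ window, and wrap-around within one glottal cycle.

```python
import math
import random


def noise_amplitude(Bn):
    """Modulation amplitude from the noise floor: An = 2.4138 * Bn + 0.213."""
    return 2.4138 * Bn + 0.213


def rd_from_nhr(nhr, a, b, c):
    """Assumed inverse mapping Rd = ln((NHR - a) / b) / c."""
    return math.log((nhr - a) / b) / c


def noise_envelope(n_samples, gci_index, lag, duty_cycle, Bn):
    """Assumed envelope: noise floor Bn plus a Hann window scaled by An."""
    An = noise_amplitude(Bn)
    width = max(1, int(round(duty_cycle * n_samples)))
    start = gci_index + lag
    env = []
    for n in range(n_samples):
        pos = (n - start) % n_samples      # wrap within the glottal cycle
        if pos < width:
            w = 0.5 - 0.5 * math.cos(2.0 * math.pi * (pos + 0.5) / width)
        else:
            w = 0.0
        env.append(Bn + An * w)
    return env


def noise_residual(n_samples, gci_index, lag, duty_cycle, Bn, gain, seed=0):
    """Amplitude-modulated Gaussian noise for one glottal cycle."""
    rng = random.Random(seed)
    env = noise_envelope(n_samples, gci_index, lag, duty_cycle, Bn)
    return [gain * e * rng.gauss(0.0, 1.0) for e in env]


An = noise_amplitude(1.0)
env = noise_envelope(n_samples=100, gci_index=60, lag=5, duty_cycle=0.3, Bn=1.0)
cycle = noise_residual(100, 60, 5, 0.3, Bn=1.0, gain=0.01)

# Round trip through the assumed mapping with hypothetical coefficients.
a, b, c = 0.01, 0.02, 1.2
nhr = a + b * math.exp(c * 1.5)
rd = rd_from_nhr(nhr, a, b, c)
```

The window position and width correspond to the "window lag" and "duty cycle" controls, and the overall gain scales the noise contribution relative to the derivative glottal wave.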

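Overall synthesis model implementation

[Block diagram: the inputs are the fundamental frequency F0, the glottal excitation strength Ee, and the degree of breathiness. The vocal texture model supplies Rd to the transformed LF model, which also takes Ee and F0, and drives the noise residual model. The two model outputs are summed (with a gain of 0.8 shown in the diagram) to form the glottal excitation, which leads to the output voice.]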

Vocal texture control demo

Application

Contributions

A pseudo-physical model for singing voice synthesis which

• is an approximate physical model.

• can generate high-quality non-nasal singing voice.

• has analysis/re-synthesis ability.

• is computationally affordable.

• provides flexible control of vocal textures.

An automatic analysis procedure for analysis/re-synthesis

A parametric model for vocal texture control

Future research

• Build a complete score-to-singing system using the proposed synthesis model. Its associated analysis procedure will be used to construct the parametric database.

• Investigate the potential use of the source-filter deconvolution algorithm for low-bit-rate, high-quality speech coding.

• Explore the application of the analysis procedure to sound transformation of vocal textures.

Thank you !
