Voice Conversion
Dr. Elizabeth Godoy, Speech Processing Guest Lecture, December 11, 2012


Page 1

Voice Conversion

Dr. Elizabeth Godoy, Speech Processing Guest Lecture

December 11, 2012

Page 2

E.Godoy Biography

I. United States (Rhode Island): Native. Hometown: Middletown, RI; undergrad & Masters at MIT (Boston area)

II. France (Lannion): 2007-2011. Worked on my PhD at Orange Labs

III. Greece (Heraklion): Current. Work at FORTH with Prof. Stylianou

Page 3

Professional Background

I. B.S. & M.Eng. from MIT, Electrical Engineering; specialty in Signal Processing

Underwater acoustics: target physics, environmental modeling, torpedo homing

Antenna beamforming (Masters): wireless networks

II. PhD in Signal Processing at Orange Labs; Speech Processing: Voice Conversion

Speech Synthesis Team (Text-to-Speech); focus on Spectral Envelope Transformation

III. Post-Doctoral Research at FORTH; LISTA: Speech in Noise & Intelligibility

Analyses of human speaking styles (e.g. Lombard, clear); speech modifications to improve intelligibility

Page 4

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

II. Spectral Envelope Transformation in VC: Standard: Gaussian Mixture Model; Proposed: Dynamic Frequency Warping + Amplitude Scaling

III. Conversion Results: Objective Metrics & Subjective Evaluations; Sound Samples

IV. Summary & Conclusions

Page 5

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

Page 6

Voice Conversion (VC)


Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.

[Diagram: the source speaker says "This is awesome!"; after Voice Conversion, the output "This is awesome!" sounds like the target speaker, who reacts "He sounds like me!"]

Page 7

Context: Speech Synthesis


Increase in applications using speech technologies: cell phones, GPS, video gaming, customer service apps…

Information communicated through speech!

Text-to-Speech (TTS) Synthesis: generate speech from a given text

[Examples of TTS output: "Turn left!", "Insert your card.", "Next Stop: Lannion", "This is Abraham Lincoln speaking… Ha ha!"]

Page 8

Text-to-Speech (TTS) Systems


TTS Approaches

1. Concatenative: speech synthesized from recorded segments

Unit-Selection: parts of speech chosen from corpora & strung together

High-quality synthesis, but need to record & process corpora

2. Parametric: speech generated from model parameters

HMM-based: speaker models built from speech using linguistic info

Limited quality due to simplified speech modeling & statistical averaging

Concatenative or Parametric?

Page 9

Text-to-Speech (TTS) Example

[Embedded audio example of a synthesized TTS voice]

Page 10

Voice Conversion: TTS Motivation


Concatenative speech synthesis: high-quality speech, but a large corpus must be recorded & processed for each voice

Voice Conversion: create different voices by speech-to-speech transformation; focus on the acoustics of voices

Page 11

What gives a voice an identity?


“Voice” carries the notion of identity (voice rather than speech)

Characterize speech based on different levels:

1. Segmental: Pitch – fundamental frequency; Timbre – distinguishes between different types of sounds

2. Supra-Segmental: Prosody – intonation & rhythm of speech

Page 12

Goals of Voice Conversion


1. Synthesize High-Quality Speech: maintain the quality of the source speech (limit degradations)

2. Capture Target Speaker Identity: requires learning mappings between source & target features

Difficult task! Significant modifications of the source speech are needed, and these risk severely degrading speech quality…

Page 13

Stages of Voice Conversion

Key Parameter: the spectral envelope (relation to timbre)

[Diagram: corpora of source & target speech → parameter extraction → learning → transformation → converted-speech synthesis]

1) Analysis, 2) Learning, 3) Transformation

LEARNING: generate acoustic feature spaces; establish mappings between source & target parameters

TRANSFORMATION: classify source parameters in the feature space; apply the transformation function

Page 14

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

II. Spectral Envelope Transformation in VC: Standard: Gaussian Mixture Model; Proposed: Dynamic Frequency Warping + Amplitude Scaling

Page 15

The Spectral Envelope

Spectral Envelope: a curve approximating the DFT magnitude

Related to voice timbre; plays a key role in many speech applications: coding, recognition, synthesis, voice transformation/conversion

Voice Conversion: important for both speech quality and voice identity

[Figure: spectral envelope overlaid on a DFT magnitude spectrum, 0–8000 Hz, magnitude in dB]
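
To make the idea concrete, here is a minimal numpy sketch of one simple way to obtain an envelope-like curve: smooth the log DFT magnitude of a single frame by cepstral liftering. This is an illustration only, not the discrete cepstral fitting used in the lecture; the frame length, FFT size, and cepstral order are arbitrary assumptions.

```python
import numpy as np

def cepstral_envelope(frame, order=40, n_fft=1024):
    """Smooth the log DFT magnitude of a windowed frame by keeping only the
    first `order` cepstral coefficients (cepstral liftering)."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)            # log |DFT|
    cepstrum = np.fft.irfft(log_mag)                       # real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:order] = 1.0                                    # keep low quefrencies
    lifter[-order + 1:] = 1.0                               # symmetric half
    return np.fft.rfft(cepstrum * lifter).real              # smoothed log magnitude

# Stand-in frame: 25 ms of noise at 16 kHz (a real speech frame would go here)
envelope = cepstral_envelope(np.random.randn(400))
```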

Page 16

Spectral Envelope Parameterization

Two common methods

1) Cepstrum
- Discrete Cepstral Coefficients
- Mel-Frequency Cepstral Coefficients (MFCC): change the frequency scale to reflect the bands of human hearing

2) Linear Prediction (LP)
- Line Spectral Frequencies (LSF)
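
A small sketch of these two parameterizations, assuming the `librosa` library is available (an assumption; the lecture does not name a toolkit). The synthetic noise signal is only a stand-in for real speech.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float64)        # stand-in for 1 s of speech

# 1) Cepstrum-based: Mel-Frequency Cepstral Coefficients
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)

# 2) Linear-Prediction-based: AR coefficients for one 25 ms frame
frame = y[:400]
a = librosa.lpc(frame, order=18)                   # [1, a1, ..., a18]
# LP spectral envelope ~ 1/|A(e^{jw})|; LSFs are a further re-encoding of A(z)
w = np.linspace(0, np.pi, 512)
lp_envelope = 1.0 / np.abs(np.polyval(a, np.exp(1j * w)))
```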

Page 17

Standard Voice Conversion


Parallel corpora: source & target utter the same sentences

Parameters are spectral features (e.g. vectors of cepstral coefficients)

Alignment of speech frames in time

Standard: Gaussian Mixture Model

[Diagram: parallel corpora → extract source & target parameters → align source & target frames → LEARNING → TRANSFORMATION → converted-speech synthesis]

Focus: Learning & Transforming the Spectral Envelope

Page 18

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

II. Spectral Envelope Transformation in VC: Standard: Gaussian Mixture Model

Formulation & Limitations:
1. Acoustic mappings between source & target parameters
2. Over-smoothing of the spectral envelope

Proposed: Dynamic Frequency Warping + Amplitude Scaling

Page 19

Gaussian Mixture Model for VC


Origins: evolved from "fuzzy" Vector Quantization (i.e. VQ with "soft" classification); originally proposed by [Stylianou et al; 98]; joint learning of the GMM (most common) by [Kain et al; 98]

Underlying principle: exploit the joint statistics exhibited by aligned source & target frames

Methodology: represent the distributions of spectral feature vectors as a mixture of Q Gaussians; the transformation function is then based on an MMSE criterion

Page 20

GMM-based Spectral Transformation

1) Align N spectral feature vectors in time (discrete cepstral coefficients):

source: $X = x_1, \ldots, x_N$; target: $Y = y_1, \ldots, y_N$; joint: $Z = [X; Y]$

2) Represent the PDF of the vectors as a mixture of Q multivariate Gaussians,

$$p(z) = \sum_{q=1}^{Q} \alpha_q \, \mathcal{N}(z; \mu_q, \Sigma_q), \qquad \sum_{q=1}^{Q} \alpha_q = 1, \; 0 \le \alpha_q \le 1,$$

with parameters $\{\alpha_q, \mu_q, \Sigma_q\}_{q=1:Q}$ learned via Expectation Maximization (EM) on Z.

3) Transform source vectors using a weighted mixture of the Maximum Likelihood (ML) estimator for each component:

$$\hat{y}(x_n) = \sum_{q=1}^{Q} w_q(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left(\Sigma_q^{xx}\right)^{-1} (x_n - \mu_q^x) \right],$$

where $w_q(x_n)$ is the probability that source frame $x_n$ belongs to the acoustic class described by component q (calculated in decoding).
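
A minimal learning sketch of the joint-GMM idea, assuming scikit-learn's GaussianMixture as the EM implementation (an assumption; it is not the toolkit used in the lecture). The random data is only a stand-in for time-aligned source/target cepstral vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X, Y: time-aligned source/target cepstral vectors, shape (N, d) each (stand-ins here)
N, d, Q = 2000, 20, 8
X = np.random.randn(N, d)
Y = X @ np.random.randn(d, d) * 0.1 + np.random.randn(N, d) * 0.5

Z = np.hstack([X, Y])                        # joint vectors z_n = [x_n; y_n]
gmm = GaussianMixture(n_components=Q, covariance_type="full", max_iter=200)
gmm.fit(Z)                                   # EM on the joint space

# Split each component's mean/covariance into source/target blocks,
# as needed by the transformation function on the next slide.
mu_x = gmm.means_[:, :d]                     # mu_q^x
mu_y = gmm.means_[:, d:]                     # mu_q^y
Sigma_xx = gmm.covariances_[:, :d, :d]       # Sigma_q^xx
Sigma_yx = gmm.covariances_[:, d:, :d]       # Sigma_q^yx
```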

Page 21

GMM-Transformation Steps

1) Given source frame $x_n$, we want to estimate the target vector $y_n$.

2) Classify: calculate $w_q(x_n)$, the probability that source frame $x_n$ belongs to the acoustic class described by component q (decoding step).

3) Apply the transformation function:

$$\hat{y}(x_n) = \sum_{q=1}^{Q} w_q(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left(\Sigma_q^{xx}\right)^{-1} (x_n - \mu_q^x) \right]$$

i.e. a weighted sum of the ML estimator for each class.
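
The three steps above, written out as a hedged sketch that continues the previous code block (it reuses `X`, `gmm`, `mu_x`, `mu_y`, `Sigma_xx`, `Sigma_yx` from there):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_convert(x, weights, mu_x, mu_y, Sigma_xx, Sigma_yx):
    """Apply the GMM transformation function to one source vector x, shape (d,)."""
    Q = len(weights)
    # w_q(x): posterior probability of component q, using the source blocks only
    log_lik = np.array([np.log(weights[q]) +
                        multivariate_normal.logpdf(x, mu_x[q], Sigma_xx[q])
                        for q in range(Q)])
    w = np.exp(log_lik - log_lik.max())
    w /= w.sum()
    # Weighted sum of the per-class ML estimators
    y_hat = np.zeros_like(x)
    for q in range(Q):
        y_hat += w[q] * (mu_y[q] + Sigma_yx[q] @ np.linalg.solve(Sigma_xx[q], x - mu_x[q]))
    return y_hat

y_hat = gmm_convert(X[0], gmm.weights_, mu_x, mu_y, Sigma_xx, Sigma_yx)
```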

Page 22

Acoustically Aligning Source & Target Speech

First question: are the acoustic events from the source & target speech appropriately associated?

Time alignment = acoustic alignment ???

One-to-Many Problem: occurs when a single acoustic event of the source is aligned to multiple acoustic events of the target [Mouchtaris et al; 07]

Problem: distinct target events cannot be distinguished given only the source information

Example: the source has one single event As, while the target has two distinct events Bt and Ct, yielding the joint frames [As; Bt] and [As; Ct].

[Figure: joint frames in the acoustic space, showing the single source event As associated with both target events Bt and Ct]

Page 23

Cluster by Phoneme (“Phonetic GMM”)

Motivation: eliminate mixtures to alleviate the one-to-many problem!

Introduce contextual information: the phoneme (unit of speech describing a linguistic sound): /a/, /e/, /n/, etc.

Formulation: cluster frames according to phoneme label. Each Gaussian class q then corresponds to a phoneme, with parameters estimated from the $N_q$ frames of that phoneme:

$$\alpha_q = \frac{N_q}{N}, \qquad \mu_q = \frac{1}{N_q} \sum_{l \in q} z_l, \qquad \Sigma_q = \frac{1}{N_q} \sum_{l \in q} (z_l - \mu_q)(z_l - \mu_q)^T$$

Outcome: the error indeed decreases using classification by phoneme; specifically, errors from one-to-many mappings are reduced!
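
A short sketch of the "phonetic GMM" parameter estimation: each class is a phoneme, and its weight/mean/covariance are just the empirical statistics of the joint vectors carrying that phoneme label (no EM). The function and variable names are illustrative, not from the lecture.

```python
import numpy as np
from collections import defaultdict

def phonetic_class_stats(Z, phone_labels):
    """Z: (N, 2d) joint vectors z_l = [x_l; y_l]; phone_labels: length-N list."""
    N = len(Z)
    frames_by_phone = defaultdict(list)
    for z, ph in zip(Z, phone_labels):
        frames_by_phone[ph].append(z)
    stats = {}
    for ph, frames in frames_by_phone.items():
        Zq = np.asarray(frames)
        stats[ph] = {
            "alpha": len(Zq) / N,                            # alpha_q = N_q / N
            "mu": Zq.mean(axis=0),                           # mu_q
            "Sigma": np.cov(Zq, rowvar=False, bias=True),    # Sigma_q (1/N_q normalization)
        }
    return stats
```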

Page 24

Still GMM Limitations for VC

Unfortunately, the converted speech quality is poor. [Audio samples: original, converted]

What about the joint statistics in the GMM? What is the difference between the GMM and a VQ-type (no joint statistics) transformation?

$$\text{GMM:} \quad \hat{y}(x_n) = \sum_{q=1}^{Q} w_q(x_n) \left[ \mu_q^y + \underbrace{\Sigma_q^{yx} \left(\Sigma_q^{xx}\right)^{-1} (x_n - \mu_q^x)}_{\text{joint statistics}} \right]$$

$$\text{VQ-type:} \quad \hat{y}(x_n) = \sum_{q=1}^{Q} w_q(x_n) \, \mu_q^y$$

No significant difference! (originally shown in [Chen et al; 03])

Page 25

"Over-Smoothing" in GMM-based VC

The Over-Smoothing Problem (explains the poor quality!): spectral envelopes transformed with the GMM are "overly smooth"; the loss of spectral details makes the converted speech sound "muffled", with a "loss of presence".

Cause (goes back to those joint statistics…): low inter-speaker parameter correlation, i.e. a weak statistical link exhibited by the aligned source & target frames, so the correction term contributes little and the transform stays close to a weighted sum of class means:

$$\hat{y}(x_n) = \sum_{q=1}^{Q} w_q(x_n) \left[ \mu_q^y + \Sigma_q^{yx} \left(\Sigma_q^{xx}\right)^{-1} (x_n - \mu_q^x) \right] \;\approx\; \sum_{q=1}^{Q} w_q(x_n) \, \mu_q^y \quad (\text{class mean})$$

Frame-level alignment of source & target parameters is not effective!

Page 26

“Spectrogram” Example (GMM over-smoothing)

[Figure: spectral envelopes across a sentence, shown spectrogram-style (frame index vs. frequency, 0–6000 Hz) for the source X, the Phonetic GMM conversion, DFWA, and the target Y, illustrating the over-smoothing of the GMM output]

Page 27

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

II. Spectral Envelope Transformation in VC: Standard: Gaussian Mixture Model

Proposed: Dynamic Frequency Warping + Amplitude Scaling: related work; DFWA description

Page 28

Dynamic Frequency Warping (DFW)

Dynamic Frequency Warping [Valbret; 92]: warp the source spectral envelope in frequency to resemble that of the target

Alternative to the GMM: no transformation with joint statistics!

Maintains spectral details → higher quality speech

Spectral amplitude not adjusted explicitly → poor identity transformation

Page 29

Spectral Envelopes of a Frame: DFW

[Figure: spectral envelopes of one frame (0–8000 Hz, dB) for source X and target Y, before and after applying the DFW to X]

Page 30

Hybrid DFW-GMM Approaches


Goal: adjust the spectral amplitude after DFW

Combine DFW- and GMM-transformed envelopes

Rely on arbitrary smoothing factors [Toda; 01], [Erro; 10]

Impose a compromise between 1. maintaining spectral details (via DFW) and 2. respecting average trends (via GMM)

Page 31

Proposed Approach: DFWA


Observation: Do not need the GMM to adjust amplitude!

Alternatively, can use a simple amplitude correction term

Dynamic Frequency Warping with Amplitude Scaling (DFWA)

$$S_n^{\mathrm{dfwa}}(f) = A_q(f) \cdot S_{x_n}\!\left(W_q^{-1}(f)\right)$$

where $S_{x_n}(W_q^{-1}(f))$ contains the spectral details and $A_q(f)$ respects the average trends.
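
A minimal sketch of how the DFWA product can be applied to one frame, working in the log-spectral domain so the product becomes a sum. The sampled-grid representation and the interpolation step are assumptions made for the illustration, not details from the lecture.

```python
import numpy as np

def apply_dfwa(src_log_env, freqs, warp_fn, log_amp_corr):
    """One frame of class q: log S_dfwa(f) = log A_q(f) + log S_x(W_q^{-1}(f)).

    src_log_env:  log|S_x(f)| sampled on `freqs` (Hz)
    warp_fn:      the class warping W_q (monotonic, Hz -> Hz)
    log_amp_corr: log A_q(f) sampled on `freqs`
    """
    # S_x evaluated at W_q^{-1}(f): since W_q is monotonic, resampling the
    # source envelope from the warped grid W_q(freqs) back onto `freqs`
    # gives exactly S_x(W_q^{-1}(f)).
    warped_grid = warp_fn(freqs)
    warped_log_env = np.interp(freqs, warped_grid, src_log_env)
    return warped_log_env + log_amp_corr
```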

Page 32

Levels of Source-Target Parameter Mappings

(1) Frame-level: Standard (GMM)

(3) Class-level: Global (DFWA)


Page 33

Recall Standard Voice Conversion


Parallel corpora: source & target utter the same sentences; individual source & target frames are aligned in time

Mappings between the acoustic spaces are determined on the frame level

[Diagram: parallel corpora → extract source & target parameters → align source & target frames → LEARNING → TRANSFORMATION → converted-speech synthesis]

Page 34

DFW with Amplitude Scaling (DFWA)

Associate source & target parameters based on class statistics

Unlike typical VC, no frame alignment required in DFWA!

1. Define Acoustic Classes
2. DFW
3. Amplitude Scaling

[Diagram: corpora → extract source & target parameters → clustering on each side → DFW → amplitude scaling → converted-speech synthesis]

Page 35

Defining Acoustic Classes

Acoustic classes are built using clustering, which serves 2 purposes:

1. Classify individual source or target frames in the acoustic space
2. Associate source & target classes

Choice of clustering approach depends on the available information:

Acoustic information: with aligned frames, joint clustering (e.g. joint GMM or VQ); without aligned frames, independent clustering & association of classes (e.g. with closest means)

Contextual information: use symbolic information (e.g. phoneme labels)

Outcome: q = 1, 2, …, Q acoustic classes in a one-to-one source-target correspondence
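
One possible reading of the "without aligned frames" option, as a sketch: cluster source and target frames independently, then match classes pairwise by their means. The k-means clustering and the Hungarian assignment are assumptions; the slide only says "independent clustering & associate classes (e.g. with closest means)".

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def independent_classes(X, Y, Q=16):
    """X, Y: (N, d) source/target feature vectors (no alignment required)."""
    km_x = KMeans(n_clusters=Q, n_init=10).fit(X)
    km_y = KMeans(n_clusters=Q, n_init=10).fit(Y)
    # Distance between every source-class mean and every target-class mean
    dists = np.linalg.norm(km_x.cluster_centers_[:, None, :] -
                           km_y.cluster_centers_[None, :, :], axis=-1)
    src_idx, tgt_idx = linear_sum_assignment(dists)   # one-to-one correspondence
    return km_x, km_y, dict(zip(src_idx, tgt_idx))
```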

Page 36

DFW Estimation

Compares distributions of observed source & target spectral envelope peak frequencies (e.g. formants)

Global vision of spectral peak behavior (for each class); only peak locations (not amplitudes) are considered

DFW function: piecewise linear, with intervals defined by aligning maxima in the spectral peak frequency distributions

Dijkstra algorithm: minimize the sum of absolute differences between the target & warped source distributions

For class q, the aligned peak-frequency pairs are $(f_{q,m}^{x}, f_{q,m}^{y}),\; m = 1, \ldots, M_q$, and

$$W_q(f) = B_{q,m} f + C_{q,m}, \qquad f \in \left[ f_{q,m}^{x}, f_{q,m+1}^{x} \right],$$

$$B_{q,m} = \frac{f_{q,m+1}^{y} - f_{q,m}^{y}}{f_{q,m+1}^{x} - f_{q,m}^{x}}, \qquad C_{q,m} = f_{q,m}^{y} - B_{q,m} f_{q,m}^{x}.$$
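
Given the aligned peak pairs, the class warping function itself is just piecewise-linear interpolation. A sketch (the peak-pair selection by Dijkstra / dynamic programming over the peak distributions is not shown; the pairs and the example numbers are hypothetical):

```python
import numpy as np

def piecewise_linear_warp(src_peaks_hz, tgt_peaks_hz):
    """Build W_q from aligned peak frequencies (f^x_{q,m}, f^y_{q,m})."""
    src = np.asarray(src_peaks_hz, dtype=float)
    tgt = np.asarray(tgt_peaks_hz, dtype=float)

    def W(f):
        # np.interp is exactly W_q(f) = B_{q,m} f + C_{q,m} on [f^x_{q,m}, f^x_{q,m+1}]
        return np.interp(f, src, tgt)

    return W

W_q = piecewise_linear_warp([500, 1500, 3000], [550, 1700, 2900])
print(W_q(np.array([800.0, 2000.0])))
```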

Page 37

DFW Estimation

Clear global trends (peak locations)

Avoid sporadic frame-to-frame comparisons

Dijkstra algorithm selects pairs

DFW statistically aligns the most probable spectral events (peak locations)

Amplitude scaling then adjusts differences between the target & warped source spectral envelopes

[Figure: peak occurrence distributions vs. frequency (0–8000 Hz) for the source, the target, and the warped source]

Page 38

Spectral Envelopes of a Frame

[Figure: spectral envelopes of a frame (0–8000 Hz, dB) for source X and target Y, together with the DFW-warped source]

Page 39

Amplitude Scaling

Goal of amplitude scaling: adapt the frequency-warped source envelopes to better resemble those of the target

$A_q(f)$ is estimated using statistics of the target and warped-source data. For class q, the amplitude scaling function is defined by

$$\log\!\left(A_q(f)\right) = \log\!\left(\bar{S}_q^{\,y}(f)\right) - \log\!\left(\bar{S}_q^{\,x}\!\left(W_q^{-1}(f)\right)\right)$$

- Comparison between source & target data strictly on the acoustic class level
- Difference in average log spectra
- No arbitrary weighting or smoothing!
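
A sketch of the estimation of this class-level correction term; the array layout (log envelopes sampled on a common frequency grid) is an assumption for the illustration.

```python
import numpy as np

def estimate_log_amp_scaling(warped_src_log_envs, tgt_log_envs):
    """log A_q(f) = mean log target envelope - mean log warped-source envelope.

    warped_src_log_envs, tgt_log_envs: arrays of shape (n_frames, n_bins) holding
    the log envelopes of the class's warped source frames and target frames
    (the two sets need not be time-aligned)."""
    return tgt_log_envs.mean(axis=0) - warped_src_log_envs.mean(axis=0)
```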

Page 40

DFWA Transformation

Transformation (for source frame n in class q):

$$S_n^{\mathrm{dfwa}}(f) = A_q(f) \cdot S_{x_n}\!\left(W_q^{-1}(f)\right), \qquad \log\!\left(A_q(f)\right) = \log\!\left(\bar{S}_q^{\,y}(f)\right) - \log\!\left(\bar{S}_q^{\,x}\!\left(W_q^{-1}(f)\right)\right)$$

where $S_{x_n}(W_q^{-1}(f))$ contains the spectral details and $A_q(f)$ respects the average trends.

In the discrete cepstral domain (log-spectrum parameters):

$$\hat{y}_n^{\mathrm{dfwa}} = \mu_q^y + \tilde{c}_n - \tilde{\mu}_q$$

where $\tilde{c}_n$ are the cepstral coefficients for $S_{x_n}(W_q^{-1}(f))$ and $\tilde{\mu}_q$ is their mean over class q, representing $\log(\bar{S}_q^{\,x}(W_q^{-1}(f)))$.

Average target envelope respected ("unbiased estimator") → captures timbre!

Spectral details maintained (difference between the frame realization & the class average) → ensures quality!
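
In the cepstral domain the whole DFWA update collapses to one line; a sketch (names are illustrative):

```python
import numpy as np

def dfwa_convert_cepstra(c_warped_frame, c_warped_class_mean, c_target_class_mean):
    """y_hat = mu_q^y + (c_tilde_n - mu_tilde_q): the class-mean target cepstrum
    supplies the average timbre, and this frame's deviation from the warped-source
    class mean supplies the spectral details."""
    return c_target_class_mean + (c_warped_frame - c_warped_class_mean)
```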

Page 41

Spectral Envelopes of a Frame

[Figure: spectral envelopes of a frame (0–8000 Hz, dB) comparing source X, target Y, DFW, DFWA, and the Phonetic GMM result]

Page 42

Spectral Envelopes across a Sentence

[Figure: spectral envelopes across a sentence (frame index vs. frequency, 0–6000 Hz) for source X, Phonetic GMM, DFWA, and target Y, with a zoomed region]

Page 43

Spectral Envelopes across a Sound

[Figure: zoomed spectral envelopes across one sound (frames 265–285, 0–3500 Hz) for source X, Phonetic GMM, DFWA, and target Y]

DFWA looks like natural speech: the warping can be seen appropriately shifting source events, and overall DFWA more closely resembles the target

Important to note: examining the spectral envelope evolution in time is very informative!

The poor quality of the GMM can be seen right away (it is less evident within a single frame)

Page 44

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

II. Spectral Envelope Transformation in VC: Standard: Gaussian Mixture Model; Proposed: Dynamic Frequency Warping + Amplitude Scaling

III. Conversion Results: Objective Metrics & Subjective Evaluations; Sound Samples

Page 45

Formal Evaluations

Speech Corpora & Analysis: CMU ARCTIC, US English speakers (2 male, 2 female); parallel annotated corpora, speech sampled at 16 kHz; 200 sentences for learning, 100 sentences for testing; speech analysis & synthesis with a Harmonic model; all spectral envelopes as discrete cepstral coefficients (order 40)

Spectral envelope transformation methods: Phonetic & Traditional GMMs (diagonal covariance matrices); DFWA (classification by phoneme); DFWE (Hybrid DFW-GMM, E = energy correction); source (no transformation)

Page 46

Objective Metrics for Evaluation

Mean Squared Error (standard) alone is not sufficient! It does not adequately indicate converted speech quality, and it does not indicate the variance in the transformed data.

Variance Ratio (VR): the average ratio of the transformed-to-target data variance; a more global indication of behavior.

Is the transformed data varying from the mean? Does the transformed data behave like the target data?
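
A sketch of both metrics on time-aligned converted/target cepstral vectors. The exact MSE normalization and dB scaling used in the lecture are not specified, so the error-to-target-energy ratio below is an assumption; the VR follows the slide's definition directly.

```python
import numpy as np

def mse_db(converted, target):
    """Normalized mean squared error in dB between (N, d) converted and target
    cepstral vectors. The normalization/dB scaling is an assumed convention."""
    err = np.mean(np.sum((converted - target) ** 2, axis=1) /
                  np.sum(target ** 2, axis=1))
    return 10.0 * np.log10(err)

def variance_ratio(converted, target):
    """Average over dimensions of the transformed-to-target variance.
    VR near 0 signals over-smoothing; VR near 1 means the converted data
    varies about as much as the target data."""
    return np.mean(converted.var(axis=0) / target.var(axis=0))
```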

Page 47

Objective Results

GMM-based methods yield the lowest MSE but no variance, confirming over-smoothing

DFWA yields a higher MSE but more natural variations in the speech; the VR of DFWA comes directly from variations in the warped source envelopes (not model adaptations!)

Hybrid (DFWE) falls in between DFWA & GMM (as expected); compared to DFWA, the VR is notably decreased after the energy correction filter

Different indications from MSE & VR: both important, complementary

Page 48

DFWA for VC with Nonparallel Corpora

Parallel corpora are a big constraint! They limit the data that can be used in VC.

DFWA for nonparallel corpora; nonparallel learning data: 200 disjoint sentences

DFWA is equivalent for parallel or nonparallel corpora!

              MSE (dB)   VR
Parallel      -7.74      0.930
Nonparallel   -7.71      0.936

Page 49

Implementation in a VC System

Converted speech synthesis: pitch modification using TD-PSOLA (Time-Domain Pitch Synchronous Overlap-Add), with the pitch modification factor determined by the classic linear transformation

$$\hat{f}_0^{\,y} = \mu_{f_0}^y + \frac{\sigma_{f_0}^y}{\sigma_{f_0}^x} \left( f_0^x - \mu_{f_0}^x \right)$$

Frame synthesis: voiced frames take harmonic amplitudes from the transformed spectral envelopes and harmonic phases from nearest-neighbor source sampling; unvoiced frames are white noise through an AR filter

[Diagram: source speech $s_x(t)$ → Speech Analysis (Harmonic) → Spectral Envelope Transformation → Speech Frame Synthesis → TD-PSOLA (pitch modification) → converted speech $\hat{s}_y(t)$]
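
A sketch of the linear f0 transformation used to set the TD-PSOLA pitch-modification factor. Whether the statistics are taken on f0 or log-f0 is not stated on the slide; plain f0 and the example numbers below are assumptions.

```python
import numpy as np

def convert_f0(f0_source, src_stats, tgt_stats):
    """Match the target's f0 mean and standard deviation.

    src_stats, tgt_stats: (mean, std) of voiced f0 over each speaker's corpus.
    Returns the converted f0 and the per-frame PSOLA modification factor."""
    mu_x, sigma_x = src_stats
    mu_y, sigma_y = tgt_stats
    f0_converted = mu_y + (sigma_y / sigma_x) * (f0_source - mu_x)
    return f0_converted, f0_converted / f0_source

# Hypothetical numbers: male source (~120 Hz) to female target (~210 Hz)
print(convert_f0(np.array([110.0, 130.0]), (120.0, 20.0), (210.0, 30.0)))
```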

Page 50

Subjective Testing

Listeners evaluate:

1. Quality of the speech
2. Similarity of the voice to the target speaker

Mean Opinion Score (MOS) on a scale from 1 to 5

Page 51

Subjective Test Results

No observable differences between acoustic decoding & phonetic classification

Converted speech quality consistently higher for DFWA than GMM-based

Equal (if not slightly more) success for DFWA in capturing identity

No compromise between quality & capturing identity for DFWA!

DFWA yields better quality than the other methods; ~equal success for similarity

Page 52

Conversion Examples

[Sound samples: Source, Target, GMM, DFWA, and DFWE for the speaker pairs slt → clb (FF) and bdl → clb (MF)]

DFWA is consistently the highest quality; GMM-based conversions suffer a "loss of presence"

Key observation: DFWA can deliver high-quality VC!

Target analysis-synthesis with converted spectral envelopes

Page 53

VC Perspectives

Conversion within gender gives better results (e.g. FF): the speakers share similar voice characteristics

For inter-gender conversion, the spectral envelope is not enough: large pitch modification factors degrade the speech, and prosody is important

Need to address the phase spectrum & glottal source too…

Page 54

Today’s Lecture: Voice Conversion


I. Introduction to Voice Conversion: Speech Synthesis Context (TTS); Overview of Voice Conversion

II. Spectral Envelope Transformation in VC: Standard: Gaussian Mixture Model; Proposed: Dynamic Frequency Warping + Amplitude Scaling

III. Conversion Results: Objective Metrics & Subjective Evaluations; Sound Samples

IV. Summary & Conclusions

Page 55

Summary

Text-to-Speech (TTS): 1. Concatenative (unit-selection); 2. Parametric (HMM-based speaker models)

Acoustic parameters in speech: 1. Segmental (pitch, spectral envelope); 2. Supra-segmental (prosody, intonation)

Voice Conversion: modify speech to transform the speaker identity; can create new voices! Focus: the spectral envelope, related to timbre

Page 56

Summary (continued): Spectral Envelope Transformation

1. Spectral envelope parameterization (cepstral, LPC, …)
2. Level of source-target acoustic mappings
3. Method for learning & transformation

1. Standard VC Approach: parallel corpora & source-target frame alignment; GMM-based transformation

2. Proposed VC Approach: source-target acoustic mapping on the class level; Dynamic Frequency Warping + Amplitude Scaling (DFWA)

Page 57

Important Considerations: Spectral Envelope Transformation

1. Speaker identity: need to capture the average speaker timbre
2. Speech quality: need to maintain spectral details

Source-Target Acoustic Mappings:
1. Contextual information (phoneme) is helpful
2. Mapping features on a more global class level is effective

Better conversion quality & more flexible (no parallel corpora needed); frame-level alignment/mappings are too narrow & restrictive

Speaker characteristics (prosody, voice quality): the success of Voice Conversion can depend on the speakers; many features of speech remain to consider & treat…

Page 58

Thank You!

Questions?