MBE Vocoders

Nima Moghadam, Saeed Nari
Supervisor: Dr. Saameti
April 2005, Sharif University of Technology


Outline

Introduction to vocoders
MBE vocoder
– MBE parameters
– Parameter estimation
– Analysis and synthesis algorithm
AMBE
IMBE

Vocoders - Analyzer

1. The speech is first segmented using a window (e.g. a Hamming window).
2. Excitation and system parameters are calculated for each segment:
   – Excitation parameters: voiced/unvoiced decision, pitch period
   – System parameters: spectral envelope / system impulse response
3. These parameters are transmitted.
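The segmentation step can be illustrated with a minimal sketch. The frame length, hop size, sampling rate, and test tone below are illustrative assumptions, not values from the slides.

```python
import math

def hamming(n_points):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def segment(signal, frame_len, hop):
    """Split the signal into overlapping frames and window each frame."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([w * s for w, s in zip(window, frame)])
    return frames

# Illustrative input: one second of a 200 Hz tone at an assumed 8 kHz rate.
fs = 8000
speech = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
frames = segment(speech, frame_len=256, hop=128)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Each windowed frame would then be passed to the excitation and system parameter estimators.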

Vocoders - Synthesizer

[Figure: synthesizer block diagram. The excitation signal (white noise for unvoiced speech, a pulse train for voiced speech) drives a system set by the system parameters to produce the synthesized voice.]

Vocoders

But vocoders usually have poor quality, because of:
– Fundamental limitations in the speech models
– Inaccurate parameter estimation
– The inability of a pulse train or white noise alone to reproduce all speech sounds
   • Speech synthesized entirely with a periodic source exhibits a “buzzy” quality, and speech synthesized entirely with a noise source exhibits a “hoarse” quality.

A potential solution to the buzziness of vocoders is to use mixed excitation models.

In these vocoders, periodic and noise-like excitations are mixed with a calculated ratio, and this ratio is sent along with the other parameters.
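A minimal sketch of such a mixed excitation follows. The linear weighting with a single `voicing_ratio` is an assumption made for illustration; the slides do not state how the ratio is applied.

```python
import random

def pulse_train(n_samples, pitch_period):
    """Unit impulses every pitch_period samples (periodic excitation)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, rng):
    """Gaussian white noise (noise-like excitation)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def mixed_excitation(n_samples, pitch_period, voicing_ratio, seed=0):
    """Assumed mixing rule: ratio 1.0 is a pure pulse train, 0.0 is pure noise."""
    rng = random.Random(seed)
    periodic = pulse_train(n_samples, pitch_period)
    noise = white_noise(n_samples, rng)
    return [voicing_ratio * p + (1.0 - voicing_ratio) * u
            for p, u in zip(periodic, noise)]

excitation = mixed_excitation(n_samples=160, pitch_period=40, voicing_ratio=0.7)
```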

Multi Band Excitation Speech Model

Because a speech signal is non-stationary, a window $w(n)$ is usually applied to the signal $s(n)$:

$$s_w(n) = w(n)\,s(n)$$

The Fourier transform $S_w(\omega)$ of a windowed segment can be modeled as the product of a spectral envelope $H_w(\omega)$ and an excitation spectrum $|E_w(\omega)|$:

$$\hat{S}_w(\omega) = H_w(\omega)\,|E_w(\omega)|$$

In most models, $H_w(\omega)$ is a smoothed version of the original speech spectrum $S_w(\omega)$.

MBE model (Cont’d)

The spectral envelope must be represented accurately enough to prevent degradations in the spectral envelope from dominating the quality improvements achieved by the addition of a frequency-dependent voiced/unvoiced mixture function.

In previous simple models, the excitation spectrum is totally specified by the fundamental frequency $\omega_0$ and a single voiced/unvoiced decision for the entire spectrum.

In the MBE model, the excitation spectrum is specified by the fundamental frequency $\omega_0$ and a frequency-dependent voiced/unvoiced mixture function.

Multi Banding

In general, a continuously varying frequency-dependent voiced/unvoiced mixture function would require a large number of parameters to represent it accurately. Adding that many parameters would severely decrease the utility of the model in applications such as bit-rate reduction.

To avoid this, the mixture function is restricted to a binary voiced/unvoiced decision at each frequency. To further reduce the number of these binary parameters, the spectrum is divided into multiple frequency bands and one binary voiced/unvoiced parameter is allocated to each band.

The MBE model differs from previous models in that the spectrum is divided into a large number of frequency bands (typically 20 or more), whereas previous models used at most three frequency bands.
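The saving from banding can be seen in a small sketch that maps each harmonic to a band and looks up the band's binary decision. Equal-width bands and an integer pitch period are assumptions chosen for simplicity; the slides do not specify the band layout.

```python
def harmonic_bands(pitch_period, n_bands):
    """Band index for each harmonic m = 1 .. pitch_period // 2.

    Harmonic m sits at normalized frequency m / pitch_period (cycles/sample);
    the range 0 .. 0.5 is split into n_bands equal-width bands (assumed layout).
    """
    n_harmonics = pitch_period // 2
    return [min((2 * m * n_bands) // pitch_period, n_bands - 1)
            for m in range(1, n_harmonics + 1)]

def harmonic_voicing(pitch_period, band_decisions):
    """Expand one V/UV bit per band into one V/UV flag per harmonic."""
    bands = harmonic_bands(pitch_period, len(band_decisions))
    return [band_decisions[b] for b in bands]

# 20 harmonics are described by only 4 transmitted bits.
flags = harmonic_voicing(40, [True, True, False, False])
print(len(flags), "harmonics,", sum(flags), "voiced")
```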

Multi Banding

[Figure: spectra illustrating the MBE model. Panels: original spectrum, spectral envelope, periodic spectrum, V/UV information, noise spectrum, excitation spectrum, synthetic spectrum.]

MBE Parameters

The parameters used in the MBE model are:
1. the spectral envelope
2. the fundamental frequency
3. the V/UV information for each harmonic
4. the phase of each harmonic declared voiced

The phases of harmonics in frequency bands declared unvoiced are not included, since they are not required by the synthesis algorithm.

Parameter Estimation

In many approaches (LPC based algorithms) the algorithms for estimation of excitation parameters and estimation of spectral envelope parameters operate independently.

These parameters are usually estimated using heuristic criteria, without explicit consideration of how close the synthesized speech will be to the original speech.

– This can result in a synthetic spectrum quite different from the original spectrum.

In MBE, the excitation and spectral envelope parameters are estimated simultaneously, so that the synthesized spectrum is closest, in the least-squares sense, to the spectrum of the original speech (an “analysis by synthesis” approach).

Parameter Estimation (Cont’d)

The estimation process is divided into two major steps:

1. In the first step, the pitch period and spectral envelope parameters are estimated to minimize the error between the original spectrum and the synthetic spectrum.

2. Then, the V/UV decisions are made based on the closeness of fit between the original and the synthetic spectrum at each harmonic of the estimated fundamental.

Parameter Estimation (Cont’d)

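The parameters are estimated by minimizing the following error criterion:

$$\varepsilon = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left|S_w(\omega) - \hat{S}_w(\omega)\right|^{2} d\omega$$

– where

$$\hat{S}_w(\omega) = H_w(\omega)\,|E_w(\omega)|$$

The error in an interval $[a_m, b_m]$ around the $m$th harmonic, where the envelope is represented by a single complex amplitude $A_m$,

$$\varepsilon_m = \frac{1}{2\pi}\int_{a_m}^{b_m}\left|S_w(\omega) - A_m E_w(\omega)\right|^{2} d\omega$$

is minimized at:

$$A_m = \frac{\int_{a_m}^{b_m} S_w(\omega)\,E_w^{*}(\omega)\,d\omega}{\int_{a_m}^{b_m} |E_w(\omega)|^{2}\,d\omega}$$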
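On sampled spectra the integrals become sums over the frequency bins of one interval. The sketch below is a discrete version of the least-squares amplitude fit under that assumption; the band shape `E` and the amplitude `true_A` are made-up test values.

```python
def fit_band(S, E):
    """Least-squares complex amplitude A minimizing sum |S[k] - A * E[k]|^2.

    S: samples of the speech spectrum in one harmonic interval.
    E: samples of the excitation spectrum in the same interval.
    Returns (A, residual error).
    """
    num = sum(s * e.conjugate() for s, e in zip(S, E))
    den = sum(abs(e) ** 2 for e in E)
    A = num / den
    err = sum(abs(s - A * e) ** 2 for s, e in zip(S, E))
    return A, err

# Hypothetical band: a window-transform shape scaled by a known amplitude.
E = [0.2 + 0j, 0.7 + 0j, 1.0 + 0j, 0.7 + 0j, 0.2 + 0j]
true_A = 3.0 - 4.0j
S = [true_A * e for e in E]
A, err = fit_band(S, E)
print(A, err)
```

With `E` set to 1 in every bin, the same formula returns the plain average of `S` over the interval.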

Pitch Estimation and Spectral Envelope

An efficient method for obtaining a good approximation of the periodic transform $P(\omega)$ in this interval is to precompute samples of the Fourier transform of the window $w(n)$ and center them around the harmonic frequency associated with this interval.

For unvoiced frequency intervals, the envelope parameters are estimated by substituting idealized white noise (unity across the band) for $|E_w(\omega)|$ in the previous formulas, which reduces to averaging the original spectrum in each frequency interval.

For unvoiced regions, only the magnitude of $A_m$ is estimated, since the phase of $A_m$ is not required for speech synthesis.

More about pitch estimation

Experimentally, the error $\varepsilon$ tends to vary slowly with the pitch period $P$.

The initial estimate can therefore be obtained by evaluating the error for integer pitch periods.

Since integer multiples of the correct pitch period have spectra with harmonics at the correct frequencies, the error $\varepsilon$ will be comparable for the correct pitch period and its integer multiples.

More about pitch estimation (Cont’d)

[Figure: pitch estimation example. Panels: speech segment; original spectrum; error versus pitch period; original and synthetic spectra for P = 42.48; original and synthetic spectra for P = 42.]

V/UV Decision

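The voiced/unvoiced decision for each harmonic is made by comparing the normalized error over each harmonic of the estimated fundamental to a threshold.

When the normalized error over the $m$th harmonic is below the threshold, the region around that harmonic is marked as voiced; otherwise it is marked as unvoiced.

$$\xi_m = \frac{\varepsilon_m}{\frac{1}{2\pi}\int_{a_m}^{b_m} |S_w(\omega)|^{2}\,d\omega}$$

where $\varepsilon_m$ is the fitting error over the interval $[a_m, b_m]$.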
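A minimal sketch of this decision on sampled spectra follows: fit the harmonic shape, normalize the residual by the band energy, and compare with a threshold. The threshold value 0.2 and both test bands are illustrative assumptions; the slides give no numeric threshold.

```python
def band_is_voiced(S, E, threshold=0.2):
    """Return True when the normalized fitting error is below the threshold.

    S: speech spectrum samples in one harmonic interval (assumed non-zero energy).
    E: harmonic (periodic) excitation shape in the same interval.
    """
    num = sum(s * e.conjugate() for s, e in zip(S, E))
    den = sum(abs(e) ** 2 for e in E)
    A = num / den                                   # least-squares amplitude
    err = sum(abs(s - A * e) ** 2 for s, e in zip(S, E))
    energy = sum(abs(s) ** 2 for s in S)
    return (err / energy) < threshold

shape = [0.2 + 0j, 0.7 + 0j, 1.0 + 0j, 0.7 + 0j, 0.2 + 0j]
harmonic_like = [2.0 * e for e in shape]            # matches the harmonic shape
noise_like = [1 + 0j, -1 + 0j, 1 + 0j, -1 + 0j, 1 + 0j]  # poor match

print(band_is_voiced(harmonic_like, shape))
print(band_is_voiced(noise_like, shape))
```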

Analysis Algorithm Flowchart

[Flowchart, from Start to Stop:]
1. Window the speech segment
2. Compute error vs. pitch period (autocorrelation approach)
3. Select the initial pitch period (dynamic programming pitch tracker)
4. Refine the initial pitch period (frequency domain approach)
5. Make the V/UV decision for each frequency band
6. Select voiced or unvoiced spectral envelope parameters for each frequency band

Speech Synthesis

The voiced signal can be synthesized as the sum of sinusoidal oscillators with frequencies at the harmonics of the fundamental and amplitudes set by the spectral envelope parameters (the time-domain method).

The unvoiced signal can be synthesized as the sum of bandpass-filtered white noise signals.

The frequency-domain method was selected for synthesizing the unvoiced portion of the synthetic speech.

Synthesis algorithm block diagram

[Figure: synthesis block diagram. The envelope samples are separated, using the V/UV decisions, into voiced and unvoiced envelope samples. The voiced envelope samples drive a bank of harmonic oscillators, which produces the voiced speech. A white noise sequence passes through an STFT; its envelope is replaced by the linearly interpolated unvoiced envelope samples, and weighted overlap-add produces the unvoiced speech.]

MBE Synthesis algorithm

First, the spectral envelope samples are separated into voiced and unvoiced spectral envelope samples, depending on whether they lie in frequency bands declared voiced or unvoiced.

Voiced envelope samples include both magnitude and phase, whereas unvoiced envelope samples include only the magnitude.

Voiced speech is synthesized from the voiced envelope samples by summing the outputs of a bank of sinusoidal oscillators running at the harmonics of the fundamental frequency:

$$\hat{s}_v(t) = \sum_{m} A_m(t)\cos\big(\theta_m(t)\big)$$
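The oscillator bank can be sketched for a single frame as below. For simplicity this sketch assumes constant amplitudes and a constant fundamental within the frame, so the phase of harmonic m is m·w0·n plus an initial phase; the amplitude and phase values are illustrative.

```python
import math

def synthesize_voiced(amplitudes, phases, w0, n_samples):
    """Sum of harmonic oscillators; entry i belongs to harmonic m = i + 1.

    w0 is the fundamental in radians per sample. Harmonics that fall in
    unvoiced bands can be given amplitude 0 so they contribute nothing.
    """
    out = []
    for n in range(n_samples):
        sample = 0.0
        for i, (a, phi) in enumerate(zip(amplitudes, phases)):
            m = i + 1
            sample += a * math.cos(m * w0 * n + phi)
        out.append(sample)
    return out

w0 = 2 * math.pi / 40          # pitch period of 40 samples (assumed)
voiced = synthesize_voiced([1.0, 0.5], [0.0, 0.0], w0, n_samples=80)
```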

MBE Synthesis algorithm (Voiced)

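The phase function $\theta_m(t)$ is determined by an initial phase $\phi_0$ and a frequency track $\omega_m(t)$ as follows:

$$\theta_m(t) = \int_0^t \omega_m(\xi)\,d\xi + \phi_0$$

The frequency track $\omega_m(t)$ is linearly interpolated between the $m$th harmonic of the current frame and that of the next frame by:

$$\omega_m(t) = m\,\omega_0(0)\,\frac{S - t}{S} + m\,\omega_0(S)\,\frac{t}{S} + \Delta\omega_m$$

where $t = 0$ and $t = S$ mark the current and the next frame, and $\Delta\omega_m$ is a small frequency offset for the $m$th harmonic.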
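Because the frequency track is linear in $t$, the phase integral has a closed form. The short derivation below is a sketch that assumes the track has the form $\omega_m(t) = m\,\omega_0(0)(S-t)/S + m\,\omega_0(S)\,t/S + \Delta\omega_m$, with $S$ the frame shift and $\Delta\omega_m$ a constant offset; these symbol readings are assumptions recovered from the slide's formula.

```latex
\begin{align*}
\theta_m(t) &= \phi_0 + \int_0^t \Big[ m\,\omega_0(0)\,\tfrac{S-\xi}{S}
               + m\,\omega_0(S)\,\tfrac{\xi}{S} + \Delta\omega_m \Big]\, d\xi \\
            &= \phi_0 + \big(m\,\omega_0(0) + \Delta\omega_m\big)\,t
               + m\,\big(\omega_0(S) - \omega_0(0)\big)\,\frac{t^2}{2S} \\
\theta_m(S) &= \phi_0 + m\,\frac{\omega_0(0) + \omega_0(S)}{2}\,S + \Delta\omega_m\,S
\end{align*}
```

The last line shows that the phase advance over one frame depends only on the average of the two fundamental frequencies plus the offset term.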

MBE Synthesis algorithm (Unvoiced)

Unvoiced speech is synthesized from the unvoiced envelope samples by first synthesizing a white noise sequence.

For each frame, the white noise sequence is windowed and an FFT is applied to produce samples of the Fourier transform.

In each unvoiced frequency band, the noise transform samples are normalized to have unity magnitude. The unvoiced spectral envelope is constructed by linearly interpolating between the envelope samples $|A_m(t)|$.

The normalized noise transform is multiplied by the spectral envelope to produce the synthetic transform. The synthetic transforms are then used to synthesize unvoiced speech using the weighted overlap-add method.
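The per-frame spectral shaping can be sketched as follows. To stay self-contained the sketch uses a naive DFT instead of an FFT, treats the whole frame as unvoiced, and stops before the weighted overlap-add across frames; the frame size and envelope samples are illustrative assumptions.

```python
import cmath
import math
import random

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def interpolate_envelope(samples, n_bins):
    """Linearly interpolate envelope samples onto n_bins evenly spaced bins."""
    out = []
    for k in range(n_bins):
        pos = k * (len(samples) - 1) / (n_bins - 1)
        i = min(int(pos), len(samples) - 2)
        frac = pos - i
        out.append((1 - frac) * samples[i] + frac * samples[i + 1])
    return out

def shape_noise_frame(envelope, seed=0):
    """envelope[k] is the target magnitude for bins k = 0 .. N/2."""
    N = 2 * (len(envelope) - 1)
    rng = random.Random(seed)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    noise = [w * rng.gauss(0.0, 1.0) for w in window]      # windowed white noise
    U = dft(noise)
    shaped = [0j] * N
    for k in range(N // 2 + 1):
        mag = abs(U[k])
        unit = U[k] / mag if mag > 0 else 1.0 + 0j         # unity magnitude
        shaped[k] = envelope[k] * unit                      # impose the envelope
        if 0 < k < N // 2:
            shaped[N - k] = shaped[k].conjugate()           # keep the frame real
    frame = [v.real for v in idft(shaped)]
    return frame, shaped

env = interpolate_envelope([1.0, 0.5, 0.25], 17)
frame, shaped = shape_noise_frame(env)
```

In a full synthesizer, bins belonging to voiced bands would be set to zero before the inverse transform, and successive frames would be combined by weighted overlap-add.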

MBE Synthesis (Cont’d)

The final synthesized speech is generated by summing the voiced and unvoiced synthesized speech signals.

[Figure: the voiced speech and the unvoiced speech are added to give the synthesized speech.]

Bit Allocation

Parameter               Bits
Fundamental frequency   9
Harmonic magnitudes     139-94
Harmonic phases         0-45
Voiced/unvoiced bits    12
Total                   160
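The two ends of the ranges in the table can be checked with a few lines of arithmetic. Reading the ranges as a trade-off, where bits not needed for phases go to magnitudes, is an interpretation of the table rather than something the slide states.

```python
# Bits per frame at the two extremes of the table (interpretation: magnitude
# and phase bits trade off against each other within a fixed budget).
no_phase_bits = {"fundamental": 9, "magnitudes": 139, "phases": 0, "vuv": 12}
max_phase_bits = {"fundamental": 9, "magnitudes": 94, "phases": 45, "vuv": 12}

for name, alloc in (("no phases", no_phase_bits), ("max phases", max_phase_bits)):
    shared = alloc["magnitudes"] + alloc["phases"]
    print(name, "-> magnitude + phase bits:", shared, "total:", sum(alloc.values()))
```

Both extremes give 139 bits for magnitudes plus phases and 160 bits in total, which matches the table's total row.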

Advanced MBE (AMBE)

MBE coding rate: 2400 bps
AMBE coding rate: 1200/2400 bps
Four new features:

1. Enhanced V/UV decision

2. Initial pitch detection

3. Refined pitch determination

4. Dual rate coding

Enhanced V/UV decision

The whole speech frequency band is divided into 4 subbands for the 2.4 kbps vocoder and 2 subbands for the 1.2 kbps vocoder.

That is, only 4 bits and 2 bits are used to encode the V/UV decisions of the 2.4 kbps and 1.2 kbps vocoders respectively.

Initial pitch detection

MBE takes 2 steps to detect the refined initial pitch period:
– A spectrum matching technique to find the initial pitch period
– A DTW-based (Dynamic Time Warping) technique to smooth the estimate

The computational complexity of this approach is very high.

In AMBE, a modified three-level center-clipped autocorrelation method is used to detect the initial pitch period, and a simple smoothing method is used to correct pitch errors.
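A plain three-level center-clipped autocorrelation can be sketched as below. This is the textbook form, not the modified variant the slide refers to, whose details are not given; the clipping ratio, lag range, and test signal are assumptions.

```python
import math

def three_level_clip(x, ratio=0.6):
    """Map samples to +1, 0, or -1 using a clip level of ratio * peak."""
    c = ratio * max(abs(v) for v in x)
    return [1 if v > c else -1 if v < -c else 0 for v in x]

def estimate_pitch_period(x, min_lag, max_lag, clip_ratio=0.6):
    """Lag (in samples) that maximizes the autocorrelation of the clipped signal."""
    y = three_level_clip(x, clip_ratio)
    best_lag, best_score = min_lag, -1
    for lag in range(min_lag, max_lag + 1):
        # With values in {-1, 0, 1} each product needs no real multiplication.
        score = sum(y[n] * y[n + lag] for n in range(len(y) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Test signal with a period of 50 samples (fundamental plus second harmonic).
x = [math.sin(2 * math.pi * n / 50) + 0.5 * math.sin(4 * math.pi * n / 50)
     for n in range(400)]
print(estimate_pitch_period(x, min_lag=20, max_lag=100))
```

Because the clipped signal takes only three values, the correlation reduces to counting agreements and disagreements, which is where the saving in computation comes from.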

Refined pitch determination

To find the best pitch, the basic method is to assume a pitch period and compute the error between the original speech spectrum and the shaped voiced speech spectrum for that pitch.

The candidate pitch with the minimum spectrum error is chosen as the final pitch.

To reduce the computational complexity, AMBE uses a 256-point FFT to get the speech spectrum, and a 5-point window spectrum is used to form the voiced harmonic spectrum.

To get the refined pitch, AMBE performs the spectrum matching process seven times. Each time, AMBE first sets a candidate pitch, then shapes a harmonic spectrum over the whole frequency band according to the candidate pitch and the window spectrum, and calculates an error by subtracting the shaped spectrum from the speech spectrum. After the seven matching passes, the refined pitch is the candidate with the smallest error.

Dual rate coding

Parameter                2400 bps   1200 bps
Pitch quantization       8          6
V/UV decision            4          2
Amplitude quantization   41         19
Total                    53         27

Improved MBE (IMBE)

A 2400 bps coder based on MBE
Substantially better than the U.S. government standard LPC-10e
The parameters of the MBE speech model:
– the fundamental frequency
– voiced/unvoiced information
– the spectral envelope

IMBE algorithm

The algorithm estimates the excitation and system parameters that minimize the distance between the original and synthetic speech spectra (analysis by synthesis).

Once these parameters are estimated, voiced/unvoiced decisions are made by comparing the spectral error over a series of harmonics to a prescribed threshold.

IMBE block diagram

[Figure: IMBE algorithm block diagram]

IMBE Coding

– IMBE is offered at 2.4, 4.8 and 8.0 kbps.
– The analysis and synthesis routines are the same at every rate; only the bit allocation differs.
– The fundamental frequency needs an accuracy of about 1 Hz, which requires about 9 bits per frame.
– The V/UV decisions are encoded with one bit per decision.
– The remaining bits are allocated to error control and the spectral envelope information.

Any Questions?

Thanks!