Speech Signal Coding Using Various Techniques (Kodiranje Govornog Signala Raznim Tehnikama)



Digital Speech Processing: Lecture 17

Speech Coding Methods Based on Speech Models


Waveform Coding versus Block Processing

Waveform coding
  sample-by-sample matching of waveforms
  coding quality measured using SNR

Source modeling (block processing)
  block processing of signal => vector of outputs every block
  overlapped blocks

(figure: overlapping analysis blocks 1, 2, and 3)


Model-Based Speech Coding

We've carried waveform coding based on optimizing and maximizing SNR about as far as possible:
  achieved bit rate reductions on the order of 4:1 (i.e., from 128 kbps PCM to 32 kbps ADPCM) while at the same time achieving toll quality SNR for telephone-bandwidth speech

To lower the bit rate further without reducing speech quality, we need to exploit features of the speech production model, including:
  source modeling
  spectrum modeling
  use of codebook methods for coding efficiency

We also need a new way of comparing performance of different waveform and model-based coding methods:
  an objective measure, like SNR, isn't an appropriate measure for model-based coders since they operate on blocks of speech and don't follow the waveform on a sample-by-sample basis
  new subjective measures need to be used that measure user-perceived quality, intelligibility, and robustness to multiple factors


Topics Covered in this Lecture

Enhancements for ADPCM Coders
  pitch prediction
  noise shaping

Analysis-by-Synthesis Speech Coders
  multipulse linear prediction coder (MPLPC)
  code-excited linear prediction (CELP)

Open-Loop Speech Coders
  two-state excitation model
  LPC vocoder
  residual-excited linear predictive coder
  mixed excitation systems

Speech coding quality measures (MOS)

Speech coding standards


Differential Quantization

(figure: DPCM encoder and decoder block diagram with signals x[n], d[n], d̂[n], c[n], x̃[n], and x̂[n])

P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}

P: simple predictor of the
vocal tract response
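As a rough illustration of the differential quantization loop above, the sketch below quantizes the prediction difference d[n] with a fixed uniform quantizer and reconstructs x̂[n] = x̃[n] + d̂[n]; the predictor coefficients, step size, and helper name (dpcm_encode_decode) are hypothetical stand-ins, not the adaptive quantizer/predictor of a real ADPCM coder.

```python
import numpy as np

def dpcm_encode_decode(x, alpha, step):
    """Toy DPCM loop: predict x~[n] from past reconstructed samples,
    quantize the difference d[n], and reconstruct x^[n] = x~[n] + d^[n]."""
    p = len(alpha)
    x_hat = np.zeros(len(x) + p)              # reconstructed signal, with p samples of history
    codes = np.zeros(len(x), dtype=int)
    for n in range(len(x)):
        past = x_hat[n + p - 1::-1][:p]       # x^[n-1], x^[n-2], ..., x^[n-p]
        x_tilde = np.dot(alpha, past)         # prediction from reconstructed past
        d = x[n] - x_tilde                    # difference signal d[n]
        codes[n] = int(np.round(d / step))    # uniform quantization index c[n]
        x_hat[n + p] = x_tilde + codes[n] * step   # x^[n] = x~[n] + d^[n]
    return codes, x_hat[p:]

codes, x_rec = dpcm_encode_decode(np.sin(0.1 * np.arange(200)),
                                  alpha=np.array([0.9]), step=0.05)
```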


Issues with Differential Quantization

The difference signal retains the character of the excitation signal:
  it switches back and forth between quasi-periodic and noise-like signals

The prediction duration (even when using p = 20) is on the order of 2.5 msec (for a sampling rate of 8 kHz):
  the predictor is predicting the vocal tract response, not the excitation period (for voiced sounds)

Solution: incorporate two stages of prediction, namely a short-time predictor for the vocal tract response and a long-time predictor for the pitch period


Pitch Prediction

first stage pitch predictor:

P_1(z) = \beta z^{-M}

second stage linear predictor (vocal tract predictor):

P_2(z) = \sum_{k=1}^{p} \alpha_k z^{-k}

(figure: transmitter and receiver block diagrams of the two-stage predictive coder, with quantizer Q[·], predictors P_1(z) and P_2(z), combined predictor P_c(z), and signals x[n], d[n], d̂[n], x̃[n], x̂[n], and the residual)


Pitch Prediction

first stage pitch predictor:

P_1(z) = \beta z^{-M}

this predictor model assumes that the pitch period, M, is an integer number of samples and \beta is a gain constant allowing for variations in pitch period over time (for unvoiced or background frames, the values of \beta and M are irrelevant)

an alternative (somewhat more complicated) pitch predictor is of the form:

P_1(z) = \beta_1 z^{-(M-1)} + \beta_2 z^{-M} + \beta_3 z^{-(M+1)}

this more advanced form provides a way to handle a non-integer pitch period through interpolation around the nearest integer pitch period value, M


Combined Prediction

The combined inverse system is the cascade in the decoder system:

H_c(z) = \frac{1}{1 - P_c(z)} = \frac{1}{[1 - P_1(z)][1 - P_2(z)]}

with 2-stage prediction error filter of the form:

1 - P_c(z) = [1 - P_1(z)][1 - P_2(z)]

so that

P_c(z) = P_1(z) + P_2(z)[1 - P_1(z)]

which is implemented as a parallel combination of two predictors: P_1(z) and P_2(z)[1 - P_1(z)].

The prediction signal, \tilde{x}[n], can be expressed as:

\tilde{x}[n] = \beta x[n-M] + \sum_{k=1}^{p} \alpha_k \left( x[n-k] - \beta x[n-k-M] \right)
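A small NumPy sketch of the combined prediction signal above, assuming the frame x and the parameters beta, M, and alpha are already known; the function name and the zero history before the frame are illustrative choices.

```python
import numpy as np

def combined_prediction(x, beta, M, alpha):
    """x_tilde[n] = beta*x[n-M] + sum_k alpha_k*(x[n-k] - beta*x[n-k-M]).
    Samples before the start of the frame are taken as zero."""
    p = len(alpha)
    xp = np.concatenate((np.zeros(M + p), x))      # zero history before the frame
    x_tilde = np.zeros(len(x))
    for n in range(len(x)):
        m = n + M + p                              # index of x[n] inside xp
        pred = beta * xp[m - M]                    # long-term (pitch) term
        for k in range(1, p + 1):                  # short-term (vocal tract) terms
            pred += alpha[k - 1] * (xp[m - k] - beta * xp[m - k - M])
        x_tilde[n] = pred
    return x_tilde
```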


Combined Prediction Error

The combined prediction error can be defined as:

d_c[n] = x[n] - \tilde{x}[n] = v[n] - \sum_{k=1}^{p} \alpha_k v[n-k]

where

v[n] = x[n] - \beta x[n-M]

is the prediction error of the pitch predictor.

The optimal values of \beta, M, and \alpha_k, k = 1, ..., p, are obtained, in theory, by minimizing the variance of d_c[n]. In practice a sub-optimum solution is obtained by first minimizing the variance of v[n] and then minimizing the variance of d_c[n] subject to the chosen values of \beta and M.


Solution for Combined Predictor

The mean-squared prediction error for the pitch predictor is:

E^{(1)} = \langle (v[n])^2 \rangle = \langle ( x[n] - \beta x[n-M] )^2 \rangle

where \langle \cdot \rangle denotes averaging over a finite frame of speech samples. We use the covariance-type of averaging to eliminate windowing effects, giving the solution:

\beta_{opt} = \frac{\langle x[n]\, x[n-M] \rangle}{\langle (x[n-M])^2 \rangle}

Using this value of \beta_{opt}, the minimum mean-squared prediction error becomes:

E^{(1)}_{opt} = \langle (x[n])^2 \rangle \left[ 1 - (\rho[M])^2 \right]

with normalized covariance:

\rho[M] = \frac{\langle x[n]\, x[n-M] \rangle}{\left( \langle (x[n])^2 \rangle \, \langle (x[n-M])^2 \rangle \right)^{1/2}}


Solution for Combined Predictor

Steps in solution:

1. first search for the M that maximizes \rho[M]
2. compute \beta_{opt}
3. solve for a more accurate pitch predictor by minimizing the variance of the expanded pitch predictor
4. solve for the optimum vocal tract predictor coefficients, \alpha_k, k = 1, 2, ..., p
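A minimal NumPy sketch of these steps (omitting the 3-tap refinement of step 3), assuming x holds one analysis frame of 8 kHz speech; the helper names, the lag search range, and the use of the plain autocorrelation method for the vocal tract coefficients are illustrative choices rather than the lecture's exact procedure.

```python
import numpy as np

def pitch_predictor(x, m_min=20, m_max=147):
    """Search for the lag M that maximizes the normalized covariance rho[M],
    then compute the optimal pitch gain beta (covariance-type averages)."""
    best_m, best_rho = m_min, -np.inf
    for m in range(m_min, m_max + 1):
        num = np.dot(x[m:], x[:-m])                         # <x[n] x[n-M]>
        den = np.sqrt(np.dot(x[m:], x[m:]) * np.dot(x[:-m], x[:-m]))
        rho = num / den if den > 0 else 0.0
        if rho > best_rho:
            best_rho, best_m = rho, m
    beta = np.dot(x[best_m:], x[:-best_m]) / np.dot(x[:-best_m], x[:-best_m])
    return best_m, beta

def short_term_predictor(v, p=10):
    """Vocal tract predictor coefficients alpha_k from the pitch residual v[n]
    (plain autocorrelation method, normal equations solved directly)."""
    r = np.array([np.dot(v[:len(v) - k], v[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# usage on one frame of speech x (placeholder random data here)
x = np.random.randn(240)
M, beta = pitch_predictor(x)
v = x.copy()
v[M:] -= beta * x[:-M]                  # pitch residual v[n] = x[n] - beta*x[n-M]
alpha = short_term_predictor(v, p=10)
```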


Pitch Prediction

(figure: prediction error waveforms comparing vocal tract prediction alone with combined pitch and vocal tract prediction)


Noise Shaping in DPCM Systems


Noise Shaping Fundamentals

The output of an ADPCM encoder/decoder is:

\hat{x}[n] = x[n] + e[n]

where e[n] is the quantization noise. It is easy to show that e[n] generally has a flat spectrum and thus is especially audible in spectral regions of low intensity, i.e., between formants.

This has led to methods of shaping the quantization noise to match the speech spectrum and take advantage of spectral masking concepts.


Noise Shaping

(figure: (a) basic ADPCM encoder and decoder; (b) equivalent ADPCM encoder and decoder; (c) noise shaping ADPCM encoder and decoder)


Noise Shaping

The equivalence of parts (b) and (a) is shown by the following. From part (a) we see that:

\hat{x}[n] = x[n] + e[n] \quad \Leftrightarrow \quad \hat{X}(z) = X(z) + E(z)

D(z) = X(z) - P(z)\hat{X}(z) = [1 - P(z)]X(z) - P(z)E(z)

with E(z) = \hat{D}(z) - D(z). Then:

\hat{D}(z) = D(z) + E(z) = [1 - P(z)]X(z) + [1 - P(z)]E(z)

\hat{X}(z) = H(z)\hat{D}(z) = \frac{1}{1 - P(z)}\hat{D}(z) = \frac{1}{1 - P(z)}\big( [1 - P(z)]X(z) + [1 - P(z)]E(z) \big) = X(z) + E(z)

showing the equivalence of parts (b) and (a). Further, feeding back the quantization error through the predictor, P(z), ensures that the reconstructed signal, \hat{x}[n], differs from x[n] by the quantization error, e[n], incurred in quantizing the difference signal, d[n].


Shaping the Quantization Noise

To shape the quantization noise we simply replace P(z) (in the quantization-error feedback path) by a different system function, F(z), giving the reconstructed signal as:

\hat{X}'(z) = H(z)\hat{D}'(z) = \frac{1}{1 - P(z)}\hat{D}'(z) = \frac{1}{1 - P(z)}\big( [1 - P(z)]X(z) + [1 - F(z)]E'(z) \big) = X(z) + \frac{1 - F(z)}{1 - P(z)}E'(z)

Thus if x[n] is coded by the encoder, the z-transform of the reconstructed signal at the receiver is:

\hat{X}'(z) = X(z) + \tilde{E}'(z)

where

\tilde{E}'(z) = \frac{1 - F(z)}{1 - P(z)} E'(z)

and \frac{1 - F(z)}{1 - P(z)} is the effective noise shaping filter.


Noise Shaping Filter Options

1. F(z) = 0: if we assume the noise has a flat spectrum, then the noise and speech spectrum have the same shape

2. F(z) = P(z): the equivalent system is the standard DPCM system, where \tilde{E}'(z) = E'(z) with a flat noise spectrum

3. F(z) = P(z/\gamma) = \sum_{k=1}^{p} \alpha_k \gamma^k z^{-k}: we ''shape'' the noise spectrum to ''hide'' the noise beneath the spectral peaks of the speech signal; each zero of [1 - P(z)] is paired with a zero of [1 - F(z)], where the paired zeros have the same angles in the z-plane but a reduced radius (scaled by \gamma)



Noise Shaping Filter

If we assume that the quantization noise has a flat spectrum with noise power \sigma_{e'}^2, then the power spectrum of the shaped noise is of the form:

P_{\tilde{e}'}(e^{j 2\pi F / F_S}) = \sigma_{e'}^2 \, \frac{\left| 1 - F(e^{j 2\pi F / F_S}) \right|^2}{\left| 1 - P(e^{j 2\pi F / F_S}) \right|^2}

(figure: speech spectrum and shaped noise spectrum; the annotation marks the region where the noise spectrum lies above the speech spectrum)
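A small NumPy sketch of this relationship, assuming placeholder predictor coefficients alpha and noise power sigma_e2, and using F(z) = P(z/gamma) from option 3 above; inv_filter_spec is a hypothetical helper that evaluates |1 - P(z/gamma)|^2 on a frequency grid.

```python
import numpy as np

def inv_filter_spec(alpha, gamma=1.0, n_fft=512):
    """|1 - P(z/gamma)|^2 on the unit circle, where P(z) = sum_k alpha_k z^-k.
    gamma < 1 moves the zeros of the prediction-error filter toward the origin."""
    a = np.concatenate(([1.0], -alpha * gamma ** np.arange(1, len(alpha) + 1)))
    return np.abs(np.fft.rfft(a, n_fft)) ** 2

alpha = np.array([1.2, -0.8, 0.3])   # placeholder short-term predictor coefficients
sigma_e2 = 1e-3                      # assumed flat quantization-noise power
gamma = 0.8                          # noise-shaping factor in F(z) = P(z/gamma)

speech_envelope = 1.0 / inv_filter_spec(alpha)                      # ~ |H(e^jw)|^2 = 1/|1 - P|^2
shaped_noise = sigma_e2 * inv_filter_spec(alpha, gamma) / inv_filter_spec(alpha)
```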


Fully Quantized Adaptive Predictive Coder


Full ADPCM Coder

Input is x[n]

P_2(z) is the short-term (vocal tract) predictor

Signal v[n] is the short-term prediction error

The goal of the encoder is to obtain a quantized representation of this excitation signal, from which the original signal can be reconstructed.


Quantized ADPCM Coder

Total bit rate for the ADPCM coder:

I_{ADPCM} = B \cdot F_S + B_{\Delta} \cdot F_{\Delta} + B_P \cdot F_P

where B is the number of bits for the quantization of the difference signal (at sampling rate F_S), B_{\Delta} is the number of bits for encoding the step size at frame rate F_{\Delta}, and B_P is the total number of bits allocated to the predictor coefficients (both long- and short-term) with frame rate F_P.

Typically F_S = 8000, and even with B = 1 to 4 bits, we need between 8000 and 32,000 bps for quantization of the difference signal.

Typically we need about 3000-4000 bps for the side information (step size and predictor coefficients).

Overall we need between 11,000 and 36,000 bps for a fully quantized system.


Bit Rate for LP Coding

speech and residual sampling rate: F_S = 8 kHz

LP analysis frame rate: F = F_P = 50-100 frames/sec

quantizer step size: 6 bits/frame

predictor parameters:
  M (pitch period): 7 bits/frame
  pitch predictor coefficients: 13 bits/frame
  vocal tract predictor coefficients: PARCORs 16-20, 46-50 bits/frame

prediction residual: 1-3 bits/sample

total bit rate: BR = 72 \cdot F_P + F_S (minimum)
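As a quick check of the minimum figure, using the lowest per-frame allocation from the list above (6 + 7 + 13 + 46 = 72 bits/frame), F_P = 50 frames/sec from the stated range, and a 1 bit/sample residual at F_S = 8 kHz:

BR_{min} = 72 \cdot 50 + 1 \cdot 8000 = 3600 + 8000 = 11{,}600 \ \text{bps}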


Two-Level (B = 1 bit) Quantizer

(figure: waveforms of the prediction residual, quantizer input, quantizer output, reconstructed pitch residual, original pitch residual, reconstructed speech, and original speech)


Three-Level Center-Clipped Quantizer

(figure: waveforms of the prediction residual, quantizer input, quantizer output, reconstructed pitch residual, original pitch residual, reconstructed speech, and original speech)


Summary of Using LP in Speech Coding

the predictor can be more sophisticated than a vocal tract response predictor: it can utilize periodicity (for voiced speech frames)

the quantization noise spectrum can be shaped by noise feedback; the key concept is to hide the quantization noise under the formant peaks in the speech, thereby utilizing the perceptual masking power of the human auditory system

we now move on to more advanced LP coding of speech using Analysis-by-Synthesis methods


Analysis-by-Synthesis Speech Coders


A-b-S Speech Coding

The key to reducing the data rate of a closed-loop adaptive predictive coder was to force the coded difference signal (the input/excitation to the vocal tract model) to be more easily represented at low data rates while maintaining very high quality at the output of the decoder synthesizer.


A-b-S Speech Coding

Replace the quantizer for generating the excitation signal with an optimization process (denoted as Error Minimization above), whereby the excitation signal, d[n], is constructed based on minimization of the mean-squared value of the synthesis error, d[n] = x[n] - x̃[n]; this utilizes a Perceptual Weighting filter.


A-b-S Speech Coding

Basic operation of each loop of the closed-loop A-b-S system:

1. at the beginning of each loop (and only once each loop), the speech signal, x[n], is used to generate an optimum p-th order LPC filter of the form:

H(z) = \frac{1}{1 - P(z)} = \frac{1}{1 - \sum_{i=1}^{p} \alpha_i z^{-i}}

2. the difference signal, d'[n] = x[n] - \tilde{x}[n], based on an initial estimate of the speech signal, \tilde{x}[n], is perceptually weighted by a speech-adaptive filter of the form:

W(z) = \frac{1 - P(z)}{1 - P(z/\gamma)}

(see the next slide)

3. the error minimization box and the excitation generator create a sequence of excitation signals, d[n], that iteratively (once per loop) improve the match to the weighted error signal

4. the resulting excitation signal, d[n], which is an improved estimate of the actual LPC prediction error signal for each loop iteration, is used to excite the LPC filter, and the loop processing is iterated until the resulting error signal meets some criterion for stopping the closed-loop iterations.


Perceptual Weighting Function

W(z) = \frac{1 - P(z)}{1 - P(z/\gamma)}

As \gamma approaches 1, the weighting is flat; as \gamma approaches 0, the weighting becomes the inverse frequency response of the vocal tract.


Perceptual Weighting

The perceptual weighting filter is often modified to the form:

W(z) = \frac{1 - P(z/\gamma_1)}{1 - P(z/\gamma_2)}, \quad 0 \le \gamma_2 \le \gamma_1 \le 1

so as to make the perceptual weighting less sensitive to the detailed frequency response of the vocal tract filter.
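A short sketch of building and applying this modified weighting filter, assuming LPC coefficients alpha are available; the gamma values shown (0.9 and 0.6) and the placeholder signals are illustrative choices, not values specified in the lecture.

```python
import numpy as np
from scipy.signal import lfilter

def weighting_filter(alpha, gamma1=0.9, gamma2=0.6):
    """Numerator/denominator of W(z) = [1 - P(z/gamma1)] / [1 - P(z/gamma2)],
    with P(z) = sum_k alpha_k z^-k."""
    k = np.arange(1, len(alpha) + 1)
    b = np.concatenate(([1.0], -alpha * gamma1 ** k))   # 1 - P(z/gamma1)
    a = np.concatenate(([1.0], -alpha * gamma2 ** k))   # 1 - P(z/gamma2)
    return b, a

alpha = np.array([1.2, -0.8, 0.3])       # placeholder LPC coefficients
error = np.random.randn(240)             # placeholder synthesis error x[n] - x~[n]
b, a = weighting_filter(alpha)
weighted_error = lfilter(b, a, error)    # perceptually weighted error used in the minimization
```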


Implementation of A-B-S Speech Coding

Goal: find a representation of the excitation for the vocal tract filter that produces high-quality synthetic output, while maintaining a structured representation that makes it easy to code the excitation at low data rates.

Solution: use a set of basis functions which allow you to iteratively build up an optimal excitation function in stages, by adding a new basis function at each iteration of the A-b-S process.


Implementation of A-B-S Speech Coding

Assume we are given a set of basis functions of the form:

\{ f_1[n], f_2[n], ..., f_Q[n] \}, \quad 0 \le n \le L-1

and each basis function is 0 outside the defining interval.

At each iteration of the A-b-S loop, we select the basis function f_k[n] from the set that maximally reduces the perceptually weighted mean-squared error:

E = \sum_{n=0}^{L-1} \Big( \big( x[n] - d[n] * h[n] \big) * w[n] \Big)^2

where h[n] and w[n] are the vocal tract and perceptual weighting filter impulse responses.

We denote the optimal basis function at the k-th iteration as f_k[n], giving the excitation component d_k[n] = \beta_k f_k[n], where \beta_k is the optimal weighting coefficient for basis function f_k[n].

The A-b-S iteration continues until the perceptually weighted error falls below some desired threshold, or until a maximum number of iterations, N, is reached, giving the final excitation signal, d[n], as:

d[n] = \sum_{k=1}^{N} \beta_k f_k[n]
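A compact NumPy sketch of this greedy build-up, assuming a perceptually weighted target x_w for one frame, the impulse response h_w of the cascaded vocal-tract and weighting filters, and a Q x L array of basis functions; abs_excitation, the placeholder data, and the zero-state/no-memory simplifications are illustrative rather than the exact closed-loop coder.

```python
import numpy as np

def abs_excitation(x_w, h_w, basis, n_iter=8):
    """Iteratively build d[n] = sum_k beta_k * f_k[n] by greedily picking, at each
    pass, the basis function (and its optimal gain) that most reduces the remaining
    perceptually weighted mean-squared error."""
    L = len(x_w)
    filtered = np.array([np.convolve(f, h_w)[:L] for f in basis])  # f_k[n] * h_w[n]
    d = np.zeros(L)
    residual = x_w.copy()                       # weighted error still to be matched
    for _ in range(n_iter):
        energies = np.sum(filtered ** 2, axis=1)
        gains = filtered @ residual / np.maximum(energies, 1e-12)  # optimal beta per candidate
        reduction = gains ** 2 * energies                          # error decrease per candidate
        k = int(np.argmax(reduction))
        d += gains[k] * basis[k]                # add beta_k * f_k[n] to the excitation
        residual -= gains[k] * filtered[k]      # update the remaining weighted error
    return d

# usage with single-impulse basis functions (multipulse-style), placeholder data
L = 80
basis = np.eye(L)                               # f_k[n] = delta[n - k]
h_w = 0.9 ** np.arange(L)                       # placeholder weighted synthesis impulse response
x_w = np.random.randn(L)                        # placeholder weighted target signal
d = abs_excitation(x_w, h_w, basis, n_iter=8)
```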


Implementation of A-B-S Speech Coding

(figure: block diagrams of the closed-loop coder and the reformulated closed-loop coder)


Implementation of A-B-S Speech Coding

Assume that d[n] is known up to the current frame (taken as the 0-th frame for simplicity).

Initialize the estimate of the excitation, d^{(0)}[n], as:

d^{(0)}[n] = 0, \quad 0 \le n \le L-1

Form the initial estimate of the speech signal