Speech Signal Coding Using Various Techniques (Kodiranje Govornog Signala Raznim Tehnikama)



Digital Speech Processing: Lecture 17

Speech Coding Methods Based on Speech Models


Waveform Coding versus Block Processing

Waveform coding
  sample-by-sample matching of waveforms
  coding quality measured using SNR

Source modeling (block processing)
  block processing of signal => vector of outputs every block
  overlapped blocks

(figure: overlapping analysis blocks 1, 2, and 3)


Model-Based Speech Coding

We've carried waveform coding based on optimizing and maximizing SNR about as far as possible:
  achieved bit rate reductions on the order of 4:1 (i.e., from 128 kbps PCM to 32 kbps ADPCM) while at the same time achieving toll quality SNR for telephone-bandwidth speech

To lower the bit rate further without reducing speech quality, we need to exploit features of the speech production model, including:
  source modeling
  spectrum modeling
  use of codebook methods for coding efficiency

We also need a new way of comparing performance of different waveform and model-based coding methods:
  an objective measure, like SNR, isn't an appropriate measure for model-based coders since they operate on blocks of speech and don't follow the waveform on a sample-by-sample basis
  new subjective measures need to be used that measure user-perceived quality, intelligibility, and robustness to multiple factors


Topics Covered in this Lecture

Enhancements for ADPCM Coders
  pitch prediction
  noise shaping

Analysis-by-Synthesis Speech Coders
  multipulse linear prediction coder (MPLPC)
  code-excited linear prediction (CELP)

Open-Loop Speech Coders
  two-state excitation model
  LPC vocoder
  residual-excited linear predictive coder
  mixed excitation systems

Speech coding quality measures (MOS)

Speech coding standards


Differential Quantization

(figure: DPCM encoder and decoder block diagram with signals x[n], d[n], d̂[n], c[n], x̃[n], and x̂[n])

P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}

P: simple predictor of the
vocal tract response
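As a rough illustration of the differential quantization loop above, the sketch below quantizes the prediction difference d[n] with a fixed uniform quantizer and reconstructs x̂[n] = x̃[n] + d̂[n]; the predictor coefficients, step size, and helper name (dpcm_encode_decode) are hypothetical stand-ins, not the adaptive quantizer/predictor of a real ADPCM coder.

```python
import numpy as np

def dpcm_encode_decode(x, alpha, step):
    """Toy DPCM loop: predict x~[n] from past reconstructed samples,
    quantize the difference d[n], and reconstruct x^[n] = x~[n] + d^[n]."""
    p = len(alpha)
    x_hat = np.zeros(len(x) + p)              # reconstructed signal, with p samples of history
    codes = np.zeros(len(x), dtype=int)
    for n in range(len(x)):
        past = x_hat[n + p - 1::-1][:p]       # x^[n-1], x^[n-2], ..., x^[n-p]
        x_tilde = np.dot(alpha, past)         # prediction from reconstructed past
        d = x[n] - x_tilde                    # difference signal d[n]
        codes[n] = int(np.round(d / step))    # uniform quantization index c[n]
        x_hat[n + p] = x_tilde + codes[n] * step   # x^[n] = x~[n] + d^[n]
    return codes, x_hat[p:]

codes, x_rec = dpcm_encode_decode(np.sin(0.1 * np.arange(200)),
                                  alpha=np.array([0.9]), step=0.05)
```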


Issues with Differential Quantization

The difference signal retains the character of the excitation signal:
  it switches back and forth between quasi-periodic and noise-like signals

The prediction duration (even when using p = 20) is on the order of 2.5 msec (for a sampling rate of 8 kHz):
  the predictor is predicting the vocal tract response, not the excitation period (for voiced sounds)

Solution: incorporate two stages of prediction, namely a short-time predictor for the vocal tract response and a long-time predictor for the pitch period


Pitch Prediction

first stage pitch predictor:

P_1(z) = \beta z^{-M}

second stage linear predictor (vocal tract predictor):

P_2(z) = \sum_{k=1}^{p} \alpha_k z^{-k}

(figure: transmitter and receiver block diagrams of the two-stage predictive coder, with quantizer Q[·], predictors P_1(z) and P_2(z), combined predictor P_c(z), and signals x[n], d[n], d̂[n], x̃[n], x̂[n], and the residual)


Pitch Prediction

first stage pitch predictor:

P_1(z) = \beta z^{-M}

this predictor model assumes that the pitch period, M, is an integer number of samples and \beta is a gain constant allowing for variations in pitch period over time (for unvoiced or background frames, the values of \beta and M are irrelevant)

an alternative (somewhat more complicated) pitch predictor is of the form:

P_1(z) = \beta_1 z^{-(M-1)} + \beta_2 z^{-M} + \beta_3 z^{-(M+1)}

this more advanced form provides a way to handle a non-integer pitch period through interpolation around the nearest integer pitch period value, M


Combined Prediction

The combined inverse system is the cascade in the decoder system:

H_c(z) = \frac{1}{1 - P_c(z)} = \frac{1}{[1 - P_1(z)][1 - P_2(z)]}

with 2-stage prediction error filter of the form:

1 - P_c(z) = [1 - P_1(z)][1 - P_2(z)]

so that

P_c(z) = P_1(z) + P_2(z)[1 - P_1(z)]

which is implemented as a parallel combination of two predictors: P_1(z) and P_2(z)[1 - P_1(z)].

The prediction signal, \tilde{x}[n], can be expressed as:

\tilde{x}[n] = \beta x[n-M] + \sum_{k=1}^{p} \alpha_k \left( x[n-k] - \beta x[n-k-M] \right)
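A small NumPy sketch of the combined prediction signal above, assuming the frame x and the parameters beta, M, and alpha are already known; the function name and the zero history before the frame are illustrative choices.

```python
import numpy as np

def combined_prediction(x, beta, M, alpha):
    """x_tilde[n] = beta*x[n-M] + sum_k alpha_k*(x[n-k] - beta*x[n-k-M]).
    Samples before the start of the frame are taken as zero."""
    p = len(alpha)
    xp = np.concatenate((np.zeros(M + p), x))      # zero history before the frame
    x_tilde = np.zeros(len(x))
    for n in range(len(x)):
        m = n + M + p                              # index of x[n] inside xp
        pred = beta * xp[m - M]                    # long-term (pitch) term
        for k in range(1, p + 1):                  # short-term (vocal tract) terms
            pred += alpha[k - 1] * (xp[m - k] - beta * xp[m - k - M])
        x_tilde[n] = pred
    return x_tilde
```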


Combined Prediction Error

The combined prediction error can be defined as:

d_c[n] = x[n] - \tilde{x}[n] = v[n] - \sum_{k=1}^{p} \alpha_k v[n-k]

where

v[n] = x[n] - \beta x[n-M]

is the prediction error of the pitch predictor.

The optimal values of \beta, M, and \alpha_k, k = 1, ..., p, are obtained, in theory, by minimizing the variance of d_c[n]. In practice a sub-optimum solution is obtained by first minimizing the variance of v[n] and then minimizing the variance of d_c[n] subject to the chosen values of \beta and M.


Solution for Combined Predictor

The mean-squared prediction error for the pitch predictor is:

E^{(1)} = \langle (v[n])^2 \rangle = \langle ( x[n] - \beta x[n-M] )^2 \rangle

where \langle \cdot \rangle denotes averaging over a finite frame of speech samples. We use the covariance-type of averaging to eliminate windowing effects, giving the solution:

\beta_{opt} = \frac{\langle x[n]\, x[n-M] \rangle}{\langle (x[n-M])^2 \rangle}

Using this value of \beta_{opt}, the minimum mean-squared prediction error becomes:

E^{(1)}_{opt} = \langle (x[n])^2 \rangle \left[ 1 - (\rho[M])^2 \right]

with normalized covariance:

\rho[M] = \frac{\langle x[n]\, x[n-M] \rangle}{\left( \langle (x[n])^2 \rangle \, \langle (x[n-M])^2 \rangle \right)^{1/2}}


Solution for Combined Predictor

Steps in solution:

1. first search for the M that maximizes \rho[M]
2. compute \beta_{opt}
3. solve for a more accurate pitch predictor by minimizing the variance of the expanded pitch predictor
4. solve for the optimum vocal tract predictor coefficients, \alpha_k, k = 1, 2, ..., p
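A minimal NumPy sketch of these steps (omitting the 3-tap refinement of step 3), assuming x holds one analysis frame of 8 kHz speech; the helper names, the lag search range, and the use of the plain autocorrelation method for the vocal tract coefficients are illustrative choices rather than the lecture's exact procedure.

```python
import numpy as np

def pitch_predictor(x, m_min=20, m_max=147):
    """Search for the lag M that maximizes the normalized covariance rho[M],
    then compute the optimal pitch gain beta (covariance-type averages)."""
    best_m, best_rho = m_min, -np.inf
    for m in range(m_min, m_max + 1):
        num = np.dot(x[m:], x[:-m])                         # <x[n] x[n-M]>
        den = np.sqrt(np.dot(x[m:], x[m:]) * np.dot(x[:-m], x[:-m]))
        rho = num / den if den > 0 else 0.0
        if rho > best_rho:
            best_rho, best_m = rho, m
    beta = np.dot(x[best_m:], x[:-best_m]) / np.dot(x[:-best_m], x[:-best_m])
    return best_m, beta

def short_term_predictor(v, p=10):
    """Vocal tract predictor coefficients alpha_k from the pitch residual v[n]
    (plain autocorrelation method, normal equations solved directly)."""
    r = np.array([np.dot(v[:len(v) - k], v[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# usage on one frame of speech x (placeholder random data here)
x = np.random.randn(240)
M, beta = pitch_predictor(x)
v = x.copy()
v[M:] -= beta * x[:-M]                  # pitch residual v[n] = x[n] - beta*x[n-M]
alpha = short_term_predictor(v, p=10)
```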


Pitch Prediction

(figure: prediction error waveforms comparing vocal tract prediction alone with combined pitch and vocal tract prediction)


Noise Shaping in DPCM Systems


Noise Shaping Fundamentals

The output of an ADPCM encoder/decoder is:

\hat{x}[n] = x[n] + e[n]

where e[n] is the quantization noise. It is easy to show that e[n] generally has a flat spectrum and thus is especially audible in spectral regions of low intensity, i.e., between formants.

This has led to methods of shaping the quantization noise to match the speech spectrum and take advantage of spectral masking concepts.


Noise Shaping

(figure: (a) basic ADPCM encoder and decoder; (b) equivalent ADPCM encoder and decoder; (c) noise shaping ADPCM encoder and decoder)


Noise Shaping

The equivalence of parts (b) and (a) is shown by the following. From part (a) we see that:

\hat{x}[n] = x[n] + e[n] \quad \Leftrightarrow \quad \hat{X}(z) = X(z) + E(z)

D(z) = X(z) - P(z)\hat{X}(z) = [1 - P(z)]X(z) - P(z)E(z)

with E(z) = \hat{D}(z) - D(z). Then:

\hat{D}(z) = D(z) + E(z) = [1 - P(z)]X(z) + [1 - P(z)]E(z)

\hat{X}(z) = H(z)\hat{D}(z) = \frac{1}{1 - P(z)}\hat{D}(z) = \frac{1}{1 - P(z)}\big( [1 - P(z)]X(z) + [1 - P(z)]E(z) \big) = X(z) + E(z)

showing the equivalence of parts (b) and (a). Further, feeding back the quantization error through the predictor, P(z), ensures that the reconstructed signal, \hat{x}[n], differs from x[n] by the quantization error, e[n], incurred in quantizing the difference signal, d[n].


Shaping the Quantization Noise

To shape the quantization noise we simply replace P(z) (in the quantization-error feedback path) by a different system function, F(z), giving the reconstructed signal as:

\hat{X}'(z) = H(z)\hat{D}'(z) = \frac{1}{1 - P(z)}\hat{D}'(z) = \frac{1}{1 - P(z)}\big( [1 - P(z)]X(z) + [1 - F(z)]E'(z) \big) = X(z) + \frac{1 - F(z)}{1 - P(z)}E'(z)

Thus if x[n] is coded by the encoder, the z-transform of the reconstructed signal at the receiver is:

\hat{X}'(z) = X(z) + \tilde{E}'(z)

where

\tilde{E}'(z) = \frac{1 - F(z)}{1 - P(z)} E'(z)

and \frac{1 - F(z)}{1 - P(z)} is the effective noise shaping filter.


Noise Shaping Filter Options

1. F(z) = 0: if we assume the noise has a flat spectrum, then the noise and speech spectrum have the same shape

2. F(z) = P(z): the equivalent system is the standard DPCM system, where \tilde{E}'(z) = E'(z) with a flat noise spectrum

3. F(z) = P(z/\gamma) = \sum_{k=1}^{p} \alpha_k \gamma^k z^{-k}: we ''shape'' the noise spectrum to ''hide'' the noise beneath the spectral peaks of the speech signal; each zero of [1 - P(z)] is paired with a zero of [1 - F(z)], where the paired zeros have the same angles in the z-plane but a reduced radius (scaled by \gamma)



Noise Shaping Filter

If we assume that the quantization noise has a flat spectrum with noise power \sigma_{e'}^2, then the power spectrum of the shaped noise is of the form:

P_{\tilde{e}'}(e^{j 2\pi F / F_S}) = \sigma_{e'}^2 \, \frac{\left| 1 - F(e^{j 2\pi F / F_S}) \right|^2}{\left| 1 - P(e^{j 2\pi F / F_S}) \right|^2}

(figure: speech spectrum and shaped noise spectrum; the annotation marks the region where the noise spectrum lies above the speech spectrum)
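A small NumPy sketch of this relationship, assuming placeholder predictor coefficients alpha and noise power sigma_e2, and using F(z) = P(z/gamma) from option 3 above; inv_filter_spec is a hypothetical helper that evaluates |1 - P(z/gamma)|^2 on a frequency grid.

```python
import numpy as np

def inv_filter_spec(alpha, gamma=1.0, n_fft=512):
    """|1 - P(z/gamma)|^2 on the unit circle, where P(z) = sum_k alpha_k z^-k.
    gamma < 1 moves the zeros of the prediction-error filter toward the origin."""
    a = np.concatenate(([1.0], -alpha * gamma ** np.arange(1, len(alpha) + 1)))
    return np.abs(np.fft.rfft(a, n_fft)) ** 2

alpha = np.array([1.2, -0.8, 0.3])   # placeholder short-term predictor coefficients
sigma_e2 = 1e-3                      # assumed flat quantization-noise power
gamma = 0.8                          # noise-shaping factor in F(z) = P(z/gamma)

speech_envelope = 1.0 / inv_filter_spec(alpha)                      # ~ |H(e^jw)|^2 = 1/|1 - P|^2
shaped_noise = sigma_e2 * inv_filter_spec(alpha, gamma) / inv_filter_spec(alpha)
```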


Fully Quantized Adaptive Predictive Coder


Full ADPCM Coder

Input is x[n]

P_2(z) is the short-term (vocal tract) predictor

Signal v[n] is the short-term prediction error

The goal of the encoder is to obtain a quantized representation of this excitation signal, from which the original signal can be reconstructed.


Quantized ADPCM Coder

Total bit rate for the ADPCM coder:

I_{ADPCM} = B \cdot F_S + B_{\Delta} \cdot F_{\Delta} + B_P \cdot F_P

where B is the number of bits for the quantization of the difference signal (at sampling rate F_S), B_{\Delta} is the number of bits for encoding the step size at frame rate F_{\Delta}, and B_P is the total number of bits allocated to the predictor coefficients (both long- and short-term) with frame rate F_P.

Typically F_S = 8000, and even with B = 1 to 4 bits, we need between 8000 and 32,000 bps for quantization of the difference signal.

Typically we need about 3000-4000 bps for the side information (step size and predictor coefficients).

Overall we need between 11,000 and 36,000 bps for a fully quantized system.


Bit Rate for LP Coding

speech and residual sampling rate: F_S = 8 kHz

LP analysis frame rate: F = F_P = 50-100 frames/sec

quantizer step size: 6 bits/frame

predictor parameters:
  M (pitch period): 7 bits/frame
  pitch predictor coefficients: 13 bits/frame
  vocal tract predictor coefficients: PARCORs 16-20, 46-50 bits/frame

prediction residual: 1-3 bits/sample

total bit rate: BR = 72 \cdot F_P + F_S (minimum)
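As a quick check of the minimum figure, using the lowest per-frame allocation from the list above (6 + 7 + 13 + 46 = 72 bits/frame), F_P = 50 frames/sec from the stated range, and a 1 bit/sample residual at F_S = 8 kHz:

BR_{min} = 72 \cdot 50 + 1 \cdot 8000 = 3600 + 8000 = 11{,}600 \ \text{bps}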


Two-Level (B = 1 bit) Quantizer

(figure: waveforms of the prediction residual, quantizer input, quantizer output, reconstructed pitch residual, original pitch residual, reconstructed speech, and original speech)


Three-Level Center-Clipped Quantizer

(figure: waveforms of the prediction residual, quantizer input, quantizer output, reconstructed pitch residual, original pitch residual, reconstructed speech, and original speech)


Summary of Using LP in Speech Coding

the predictor can be more sophisticated than a vocal tract response predictor: it can utilize periodicity (for voiced speech frames)

the quantization noise spectrum can be shaped by noise feedback; the key concept is to hide the quantization noise under the formant peaks in the speech, thereby utilizing the perceptual masking power of the human auditory system

we now move on to more advanced LP coding of speech using Analysis-by-Synthesis methods


Analysis-by-Synthesis Speech Coders


A-b-S Speech Coding

The key to reducing the data rate of a closed-loop adaptive predictive coder was to force the coded difference signal (the input/excitation to the vocal tract model) to be more easily represented at low data rates while maintaining very high quality at the output of the decoder synthesizer.


A-b-S Speech Coding

Replace the quantizer for generating the excitation signal with an optimization process (denoted as Error Minimization above), whereby the excitation signal, d[n], is constructed based on minimization of the mean-squared value of the synthesis error, d[n] = x[n] - x̃[n]; this utilizes a Perceptual Weighting filter.


A-b-S Speech Coding

Basic operation of each loop of the closed-loop A-b-S system:

1. at the beginning of each loop (and only once each loop), the speech signal, x[n], is used to generate an optimum p-th order LPC filter of the form:

H(z) = \frac{1}{1 - P(z)} = \frac{1}{1 - \sum_{i=1}^{p} \alpha_i z^{-i}}

2. the difference signal, d'[n] = x[n] - \tilde{x}[n], based on an initial estimate of the speech signal, \tilde{x}[n], is perceptually weighted by a speech-adaptive filter of the form:

W(z) = \frac{1 - P(z)}{1 - P(z/\gamma)}

(see the next slide)

3. the error minimization box and the excitation generator create a sequence of excitation signals, d[n], that iteratively (once per loop) improve the match to the weighted error signal

4. the resulting excitation signal, d[n], which is an improved estimate of the actual LPC prediction error signal for each loop iteration, is used to excite the LPC filter, and the loop processing is iterated until the resulting error signal meets some criterion for stopping the closed-loop iterations.


Perceptual Weighting Function

W(z) = \frac{1 - P(z)}{1 - P(z/\gamma)}

As \gamma approaches 1, the weighting is flat; as \gamma approaches 0, the weighting becomes the inverse frequency response of the vocal tract.


Perceptual Weighting

The perceptual weighting filter is often modified to the form:

W(z) = \frac{1 - P(z/\gamma_1)}{1 - P(z/\gamma_2)}, \quad 0 \le \gamma_2 \le \gamma_1 \le 1

so as to make the perceptual weighting less sensitive to the detailed frequency response of the vocal tract filter.
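A short sketch of building and applying this modified weighting filter, assuming LPC coefficients alpha are available; the gamma values shown (0.9 and 0.6) and the placeholder signals are illustrative choices, not values specified in the lecture.

```python
import numpy as np
from scipy.signal import lfilter

def weighting_filter(alpha, gamma1=0.9, gamma2=0.6):
    """Numerator/denominator of W(z) = [1 - P(z/gamma1)] / [1 - P(z/gamma2)],
    with P(z) = sum_k alpha_k z^-k."""
    k = np.arange(1, len(alpha) + 1)
    b = np.concatenate(([1.0], -alpha * gamma1 ** k))   # 1 - P(z/gamma1)
    a = np.concatenate(([1.0], -alpha * gamma2 ** k))   # 1 - P(z/gamma2)
    return b, a

alpha = np.array([1.2, -0.8, 0.3])       # placeholder LPC coefficients
error = np.random.randn(240)             # placeholder synthesis error x[n] - x~[n]
b, a = weighting_filter(alpha)
weighted_error = lfilter(b, a, error)    # perceptually weighted error used in the minimization
```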


Implementation of A-B-S Speech Coding

Goal: find a representation of the excitation for the vocal tract filter that produces high-quality synthetic output, while maintaining a structured representation that makes it easy to code the excitation at low data rates.

Solution: use a set of basis functions which allow you to iteratively build up an optimal excitation function in stages, by adding a new basis function at each iteration of the A-b-S process.


Implementation of A-B-S Speech Coding

Assume we are given a set of basis functions of the form:

\{ f_1[n], f_2[n], ..., f_Q[n] \}, \quad 0 \le n \le L-1

and each basis function is 0 outside the defining interval.

At each iteration of the A-b-S loop, we select the basis function f_k[n] from the set that maximally reduces the perceptually weighted mean-squared error:

E = \sum_{n=0}^{L-1} \Big( \big( x[n] - d[n] * h[n] \big) * w[n] \Big)^2

where h[n] and w[n] are the vocal tract and perceptual weighting filter impulse responses.

We denote the optimal basis function at the k-th iteration as f_k[n], giving the excitation component d_k[n] = \beta_k f_k[n], where \beta_k is the optimal weighting coefficient for basis function f_k[n].

The A-b-S iteration continues until the perceptually weighted error falls below some desired threshold, or until a maximum number of iterations, N, is reached, giving the final excitation signal, d[n], as:

d[n] = \sum_{k=1}^{N} \beta_k f_k[n]
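A compact NumPy sketch of this greedy build-up, assuming a perceptually weighted target x_w for one frame, the impulse response h_w of the cascaded vocal-tract and weighting filters, and a Q x L array of basis functions; abs_excitation, the placeholder data, and the zero-state/no-memory simplifications are illustrative rather than the exact closed-loop coder.

```python
import numpy as np

def abs_excitation(x_w, h_w, basis, n_iter=8):
    """Iteratively build d[n] = sum_k beta_k * f_k[n] by greedily picking, at each
    pass, the basis function (and its optimal gain) that most reduces the remaining
    perceptually weighted mean-squared error."""
    L = len(x_w)
    filtered = np.array([np.convolve(f, h_w)[:L] for f in basis])  # f_k[n] * h_w[n]
    d = np.zeros(L)
    residual = x_w.copy()                       # weighted error still to be matched
    for _ in range(n_iter):
        energies = np.sum(filtered ** 2, axis=1)
        gains = filtered @ residual / np.maximum(energies, 1e-12)  # optimal beta per candidate
        reduction = gains ** 2 * energies                          # error decrease per candidate
        k = int(np.argmax(reduction))
        d += gains[k] * basis[k]                # add beta_k * f_k[n] to the excitation
        residual -= gains[k] * filtered[k]      # update the remaining weighted error
    return d

# usage with single-impulse basis functions (multipulse-style), placeholder data
L = 80
basis = np.eye(L)                               # f_k[n] = delta[n - k]
h_w = 0.9 ** np.arange(L)                       # placeholder weighted synthesis impulse response
x_w = np.random.randn(L)                        # placeholder weighted target signal
d = abs_excitation(x_w, h_w, basis, n_iter=8)
```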


Implementation of A-B-S Speech Coding

(figure: block diagrams of the closed-loop coder and the reformulated closed-loop coder)


Implementation of A-B-S Speech Coding

Assume that d[n] is known up to the current frame (taken as the 0-th frame for simplicity).

Initialize the estimate of the excitation, d^{(0)}[n], as:

d^{(0)}[n] = 0, \quad 0 \le n \le L-1

Form the initial estimate of the speech signal