Speech Coding 2014

Speech coding lecture

    Speech Coding

    Speech coders
      Waveform coders
        Time domain
          Nondifferential: PCM
          Differential: Delta, ADPCM, CVSDM, APC
        Frequency domain: SBC, ATC
      Source coders: LPC, Vocoders

    PCM (Pulse Code Modulation)

    Here, analog signals are quantized in uniform steps, as in ordinary A/D conversion.

    PCM does not compress the information rate, since it does not use speech-specific characteristics; it is based only on the statistical characteristics of speech amplitude.

    The number of quantization bits B must satisfy $2^B \Delta \ge L$, or equivalently $B \ge \log_2(L/\Delta)$, where $\Delta$ is the quantization step size and L is the range of the signal amplitude.

    The number of bits B must be chosen so that the SNR of the quantized signal is larger than that of the signal before quantization.

    The PCM used in the ordinary telephone system is called log PCM, since the amplitude is compressed by a logarithmic transformation before linear quantization and coding.
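The bit-count condition for uniform PCM can be checked numerically. The following is a minimal sketch, assuming a signal confined to [-L/2, L/2) and a mid-rise quantizer; the function names `bits_needed` and `uniform_quantize` are illustrative and do not come from the lecture.

```python
import math

def bits_needed(amplitude_range, step):
    """Smallest integer B with 2**B * step >= amplitude_range."""
    return math.ceil(math.log2(amplitude_range / step))

def uniform_quantize(x, amplitude_range, bits):
    """Mid-rise uniform quantizer over [-L/2, L/2).

    Returns (code, reconstructed_value); inputs outside the range are clipped.
    """
    levels = 2 ** bits
    step = amplitude_range / levels
    code = math.floor((x + amplitude_range / 2) / step)
    code = max(0, min(levels - 1, code))          # clip to the valid code range
    value = (code + 0.5) * step - amplitude_range / 2
    return code, value
```

For example, a range of L = 2 with a step of 2/256 needs 8 bits, the word length commonly associated with telephone PCM.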

    Since the amplitude of a speech signal has an exponential distribution, the occurrence probability for each bit is equalized by the logarithmic transformation.

    Two types of transformation expressions control the compression:

    μ-law: logarithmic compressor used in North American telecommunications systems

    It has an input-output magnitude characteristic of the form

    $$|y| = \frac{\log(1 + \mu |s|)}{\log(1 + \mu)}$$

    where |s| is the magnitude of the input, |y| is the magnitude of the output, and μ is a parameter selected to give the desired compression characteristic.

    The larger the value of μ, the larger the amount of compression.

    Typically μ is chosen between 100 and 500. μ = 255 is the standard value for encoding speech waveforms in the US and Canada.

    A-law: logarithmic compressor used in European telecommunications systems

    It has an input-output magnitude characteristic of the form

    $$|y| = \frac{1 + \log(A |s|)}{1 + \log A}$$

    where |s| is the magnitude of the input, |y| is the magnitude of the output, and A is a parameter selected to give the desired compression characteristic.

    A = 87.56 is chosen as the standard value for encoding speech waveforms.

    Even though the two compression characteristics are different nonlinear functions, they are very similar.
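The two companding laws can be compared directly. This is a minimal sketch assuming inputs normalized to |s| ≤ 1 and natural logarithms. The linear segment used for |s| below 1/A belongs to the usual A-law definition but is not shown in the slides, and the function names are illustrative.

```python
import math

MU = 255.0   # mu-law parameter quoted in the text
A = 87.56    # A-law parameter quoted in the text

def mu_law_compress(s, mu=MU):
    """|y| = log(1 + mu*|s|) / log(1 + mu), with the sign of s preserved."""
    return math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def a_law_compress(s, a=A):
    """Logarithmic segment for |s| >= 1/A, linear segment below it."""
    x = abs(s)
    if x < 1.0 / a:
        y = a * x / (1.0 + math.log(a))
    else:
        y = (1.0 + math.log(a * x)) / (1.0 + math.log(a))
    return math.copysign(y, s)
```

Evaluating both compressors at the same input, for instance s = 0.1, gives outputs that differ only in the second decimal place, which illustrates how close the two characteristics are.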

    Differential PCM

    This method is based on predictive coding:

    Information compression can be achieved by coding either the difference between adjacent samples or the difference between the actual sample value and a value predicted from the correlation (the prediction residual).

    This is possible because a speech signal is correlated between adjacent samples as well as between distant samples.

    The number of quantization bits can be reduced, because the difference and the prediction residual have a smaller range of variation and a smaller mean energy than the original signal.

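    [Figure: DPCM encoder and decoder. Blocks: quantizer and predictor in the encoder, predictor in the decoder. Signals: input $s(n)$, prediction residual $e(n)$, quantized residual $\tilde{e}(n)$ sent to the channel, reconstructed signal $\tilde{s}(n)$ sent to the D/A converter.]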

    When the prediction is linear, the prediction residual is quantized and transmitted. For first-order linear prediction, the residual is $e_t = s_t + \alpha_1 s_{t-1}$. If the predictor coefficient is set to $\alpha_1 = -1$, the system transmits only the difference between adjacent samples. This system is called differential PCM (DPCM).

    This differential method is used to cope with accumulated encoder error and to achieve the maximum prediction gain.
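A first-order DPCM loop with the predictor coefficient fixed at -1 can be sketched as follows. This is an illustrative sketch rather than the lecture's own implementation: it assumes a uniform quantizer with unbounded integer codes, and the encoder predicts from the reconstructed previous sample so that quantization errors do not accumulate at the decoder.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes = []
    prediction = 0.0                      # previous reconstructed sample
    for s in samples:
        residual = s - prediction         # first-order residual with coefficient -1
        q = round(residual / step)        # uniform quantizer index
        codes.append(q)
        prediction += q * step            # mirror the decoder's reconstruction
    return codes

def dpcm_decode(codes, step):
    """Accumulate the dequantized differences to rebuild the signal."""
    out = []
    prediction = 0.0
    for q in codes:
        prediction += q * step
        out.append(prediction)
    return out
```

Because the encoder tracks the decoder state, the reconstruction error of every sample stays within half a quantization step.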

    One nonstationary characteristic of a speech signal is that the variance and the autocorrelation function of the source output vary with time, whereas PCM and DPCM encoders are designed on the assumption that the source output is stationary. The efficiency and performance of these encoders can be improved by adapting them to the slowly time-varying statistics of the speech signal.

    In PCM and DPCM, the quantization error q(n) from a uniform quantizer operating on a nonstationary input signal has a time-varying variance (quantization noise power). One improvement that reduces the dynamic range of the quantization noise is the use of an adaptive quantizer.

    An adaptive quantizer used in conjunction with PCM gives adaptive PCM (APCM); used with DPCM, it gives adaptive DPCM (ADPCM).

    Adaptive PCM

    This method exploits the nonstationarity of the dynamic characteristics of speech amplitude to improve the SNR of quantized speech.

    The quantization step size is varied according to the rms value of the amplitude.

    Since the speech signal can be considered stationary over a short period, the step size can be varied relatively slowly.

    There are two classes of adaptive quantizers: feedforward and feedback.

    Feedforward adaptive quantizer

    The step size is adjusted for each signal sample based on a short-term temporal estimate of the input speech signal variance.

    The optimum step size is decided according to the rms value calculated for every block, and is transmitted to the receiver as side information.

    [Figure: feedforward adaptive quantizer. Blocks: adaptive gain controller, divider, quantizer, encoder, decoder, multiplier. Signals: input $s(n)$, output $\tilde{s}(n)$.]
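A block-wise feedforward scheme can be sketched as below. It is a sketch under stated assumptions: the step size is derived from the block rms with a hypothetical loading factor of 4 (the quantizer spans about ±4 rms), and the step is passed to the decoder unquantized as side information.

```python
import math

def apcm_forward_encode(samples, block_size, bits, loading=4.0):
    """Return a list of (step, codes) pairs, one pair per block.

    The step is the side information; codes are signed quantizer indices.
    """
    levels = 2 ** bits
    half = levels // 2
    blocks = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        rms = math.sqrt(sum(x * x for x in block) / len(block))
        step = max(2.0 * loading * rms / levels, 1e-12)   # avoid a zero step
        codes = [max(-half, min(half - 1, math.floor(x / step))) for x in block]
        blocks.append((step, codes))
    return blocks

def apcm_forward_decode(blocks):
    """Rebuild samples at the midpoint of each quantization interval."""
    return [(c + 0.5) * step for step, codes in blocks for c in codes]
```

A quiet block and a loud block with the same shape receive the same codes but different steps, so the relative quantization error is similar in both.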

    Feedback (backward) adaptive quantizer

    The step size does not need to be transmitted, since it can be generated automatically, sample by sample, from the reconstructed samples at both ends.

    The output of the quantizer is used to adjust the step size.

    In terms of bit rate, backward adaptation is more efficient than forward adaptation.

    Forward adaptation has a higher bit rate because of the side information.

    [Figure: feedback adaptive quantizer. Blocks: divider, quantizer, encoder, decoder, multiplier, and an adaptive gain controller at both the encoder and the decoder. Signals: input $s(n)$, output $\tilde{s}(n)$.]
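Backward adaptation can be illustrated with a step-size multiplier rule driven by the previous code word. The rule and the multiplier values (`grow`, `shrink`) are assumptions chosen for illustration, not values taken from the lecture. The point of the sketch is that the encoder and decoder run the same update on the transmitted codes, so no side information is needed.

```python
def next_step(step, magnitude, half_levels, grow=1.6, shrink=0.8, lo=1e-4, hi=10.0):
    """Enlarge the step after outer-level codes, reduce it after inner-level codes."""
    factor = grow if magnitude >= half_levels // 2 else shrink
    return min(hi, max(lo, step * factor))

def encode(samples, bits=3, step=0.1):
    """Return (is_negative, magnitude) codes; the step size is never transmitted."""
    half = 2 ** (bits - 1)
    codes = []
    for x in samples:
        magnitude = min(half - 1, int(abs(x) / step))
        codes.append((x < 0, magnitude))
        step = next_step(step, magnitude, half)
    return codes

def decode(codes, bits=3, step=0.1):
    """Regenerate the step sequence from the codes alone and rebuild the samples."""
    half = 2 ** (bits - 1)
    out = []
    for negative, magnitude in codes:
        value = (magnitude + 0.5) * step
        out.append(-value if negative else value)
        step = next_step(step, magnitude, half)
    return out
```

When the input level jumps, the first reconstructed samples are badly clipped, and the error shrinks as the step grows over the following samples.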

    Adaptive DPCM

    In adaptive differential PCM (ADPCM), the predictor is made adaptive.

    The coefficients of the predictor can be changed periodically to reflect the changing signal statistics of the source.

    [Figure: ADPCM encoder and decoder. Encoder blocks: quantizer, predictor, predictor adaptation, step-size adaptation; input $s(n)$, quantized residual $\tilde{e}(n)$ sent to the channel. Decoder blocks: predictor; $\tilde{e}(n)$ from the channel is used to form $\tilde{s}(n)$, which goes to the D/A converter.]

    Here, the short-term autocorrelation method is used to estimate the LP parameters over the current frame.

    The resulting predictor coefficients are transmitted along with the quantized error to the receiver, which implements the same predictor.

    The transmission of the predictor coefficients therefore raises the bit rate over the channel, partly offsetting the lower data rate achieved by a quantizer with fewer bits (fewer levels), which suffices for the reduced dynamic range of the error after adaptive prediction.

    As an alternative to transmitting the prediction coefficients, the reflection coefficients can be transmitted. They have a smaller dynamic range and thus result in a lower bit rate.
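The autocorrelation method and the reflection coefficients are connected through the Levinson-Durbin recursion. The sketch below assumes the predictor convention ŝ(n) = Σ a[j]·s(n-1-j); sign conventions for reflection coefficients vary between textbooks, and the function names are illustrative.

```python
def autocorrelation(frame, max_lag):
    """Short-term autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations recursively.

    Returns (a, k, err): predictor coefficients, reflection coefficients,
    and the final prediction error energy.
    """
    a, k_list, err = [], [], r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        k_list.append(k)
        err *= (1.0 - k * k)
    return a, k_list, err
```

For a frame produced by the autocorrelation method the reflection coefficients are bounded by 1 in magnitude, which is why they have a smaller dynamic range than the predictor coefficients.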

    A 32-kbps ADPCM standard has been established by the CCITT (Consultative Committee for International Telephone and Telegraph) for international telephone communications and by ANSI (American National Standards Institute) for North American telephone systems.

    The forward type of ADPCM, in which optimum prediction is performed for every frame of the speech signal, is called adaptive predictive coding (APC). In its narrowest sense, APC designates a coding system involving pitch prediction and two-level quantization for the predictive parameters.

    APC (Adaptive Predictive Coding)

    APC can be viewed as an enhanced version of ADPCM in which the periodicity of voiced speech is used to reduce the size of the error. Thus fewer bits are needed to represent the error sequence.

    The speech signal is analyzed frame by frame to obtain the predictor coefficients $\alpha_i$, the pitch period M, and the amplitude of the pitch component. These, together with the quantization step width q for the residual signal, are called side information and are transmitted along with the residual signal. The residual signal is quantized and 1-bit coded (two levels).

    Since linear prediction is performed using all samples in each frame, a large prediction gain can be obtained.

    Subjective evaluation experiments indicated that when the sampling frequency is 6.67 kHz (so that the transmission bit rate for the residual signal is 6.67 kbps, with a small amount of side information transmitted in addition), the quality of the coded speech is slightly lower than that of 6-bit log PCM.

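    Delta Modulation

    A simplified form of DPCM in which a two-level (1-bit) quantizer is used in conjunction with a fixed first-order predictor.

    [Figure: delta modulation encoder and decoder. Encoder blocks: two-level quantizer, unit delay $z^{-1}$ in the feedback loop; input $s(n)$, residual $e(n)$, quantized residual $\tilde{e}(n)$ sent to the channel. Decoder blocks: accumulator with unit delay $z^{-1}$, lowpass filter; reconstructed signal $\tilde{s}(n)$, output.]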

    This is an extreme form of differential quantization, in which the sampling frequency is raised so high that the difference between adjacent samples can be approximated by a 1-bit representation.

    Its advantage is its simple structure, which rests on the fact that the correlation between adjacent samples increases with the sampling frequency, except for uncorrelated signals. As the correlation increases, the prediction residual decreases.
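Delta modulation reduces to a few lines of code. This is a minimal sketch assuming a fixed step `delta` and omitting the output lowpass filter; the names are illustrative.

```python
import math

def delta_modulate(samples, delta):
    """Emit 1 if the input is at or above the running approximation, else 0."""
    bits = []
    approx = 0.0
    for s in samples:
        bit = 1 if s >= approx else 0
        bits.append(bit)
        approx += delta if bit else -delta    # same staircase the decoder builds
    return bits

def delta_demodulate(bits, delta):
    """Integrate the +/- delta steps to rebuild the staircase approximation."""
    out = []
    approx = 0.0
    for bit in bits:
        approx += delta if bit else -delta
        out.append(approx)
    return out
```

As long as the input changes by less than `delta` per sample, which is what a high sampling frequency ensures, the staircase stays within one step of the input; faster changes cause slope overload.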

    SBC (Subband Coding)

    SBC is a frequency-domain coding method in which the speech band is divided into several neighbouring bands by a bank of band-pass filters (BPFs), and a specific coding strategy is employed for each band signal.

    [Figure: subband coder. Encoder: input, BPF 1 to N, down-sampling (DS 1 to N), coder 1 to N, multiplexer. Decoder: demultiplexer, decoder 1 to N, interpolation (INT 1 to N), BPF 1 to N, output.]

    DS = down-sampling; INT = interpolation

    The speech signal passing through each BPF is transformed into a baseband signal by low-frequency conversion, down-sampled at the Nyquist rate, and coded by an adaptive coding method such as APCM.

    The inverse procedures reproduce the original signal.
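The split, down-sample, code, and recombine chain can be sketched with the simplest possible two-band filter bank. The Haar filter pair used here is an assumption made to keep the example short, standing in for the band-pass filter bank of a real subband coder; the per-band step sizes stand in for the per-band coding strategy.

```python
import math

def analyze(x):
    """Split an even-length signal into low and high bands, each down-sampled by 2."""
    r = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return low, high

def synthesize(low, high):
    """Interpolate and recombine the two bands (inverse of analyze)."""
    r = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out += [(l + h) / r, (l - h) / r]
    return out

def quantize(band, step):
    """Uniform quantizer applied to one band."""
    return [round(v / step) * step for v in band]

def subband_codec(x, low_step, high_step):
    """Code each band with its own step size, then reconstruct."""
    low, high = analyze(x)
    return synthesize(quantize(low, low_step), quantize(high, high_step))
```

Giving the low band a finer step than the high band corresponds to allocating more bits to the band that carries more speech energy.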

    The design of the filters is important for achieving good SBC performance.

    Advantages:

    Processing that takes human auditory characteristics into account, such as noise shaping, can easily be applied.

    A higher bit rate can be allocated to bands in which speech energy is concentrated or to bands that are subjectively more important.

    Less perceptible quantization noise is produced at the same or even a lower bit rate.

    The quantization noise produced in one band does not influence any other band; in other words, low-level speech input is not corrupted by quantization noise from another band.

    Since the human auditory system performs a short-time frequency analysis of input signals, controlling the quantization noise in the frequency domain is effective and relatively natural.

    The filter bank necessary for this method is realized by general digital filters that handle sampled analog values. The most reasonable way of dividing the frequency band is to equalize the contributions of all subbands to the articulation index.

    Although this method is classified as frequency-domain coding, it can also be regarded as a time-domain coding method in which the input signals are subdivided into frequency bands and then quantized.

    ATC (Adaptive Transform Coding)

    In this method, a speech signal is divided into several frequency bands, as in SBC. The speech wave is divided into frames, each of which can be considered stationary. Each speech frame is first orthogonally transformed into frequency-domain components, which are then adaptively quantized.

    At the decoder, the speech wave is reproduced by concatenating the inverse-transformed block waveforms.

    The system usually uses a discrete cosine transform for the transformation and adaptive bit allocation for the quantization.

    To achieve coding efficiency, more bits are assigned to the more important spectral coefficients and fewer bits to the less important ones.

    By dynamically allocating the total number of bits among the spectral coefficients, the coder can adapt to the changing statistics of the speech signal.
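The transform and the bit allocation can be sketched separately. The code below assumes an orthonormal DCT-II and a greedy allocation rule in which each extra bit reduces a coefficient's quantization noise by a factor of 4 (about 6 dB); the greedy rule is an illustrative choice, and the variance estimates would in practice come from side information.

```python
import math

def dct(frame):
    """Orthonormal DCT-II of one frame."""
    n = len(frame)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(frame)))
    return out

def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
                * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def allocate_bits(variances, total_bits):
    """Give each bit to the coefficient whose remaining noise is largest."""
    bits = [0] * len(variances)
    noise = list(variances)
    for _ in range(total_bits):
        k = max(range(len(noise)), key=noise.__getitem__)
        bits[k] += 1
        noise[k] /= 4.0          # one more bit lowers the noise power by about 6 dB
    return bits
```

With coefficient variances of 16, 1, 0.25 and 0 and a budget of four bits, the rule assigns three bits to the first coefficient, one to the second, and none to the others.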

    Vocoders

    The preceding waveform coding techniques are based on either a sample-by-sample or a frame-by-frame representation of the speech waveform, in either the time or the frequency domain.

    Vocoders, by contrast, represent a speech signal by an all-pole model of the vocal system. In other words, the speech production system is modeled as an all-pole filter.

    For voiced speech, the excitation is a periodic impulse train with period equal to the pitch period of the speech; for unvoiced speech, the excitation is a white noise sequence.

    Basically, a vocoder estimates the model parameters from frames of speech (speech analysis), encodes and transmits the parameters to the receiver on a frame-by-frame basis, and reconstructs the speech signal from the model (speech synthesis) at the receiver.
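The synthesis side of this model can be sketched as an excitation generator followed by an all-pole filter. The parameter names, the default pitch period of 80 samples, and the Gaussian noise source are assumptions for illustration only.

```python
import random

def excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white Gaussian noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def all_pole_filter(x, a, gain=1.0):
    """y[n] = gain * x[n] + sum over k of a[k] * y[n - 1 - k]."""
    y = []
    for n, xn in enumerate(x):
        acc = gain * xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * y[n - 1 - k]
        y.append(acc)
    return y
```

A receiver would call `all_pole_filter(excitation(...), a, gain)` once per frame, using the coefficients, gain, pitch period and voiced/unvoiced flag decoded for that frame.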

    The most widely discussed types include channel vocoders, phase vocoders, formant vocoders, and cepstral vocoders.

    LPC (Linear Predictive Coders)

    These are time-domain vocoders, in which the significant features of speech are extracted from the time waveform.

    LPC is computationally intensive; however, it is the most popular among the class of low-bit-rate vocoders.

    Advantages:

    The system is free from quality degradation due to source modeling.

    A low-frequency waveform is exactly reproduced within the limit of the quantization error.

    Spectral information for the entire frequency range is efficiently represented by this method.

    Since pitch period estimation and voiced/unvoiced decision are not necessary, the system is free from both pitch estimation error and voiced/unvoiced decision error.

    The most widely discussed variants include residual-excited LPC and multipulse LPC.

    Performance Evaluation

    Two techniques are used to evaluate the quality of speech coded by the various methods:

    Subjective evaluation

    Listening tests are conducted by playing the sample to a number of listeners and asking them to judge the quality of the speech.

    The tests provide results in terms of overall quality, listening effort, intelligibility, and naturalness.

    Examples:

    1. A-B discrimination test

    Tests the transparency of the quantizer, for broadcast-quality coders.

    Forces the listeners to guess which of two signals was the original and which was quantized.

    2. Diagnostic Rhyme Test (DRT)

    The most popular and widely used intelligibility test.

    Measures the listener's ability to identify the spoken word.

    A word from a pair of rhymed words, such as those-dose, is presented to the listener, who is asked to identify which word was spoken.

    Typical percentages correct on DRT tests range from 75% to 90%.

    3. Diagnostic Acceptability Measure (DAM)

    Evaluates the acceptability of speech coding systems.

    These test results are difficult to rank and hence require a reference system.

    4. Mean Opinion Score (MOS)

    The most popular ranking system.

    Listeners are asked to rate signals on a five-point scale.

    Scores are averaged across listeners and across sentences.

    Quality Scale   Score   Listening Effort Scale
    Excellent       5       No effort required
    Good            4       No appreciable effort required
    Fair            3       Moderate effort required
    Poor            2       Considerable effort required
    Bad             1       No meaning understood with reasonable effort
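Averaging across listeners and sentences is a plain mean over all ratings. The sketch below assumes the ratings are stored as one list of integer scores per listener; the data layout and function name are illustrative.

```python
def mean_opinion_score(ratings):
    """Average five-point ratings, where ratings[listener][sentence] is 1..5."""
    scores = [score for listener in ratings for score in listener]
    if not scores:
        raise ValueError("no ratings supplied")
    if any(score not in (1, 2, 3, 4, 5) for score in scores):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(scores) / len(scores)
```

For example, two listeners rating two sentences as [5, 4] and [3, 4] give a MOS of 4.0, which corresponds to "Good" on the scale.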

    Objective evaluation

    Objective measures have the general nature of a signal-to-noise ratio.

    They provide a quantitative value of how well the reconstructed speech approximates the original speech.

    They do not necessarily indicate the speech quality perceived by the human ear.

    Examples: mean square error (MSE) distortion, frequency-weighted MSE, SNR, segmental SNR, etc.
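Two of the listed measures, SNR and segmental SNR, can be sketched as follows. The frame handling is an assumption: frames with zero signal or zero noise energy are skipped, and any trailing partial frame is ignored.

```python
import math

def snr_db(original, decoded):
    """Overall SNR in dB: signal energy over error energy."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    return 10.0 * math.log10(signal / noise)

def segmental_snr_db(original, decoded, frame_len):
    """Mean of the per-frame SNR values in dB."""
    values = []
    for start in range(0, len(original) - frame_len + 1, frame_len):
        o = original[start:start + frame_len]
        d = decoded[start:start + frame_len]
        signal = sum(x * x for x in o)
        noise = sum((x - y) ** 2 for x, y in zip(o, d))
        if signal > 0.0 and noise > 0.0:      # skip degenerate frames
            values.append(10.0 * math.log10(signal / noise))
    return sum(values) / len(values)
```

Because the per-frame values are averaged in dB, low-level frames weigh as much as loud ones, so the segmental figure falls below the overall SNR when quiet segments are coded poorly.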
