Speech Coding 2014


Speech Coding

Speech coders
  Waveform coders
    Time domain
      Nondifferential: PCM
      Differential: Delta, ADPCM, APC, CVSDM
    Frequency domain: SBC, ATC
  Source coders: LPC, Vocoders

PCM (Pulse Code Modulation)

Here, analog signals are quantized in uniform steps, as in ordinary A/D conversion

It does not compress the information rate, since it uses no speech-specific characteristics; its design is based on the statistical characteristics of speech amplitude

The number of quantization bits B must satisfy

$$2^B \Delta \ge L \quad \text{or} \quad B \ge \log_2\left(\frac{L}{\Delta}\right)$$

where $\Delta$ is the quantization step size and $L$ is the range of signal amplitude

The number of bits B must be chosen so that the SNR of the quantized signal is larger than that of the signal before quantization

The PCM used in the ordinary telephone system is called log PCM, since the amplitude is compressed by a logarithmic transformation before linear quantization and coding
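A small worked example may help with the bit-count condition. The sketch below assumes a mid-rise uniform quantizer and illustrative values (amplitude range 2.0, step size 0.01); the function names are hypothetical and not taken from the slides.

```python
import math

def bits_needed(amplitude_range, step):
    """Smallest integer B such that 2**B * step >= amplitude_range."""
    return math.ceil(math.log2(amplitude_range / step))

def quantize_uniform(x, step, bits):
    """Mid-rise uniform quantizer with 2**bits cells centred on zero."""
    levels = 2 ** bits
    index = math.floor(x / step)                             # cell containing x
    index = max(-levels // 2, min(levels // 2 - 1, index))   # clip to the coder range
    return (index + 0.5) * step                              # reproduce the cell midpoint

# Assumed example: signal range [-1, 1), step size 0.01
B = bits_needed(amplitude_range=2.0, step=0.01)
y = quantize_uniform(0.123, step=0.01, bits=B)
```

With these values the range holds 200 steps, so 8 bits (256 levels) are the fewest that cover it, and the quantization error stays within half a step.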

PCM

Since the amplitude of a speech signal has an exponential distribution, the occurrence probability for each bit is equalized by the logarithmic transformation

Two types of transformation expressions that control the compression:

μ-law: Logarithmic compressor used in North American telecommunications systems

Has an input-output magnitude characteristic of the form:

$$|y| = \frac{\log(1 + \mu |s|)}{\log(1 + \mu)}$$

where |s| is the magnitude of the input (normalized so that |s| ≤ 1), |y| is the magnitude of the output, and μ is a parameter that is selected to give the desired compression characteristics

PCM

The larger the value of μ, the larger the amount of compression

Typically μ is chosen between 100 and 500. μ = 255 is the standard for encoding speech waveforms in the US and Canada
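The μ-law characteristic and its inverse can be sketched as follows. This is a minimal illustration on samples normalized to [-1, 1], not the segmented 8-bit encoding used in actual telephone codecs; the function names are hypothetical.

```python
import math

MU = 255.0  # standard value quoted above

def mu_compress(s, mu=MU):
    """Map s in [-1, 1] to y in [-1, 1]: |y| = log(1 + mu|s|) / log(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)

def mu_expand(y, mu=MU):
    """Inverse mapping: |s| = ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

s = 0.01                 # a low-level input sample
y = mu_compress(s)       # boosted well above 0.01 before uniform quantization
s_back = mu_expand(y)    # the expander restores the original value
```

Low-level samples are stretched over a much larger share of the quantizer range, which is what gives small amplitudes finer effective resolution.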

A-law: Logarithmic compressor used in European telecommunications systems

Has an input-output magnitude characteristic of the form:

$$|y| = \frac{1 + \log(A |s|)}{1 + \log A}$$

where |s| is the magnitude of the input, |y| is the magnitude of the output, and A is a parameter that is selected to give the desired compression characteristics. This form applies for 1/A ≤ |s| ≤ 1; below 1/A the standard A-law uses the linear segment |y| = A|s| / (1 + log A)

PCM

A = 87.56 is chosen as the standard for encoding speech waveforms

Even though the two compression characteristics are different nonlinear functions, they are very similar.
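To see how close the two laws are, the sketch below evaluates both on a grid of normalized inputs and records the largest gap. It assumes natural logarithms and the usual linear segment of the A-law below 1/A; the function names are hypothetical.

```python
import math

def a_compress(s, a=87.56):
    """A-law: linear below 1/A, logarithmic from 1/A up to 1."""
    x = abs(s)
    if x < 1.0 / a:
        y = a * x / (1.0 + math.log(a))
    else:
        y = (1.0 + math.log(a * x)) / (1.0 + math.log(a))
    return math.copysign(y, s)

def mu_compress(s, mu=255.0):
    """mu-law: |y| = log(1 + mu|s|) / log(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)

# Largest difference between the two curves on a grid over [0, 1]
max_gap = max(abs(a_compress(k / 1000) - mu_compress(k / 1000))
              for k in range(1001))
```

The gap stays below a tenth of full scale and is largest at very low input levels, where the A-law is linear and the μ-law is still logarithmic.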

Differential PCM

This method is based on predictive coding:

Information compression can be achieved by coding the difference between adjacent samples, or the difference between the actual sample value and the value predicted from the correlation (the prediction residual);

This is possible because a speech signal is correlated between adjacent samples as well as between distant samples;

The number of quantization bits can be reduced, because the difference and the prediction residual have a smaller range of variation and a smaller mean energy than the original signal.

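[Figure: DPCM block diagram. Encoder: the prediction is subtracted from the input s(n) to form the error e(n), which is quantized to ẽ(n) and sent to the channel; a predictor in the feedback loop operates on the reconstructed signal s̃(n). Decoder: ẽ(n) is added to the predictor output to give s̃(n), which goes to the D/A converter.]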

Differential PCM

When the prediction is performed by linear prediction, the prediction residual is quantized and transmitted. For first-order linear prediction, the equation becomes $e_t = s_t + \alpha_1 s_{t-1}$. If the predictor coefficient is set to $\alpha_1 = -1$, the system transmits only the difference between adjacent samples. This system is called differential PCM (DPCM)

This differential method is used to cope with accumulated encoder error and to achieve the maximum prediction gain
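A first-order DPCM loop can be sketched as follows, using the sign convention e_t = s_t + α1·s_{t-1}. The uniform rounding quantizer and the step size are assumptions made for illustration. The encoder predicts from the reconstructed sample rather than the original one, so quantization errors do not accumulate at the decoder.

```python
def quantize(e, step):
    """Uniform quantizer (assumed for illustration): nearest multiple of step."""
    return step * round(e / step)

def dpcm_encode(samples, step, alpha1=-1.0):
    """Return quantized residuals; the predictor uses reconstructed samples."""
    residuals, prev = [], 0.0
    for s in samples:
        e = s + alpha1 * prev          # e_t = s_t + alpha1 * s~_{t-1}
        eq = quantize(e, step)
        residuals.append(eq)
        prev = eq - alpha1 * prev      # reconstruction s~_t, same as in the decoder
    return residuals

def dpcm_decode(residuals, alpha1=-1.0):
    """Rebuild the signal from the quantized residuals alone."""
    out, prev = [], 0.0
    for eq in residuals:
        prev = eq - alpha1 * prev
        out.append(prev)
    return out

signal = [0.0, 0.11, 0.23, 0.31, 0.28, 0.17, 0.02, -0.14]
decoded = dpcm_decode(dpcm_encode(signal, step=0.05))
```

With α1 = -1 each residual is the difference between the current sample and the previous reconstructed sample, and every decoded sample stays within half a quantizer step of the input.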

Differential PCM

One nonstationary characteristic of a speech signal is that the variance and the autocorrelation function of the source output vary with time, whereas PCM and DPCM encoders are designed on the assumption that the source output is stationary. The efficiency and performance of these encoders can be improved by adapting them to the slowly time-variant statistics of the speech signal

In PCM and DPCM, the quantization error q(n) from a uniform quantizer operating on a nonstationary input signal has a time-variant variance (quantization noise power); one improvement that reduces the dynamic range of the quantization noise is the use of an adaptive quantizer.

An adaptive quantizer used in conjunction with PCM is called adaptive PCM (APCM); used with DPCM, it is called adaptive DPCM (ADPCM)

Adaptive PCM

A method that exploits the nonstationarity of the dynamic characteristics of speech amplitude to improve the SNR of quantized speech

The step size of the quantization is varied according to the rms value of the amplitude.

Since the speech signal can be considered stationary over a short period, the step size can be varied relatively slowly

There are two classes of adaptive quantizers: feedforward and feedback.

Adaptive PCM

feedforward adaptive quantizer

Step size is adjusted for each signal sample based on a short-term temporal estimate of the input speech signal variance

The optimum step size is decided according to the rms value calculated for every block, and is transmitted to the receiver as side information

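[Figure: feedforward adaptive quantizer. Encoder: the input s(n) is scaled by an adaptive gain controller and quantized; the decoder rescales the quantizer output to give s̃(n).]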
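The feedforward scheme can be sketched as below: one step size per block, derived from the block rms and sent as side information. The block length, bit count and loading factor are assumed values, and the function names are hypothetical.

```python
import math

def apcm_forward_encode(samples, block_len=4, bits=3, load=4.0):
    """Return one (step, codes) pair per block; step is the side information.
    load (assumed value) sets the quantizer range to +/- load * rms."""
    half = 2 ** (bits - 1)
    blocks = []
    for start in range(0, len(samples), block_len):
        block = samples[start:start + block_len]
        rms = math.sqrt(sum(x * x for x in block) / len(block))
        step = max(load * rms / half, 1e-9)              # guard against a zero step
        codes = [max(-half, min(half - 1, math.floor(x / step))) for x in block]
        blocks.append((step, codes))
    return blocks

def apcm_forward_decode(blocks):
    """Reconstruct cell midpoints using the transmitted step of each block."""
    return [(c + 0.5) * step for step, codes in blocks for c in codes]

# A quiet block followed by a loud block
samples = [0.01, -0.02, 0.015, -0.005, 0.5, -0.8, 0.6, -0.3]
blocks = apcm_forward_encode(samples)
decoded = apcm_forward_decode(blocks)
```

The step size of the loud block is far larger than that of the quiet block, so both blocks use the same 3-bit quantizer effectively.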

Adaptive PCM

feedback (backward) adaptive quantizer

The step size does not need to be transmitted, since it can be generated automatically, sample by sample, from the reconstructed samples at both ends

The output of the quantizer is used in the adjustment of the step size

The forward adaptation is more efficient than the backward adaptation

The forward adaptation, however, has a higher bit rate, because the step size is transmitted as side information

[Figure: feedback adaptive quantizer. Encoder: the input s(n) is scaled and quantized, with the gain derived from the quantizer output; the decoder derives the same gain from the received codes and produces the output s̃(n).]
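Backward adaptation can be sketched with a step size that is updated from the previous code word only, so the decoder can track it without side information. The update rule below (grow after an outer-level code, shrink after an inner one) and its multipliers 1.6 and 0.9 are illustrative assumptions, not values from the slides.

```python
import math

def step_update(step, code, half, lo=1e-4, hi=1.0):
    """Assumed rule: expand the step after outer-level codes, contract otherwise."""
    mag = code if code >= 0 else -code - 1        # level magnitude, 0 .. half - 1
    mult = 1.6 if mag >= half // 2 else 0.9       # illustrative multipliers
    return min(hi, max(lo, step * mult))

def apcm_backward_encode(samples, bits=3, step0=0.01):
    half, step, codes = 2 ** (bits - 1), step0, []
    for x in samples:
        code = max(-half, min(half - 1, math.floor(x / step)))
        codes.append(code)
        step = step_update(step, code, half)      # depends only on the code sent
    return codes

def apcm_backward_decode(codes, bits=3, step0=0.01):
    half, step, out = 2 ** (bits - 1), step0, []
    for code in codes:
        out.append((code + 0.5) * step)
        step = step_update(step, code, half)      # same update as in the encoder
    return out

codes = apcm_backward_encode([0.5] * 30)
decoded = apcm_backward_decode(codes)
```

Starting from a step that is far too small, the quantizer overloads for the first few samples and then settles near the input level; only the code words cross the channel.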

Adaptive DPCM

In Adaptive Differential PCM (ADPCM), the predictor is made adaptive

The coefficients of the predictor can be changed periodically to reflect the changing signal statistics of the source

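[Figure: ADPCM block diagram. Encoder: the prediction is subtracted from the input s(n), the error is quantized to ẽ(n) and sent to the channel; step-size adaptation controls the quantizer and predictor adaptation controls the predictor, which operates on the reconstructed signal s̃(n). Decoder: ẽ(n) from the channel is added to the predictor output, and the result goes to the D/A converter.]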

Adaptive DPCM

Here, the short-term autocorrelation method is used to compute estimates of the LP parameters over the current frame

The predictor coefficients so determined are transmitted, along with the quantized error, to the receiver, which implements the same predictor

Therefore, the transmission of the predictor coefficients results in a higher bit rate over the channel, offsetting in part the lower data rate achieved by having a quantizer with fewer bits (fewer levels) to handle the reduced dynamic range in the error resulting from adaptive prediction.

As an alternative to transmitting the prediction coefficients, the reflection coefficients can be transmitted. They have a smaller dynamic range and thus result in a lower bit rate
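The frame-wise analysis can be sketched as follows. The slides do not name a solver; the Levinson-Durbin recursion is used here as a standard way to solve the autocorrelation equations, and it yields the reflection coefficients as a by-product. The sign convention assumed is e_t = s_t + Σ α_i·s_{t-i}, and the function names are hypothetical.

```python
def autocorr(frame, max_lag):
    """Short-term autocorrelation r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n - m] for n in range(m, len(frame)))
            for m in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (predictor coefficients, reflection coefficients, residual energy)."""
    a = [1.0] + [0.0] * order
    ks, err = [], r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                        # i-th reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]  # update lower-order coefficients
        a[i] = k
        ks.append(k)
        err *= 1.0 - k * k                    # prediction error energy shrinks
    return a[1:], ks, err

# Autocorrelation of an ideal first-order process with correlation 0.9
alphas, ks, err = levinson_durbin([1.0, 0.9, 0.81], order=2)
```

For a valid autocorrelation sequence every reflection coefficient has magnitude below 1, which is the bounded dynamic range that makes them convenient to quantize.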

Adaptive DPCM

A 32-kbps ADPCM standard has been established by CCITT (International Telegraph and Telephone Consultative Committee) for international telephone communications and by ANSI (American National Standards Institute) for North American telephone systems

The forward type of ADPCM, where optimum prediction is performed for every frame of the speech signal, is called adaptive predictive coding (APC). In its narrowest sense, APC designates a coding system involving pitch prediction and two-level quantization of the prediction residual

APC (Adaptive Predictive Coding)

Viewed as an enhanced version of ADPCM in which the periodicity of voiced speech is used to reduce the size of the prediction error. Thus fewer bits are needed to represent the error sequence

APC

The speech signal is analyzed frame by frame to obtain the predictor coefficients $\alpha_i$, the pitch period M, and the amplitude of the pitch component. This information and the quantization step width q for the residual signal, which together are called side information, are transmitted along with the residual signal. The residual signal is quantized and 1-bit coded (two levels)

Since linear prediction is performed using all samples in each frame, a large prediction gain can be obtained

Subjective evaluation experiments indicated that when the sampling frequency is 6.67 kHz (the transmission bit rate for the residual signal is 6.67 kbps, and a small amount of side information is additionally transmitted), the quality of coded speech is slightly lower than with 6-bit log PCM

Delta Modulation

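A simplified form of DPCM in which a two-level (1-bit) quantizer is used in conjunction with a fixed first-order predictor

[Figure: delta modulation block diagram. Encoder: the prediction s̃(n-1) is subtracted from the input s(n) to form e(n), which is quantized to the two-level ẽ(n) and sent to the channel; a unit delay z^-1 in the feedback loop supplies the prediction. Decoder: an accumulator with unit delay z^-1 rebuilds s̃(n), which passes through a lowpass filter to the output.]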

This is an extreme method of differential quantization, where the sampling frequency is raised so high that the difference between adjacent samples can be approximated by a 1-bit representation

Its advantage is its simple structure. It relies on the fact that the correlation between adjacent samples increases with the sampling frequency, except for uncorrelated signals. As the correlation increases, the prediction residual decreases
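Delta modulation is simple enough to sketch in a few lines. The step size and the oversampled test tone below are assumed values, and the decoder's lowpass filter is omitted for brevity.

```python
import math

def dm_encode(samples, delta):
    """One bit per sample: 1 if the input is at or above the running estimate."""
    bits, approx = [], 0.0
    for s in samples:
        bit = 1 if s >= approx else 0
        approx += delta if bit else -delta     # staircase follows the input
        bits.append(bit)
    return bits

def dm_decode(bits, delta):
    """Accumulate +/- delta steps (the lowpass smoothing stage is not shown)."""
    out, approx = [], 0.0
    for bit in bits:
        approx += delta if bit else -delta
        out.append(approx)
    return out

# Heavily oversampled tone: 200 samples per period, amplitude 0.5
tone = [0.5 * math.sin(2 * math.pi * n / 200) for n in range(400)]
staircase = dm_decode(dm_encode(tone, delta=0.02), delta=0.02)
max_err = max(abs(a - b) for a, b in zip(staircase, tone))
```

Because the tone changes by less than one step per sample, the staircase never falls behind (no slope overload) and the error stays within one step. A lower sampling rate or a smaller step would break that condition.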

SBC (Subband Coding)

It is a frequency-domain coding method in which the speech band is divided into several neighbouring bands by a bank of band-pass filters (BPFs), and a specific coding strategy is employed for each band signal

[Figure: SBC block diagram. Encoder: input, BPF1 ... BPFN, DS1 ... DSN, Coder1 ... CoderN, multiplexer. Decoder: demultiplexer, Decoder1 ... DecoderN, INT1 ... INTN, BPF1 ... BPFN, output. DS = down-sampling; INT = interpolation]

The speech signal passing through each BPF is transformed into a baseband signal by low-frequency conversion, down-sampled at the Nyquist rate, and coded by an adaptive coding method such as APCM

The inverse procedures reproduce the original signal

SBC

Design of the filter is important in achieving good performance of the SBC

Advantages:

Processing that reflects human auditory characteristics, such as noise shaping, can easily be applied

A higher bit rate can be allocated to those bands in which higher speech energy is concentrated, or to those bands which are subjectively more important

Less perceptible quantization noise is produced at the same or even at a lower bit rate

The quantization noise produced in one band does not influence any other band, so low-level speech input in one band will not be corrupted by quantization noise from another band

SBC

Since a short-time frequency analysis of input signals is performed in the human auditory system, the method for controlling the quantization noise in the frequency domain is effective and relatively natural

The filter bank necessary for this method is realized by general digital filters, which handle analog sampled values. The most reasonable way of dividing the frequency band is to equalize the contributions to the articulation index from all subbands

Although this method is classified as frequency-domain coding, it can also be regarded as a time-domain coding method in which the input signals are subdivided into frequency bands and then quantized.

ATC (Adaptive Transform Coding)

It is a method in which a speech signal is divided into several frequency bands, as in SBC. Here, the speech wave is divided into frames, each of which can be considered stationary. Each speech frame is first orthogonally transformed into frequency-domain components, which are subsequently processed by adaptive quantization

ATC (Adaptive Transform Coding)

At the decoder stage, the speech wave is reproduced by concatenating the inverse-transformed block waveforms

The system usually uses a discrete cosine transform for the transformation and adaptive bit allocation for the quantization.

To achieve coding efficiency, more bits are assigned to the more important spectral coefficients and fewer bits to the less important spectral coefficients.

By dynamically allocating the total number of bits among the spectral coefficients, the coder can adapt to the changing statistics of the speech signal.
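The transform and bit-allocation steps can be sketched as below. The orthonormal DCT-II is written out directly. The greedy allocation rule (give each bit to the coefficient with the largest remaining energy, then divide that energy by 4 because one extra bit lowers quantization noise by about 6 dB) is an illustrative assumption, not the rule from the slides.

```python
import math

def _scale(k, n):
    return math.sqrt((1.0 if k == 0 else 2.0) / n)

def dct(frame):
    """Orthonormal DCT-II of one frame."""
    n = len(frame)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(frame))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def allocate_bits(coeffs, total_bits):
    """Greedy allocation (assumed rule): most bits go where the energy is."""
    energy = [c * c for c in coeffs]
    bits = [0] * len(coeffs)
    for _ in range(total_bits):
        k = max(range(len(energy)), key=energy.__getitem__)
        bits[k] += 1
        energy[k] /= 4.0        # one more bit: about 6 dB less quantization noise
    return bits

# A frame that matches DCT basis function k = 2
frame = [math.cos(math.pi * (i + 0.5) * 2 / 16) for i in range(16)]
coeffs = dct(frame)
bits = allocate_bits(coeffs, total_bits=8)
rebuilt = idct(coeffs)
```

For this frame the energy sits in a single coefficient, so the whole bit budget goes there. Real speech frames spread energy over many coefficients, and the allocation changes from frame to frame.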

Vocoders

The previous waveform coding techniques are based on either a sample-by-sample, or a frame-by-frame, speech waveform representation either in the time or frequency domain

Here, the method is based on representing the speech signal by an all-pole model of the vocal system. In other words, the speech production system is modeled as an all-pole filter

For voiced speech, the excitation is a periodic impulse train with period equal to the pitch period of the speech; for unvoiced speech, the excitation is a white noise sequence

Basically, a vocoder estimates the model parameters from frames of speech (speech analysis), encodes and transmits the parameters to the receiver on a frame-by-frame basis, and reconstructs the speech signal from the model (speech synthesis) at the receiver
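The synthesis side of this model can be sketched as follows. The coefficient sign convention assumed is e_t = s_t + Σ α_i·s_{t-i}, so synthesis computes s_t = G·e_t − Σ α_i·s_{t-i}. The pitch period, gain and single-coefficient filter in the example are illustrative values, and the function names are hypothetical.

```python
import random

def excitation(n, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white Gaussian noise for unvoiced ones."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def all_pole_synthesize(exc, alphas, gain=1.0):
    """Run the excitation through the all-pole filter 1 / (1 + sum alpha_i z^-i)."""
    out = []
    for t, e in enumerate(exc):
        feedback = sum(a * out[t - i]
                       for i, a in enumerate(alphas, start=1) if t - i >= 0)
        out.append(gain * e - feedback)
    return out

# One voiced and one unvoiced frame through the same (toy) one-pole filter
voiced = all_pole_synthesize(excitation(160, voiced=True), alphas=[-0.9])
unvoiced = all_pole_synthesize(excitation(160, voiced=False), alphas=[-0.9])
```

Only the filter coefficients, gain, pitch period and voiced/unvoiced flag need to be transmitted per frame, which is why this class of coder reaches very low bit rates.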

Vocoders

The most widely discussed types include channel vocoders, phase vocoders, formant vocoders, and cepstral vocoders

LPC (Linear Predictive Coders)

This is a time-domain vocoder, in which the significant features of speech are extracted from the time waveform

LPC is computationally intensive; however, it is the most popular among the class of low-bit-rate vocoders

LPC (Linear Predictive Coders)

Advantages:

The system is free from quality degradation due to source modeling

A low-frequency waveform is exactly reproduced within the limit of the quantization error

Spectral information for the entire frequency range is efficiently represented by this method

Since pitch period estimation and voiced/unvoiced decision are not necessary, the system is free from both pitch estimation error and voiced/unvoiced decision error.

The most widely discussed types include residual-excited LPC and multipulse LPC

Performance Evaluation

Two techniques for evaluating the quality of speech coded by the various methods:

Subjective evaluation

Listening tests are conducted by playing the sample to a number of listeners and asking them to judge the quality of the speech

The tests provide results in terms of overall quality, listening effort, intelligibility, and naturalness

Examples:

1. A-B discrimination test

Tests the transparency of the quantizer, for broadcast-quality coders.

Forces the listeners to guess which of two signals was the original and which was quantized


2. Diagnostic Rhyme Test (DRT)

The most popular and widely used intelligibility test.

Measures the listener's ability to identify the spoken word.

Here, one word from a pair of rhymed words, such as those-dose, is presented to the listener, who is asked to identify which word was spoken

Typical percentages correct on DRT tests range from 75% to 90%

3. Diagnostic Acceptability Measure (DAM)

Evaluates the acceptability of speech coding systems

These test results are difficult to rank and hence require a reference system


4. Mean Opinion Score (MOS)

The most popular ranking system

Listeners are asked to rate signals on a five-point scale

Scores are averaged across listeners and across sentences

Quality Scale   Score   Listening Effort Scale
Excellent       5       No effort required
Good            4       No appreciable effort required
Fair            3       Moderate effort required
Poor            2       Considerable effort required
Bad             1       No meaning understood with reasonable effort


Objective evaluation

Measures of this kind have the general nature of a signal-to-noise ratio

They provide a quantitative value of how well the reconstructed speech approximates the original speech

They do not necessarily indicate speech quality as perceived by the human ear

Examples: Mean Square Error (MSE) distortion, frequency-weighted MSE, SNR, segmental SNR, etc.
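Two of the measures listed above can be sketched as follows. The frame length and the test data are assumed values; the functions expect nonzero signal and noise energy in every frame, and a practical implementation would guard against silent frames.

```python
import math

def snr_db(ref, test):
    """Overall SNR in dB: signal energy over error energy."""
    signal = sum(x * x for x in ref)
    noise = sum((x - y) ** 2 for x, y in zip(ref, test))
    return 10.0 * math.log10(signal / noise)

def segmental_snr_db(ref, test, frame_len):
    """Mean of the per-frame SNRs in dB (whole frames only)."""
    values = [snr_db(ref[s:s + frame_len], test[s:s + frame_len])
              for s in range(0, len(ref) - frame_len + 1, frame_len)]
    return sum(values) / len(values)

# A loud frame and a quiet frame with the same absolute error of 0.01
ref = [1.0] * 8 + [0.1] * 8
coded = [x + 0.01 for x in ref]
overall = snr_db(ref, coded)
segmental = segmental_snr_db(ref, coded, frame_len=8)
```

The overall SNR is dominated by the loud frame, while the segmental SNR averages 40 dB and 20 dB to about 30 dB. It therefore penalizes noise in quiet passages more heavily.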

