
Chapter 8: Speech Coding

School of Information Science and Engineering, SDU

• The performance of speech coders determines the quality of the recovered speech and the capacity of the system.

• In mobile communication systems, bandwidth is a precious commodity, and service providers are continuously met with the challenge of accommodating more users within a limited allocated bandwidth.

• The lower the bit rate at which the coder can deliver toll quality speech, the more speech channels can be compressed within a given bandwidth.

For this reason, manufacturers and service providers are continuously in search of speech coders that will provide toll quality speech at lower bit rates.

8.1 Introduction

• The goal of all speech coding systems: to transmit speech with the highest possible quality using the least possible channel capacity.

This has to be accomplished while maintaining certain required levels of complexity of implementation and communication delay.

• In general, there is a positive correlation between coder bit-rate efficiency and the algorithmic complexity required to achieve it.

A balance needs to be struck between these conflicting factors.

Two categories of coders (based on the means by which they achieve compression):

• Waveform coders
• Vocoders

(1) Waveform coders: reproduce the time waveform of the speech signal as closely as possible.

• Source independent
• Code equally well a variety of signals
• Robust for a wide range of speech characteristics and for noisy environments
• Minimal complexity
• Achieve only moderate economy in transmission bit rate

Examples:
1. Pulse code modulation (PCM)
2. Differential pulse code modulation (DPCM)
3. Adaptive differential pulse code modulation (ADPCM)
4. Delta modulation (DM)
5. Continuously variable slope delta modulation (CVSDM)
6. Adaptive predictive coding (APC)

8.2 Characteristics of Speech Signals

• Speech waveforms have a number of useful properties that can be exploited when designing efficient coders:
  • Nonuniform probability distribution of speech amplitude
  • Nonzero autocorrelation between successive speech samples
  • Nonflat nature of the speech spectra
  • Existence of voiced and unvoiced segments in speech
  • Quasiperiodicity of voiced speech signals

• The most basic property is that speech is bandlimited, so it can be time-discretized at a finite rate and reconstructed completely from its samples.

1) Probability Density Function (pdf)

• Characteristics of the speech signal pdf:
  • Very high probability of near-zero amplitudes
  • Significant probability of very high amplitudes
  • Monotonically decreasing function of amplitude between these extremes

The exact distribution depends on the input bandwidth and recording conditions.

Nonuniform quantizers, including vector quantizers, attempt to match the distribution of quantization levels to the pdf of the input speech signal.

• An approximation to the long-term pdf of telephone quality speech signals is the two-sided exponential (Laplacian) function:

$p(x) = \frac{1}{\sqrt{2}\,\sigma_x} \exp\left(-\frac{\sqrt{2}\,|x|}{\sigma_x}\right)$

where $\sigma_x$ is the standard deviation of the speech amplitude.

• There is a distinct peak at zero due to the existence of frequent pauses and low-level speech segments.

• Short-time pdfs of speech segments are also single-peaked functions and are usually approximated as a Gaussian distribution.

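2) Autocorrelation Function (ACF)

• There exists much correlation between adjacent samples of a segment of speech, which allows each sample to be predicted easily from the previous ones. All differential and predictive coding schemes are based on this property.

• Definition:

$C(k) = \frac{1}{N} \sum_{n=0}^{N-|k|-1} x(n)\, x(n+|k|)$

where $x(n)$ is the $n$-th speech sample and $N$ is the number of samples in the segment. The ACF is often normalized so that $C(0) = 1$.

• ACF gives a quantitative measure of the closeness between samples.

• Typical signals have an adjacent sample correlation, C(1), as high as 0.85 to 0.9.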
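A minimal sketch of how the adjacent-sample correlation C(1) might be estimated, assuming the short-time definition C(k) = (1/N) Σ x(n) x(n+|k|) normalized by C(0). The test signal is a hypothetical slowly varying sinusoid, not real speech, and the function names are illustrative.

```python
import math

def acf(x, k):
    """Short-time autocorrelation C(k) = (1/N) * sum over n of x[n] * x[n+|k|]."""
    n_samples = len(x)
    k = abs(k)
    return sum(x[n] * x[n + k] for n in range(n_samples - k)) / n_samples

def normalized_acf(x, k):
    """C(k) / C(0), so that the value at lag 0 is exactly 1."""
    return acf(x, k) / acf(x, 0)

# Hypothetical test signal: a slowly varying sinusoid (stand-in for a voiced segment).
signal = [math.sin(2 * math.pi * n / 50) for n in range(400)]
c1 = normalized_acf(signal, 1)
print("normalized C(1):", round(c1, 4))
```

A value of C(1) close to 1 means the next sample is largely predictable from the current one, which is the redundancy that differential coders remove.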

3) Power Spectral Density Function (PSD)

• PSD is nonflat. High frequency components contribute very little to the total speech energy.
  • This can be used to obtain significant compression in the frequency domain.
  • Coding speech separately in different frequency bands can lead to significant coding gain.

Though high frequency components are insignificant in energy, they are very important carriers of speech information and hence need to be adequately represented.

• A qualitative measure of the theoretical maximum coding gain that can be obtained by exploiting the nonflat characteristics of the PSD is given by the spectral flatness measure (SFM).

• SFM is defined as the ratio of the arithmetic mean to the geometric mean of the samples of the PSD taken at uniform intervals in frequency.
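The SFM definition above can be sketched as follows. The PSD values are hypothetical, chosen only to contrast a flat spectrum with one whose energy sits in the low bands. Expressing the ratio in dB is an illustrative choice, not something the slides state.

```python
import math

def spectral_flatness_measure(psd):
    """Ratio of arithmetic mean to geometric mean of positive PSD samples."""
    if not psd or any(p <= 0 for p in psd):
        raise ValueError("PSD samples must be positive")
    arithmetic = sum(psd) / len(psd)
    geometric = math.exp(sum(math.log(p) for p in psd) / len(psd))
    return arithmetic / geometric

flat = spectral_flatness_measure([2.0, 2.0, 2.0, 2.0])    # flat spectrum
tilted = spectral_flatness_measure([8.0, 4.0, 1.0, 0.5])  # energy in the low bands
print("flat:", flat, "tilted:", tilted, "tilted in dB:", 10 * math.log10(tilted))
```

With this definition a flat spectrum gives a ratio of 1 (no gain available), and the ratio grows as the spectrum becomes less flat.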

8.3 Quantization Techniques

(1) Uniform Quantization

• Quantization is the process of mapping a continuous range of amplitudes of a signal into a finite set of discrete amplitudes.
  • The operation is irreversible.
  • It introduces distortion and determines to a great extent the overall distortion of the coder.
• One of the most frequently used measures of distortion: MSE (mean square error).

• The distortion introduced by a quantizer is often modeled as additive quantization noise.

• The performance of a quantizer is measured as the output signal-to-quantization noise ratio (SQNR).

• The SQNR of a PCM encoder with $n$ bits per sample:

$\mathrm{SQNR}_{\mathrm{dB}} = 6.02\,n + a$

where a = 4.77 for peak SQNR and a = 0 for the average SQNR. With one additional bit, the output SQNR improves by about 6 dB.
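A small numerical check of the 6 dB per bit rule, assuming the PCM relation has the form SQNR = 6.02 n + a in dB. The simulation quantizes a hypothetical full-range ramp with a uniform mid-rise quantizer. The ramp and the quantizer layout are assumptions made for illustration.

```python
import math

def sqnr_db(bits, peak=False):
    """Rule of thumb: 6.02 dB per bit, plus 4.77 dB when peak SQNR is wanted."""
    return 6.02 * bits + (4.77 if peak else 0.0)

def measured_sqnr_db(bits, num_points=1 << 16):
    """Quantize evenly spread amplitudes in [-1, 1) with a uniform mid-rise quantizer."""
    step = 2.0 / (1 << bits)
    signal_power = 0.0
    noise_power = 0.0
    for i in range(num_points):
        x = -1.0 + (2 * i + 1) / num_points      # evenly spread amplitudes
        q = (math.floor(x / step) + 0.5) * step  # reconstruction level
        signal_power += x * x
        noise_power += (x - q) ** 2
    return 10 * math.log10(signal_power / noise_power)

for bits in (6, 7, 8):
    print(bits, round(sqnr_db(bits), 2), round(measured_sqnr_db(bits), 2))
```

For a uniformly distributed full-range input the measured value tracks the a = 0 case. Real speech does not fill the quantizer range uniformly, so its SQNR is lower.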

(2) Nonuniform Quantization

• Distribute the quantization levels in accordance with the pdf of the input waveform.
• Mean square distortion:

$D = E\left[(x - f_Q(x))^2\right] = \int_{-\infty}^{\infty} \left[x - f_Q(x)\right]^2 p(x)\, dx$

where $f_Q(x)$ is the quantizer output for input $x$ and $p(x)$ is the pdf of the input.

• To design an optimal nonuniform quantizer, we need to determine the quantization levels which will minimize the distortion of a signal with a given pdf.

• The Lloyd-Max algorithm provides a method to determine the optimum quantization levels by iteratively changing the quantization levels in a manner that minimizes the mean square distortion.
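The iteration described above can be sketched on a set of training samples rather than an analytic pdf. This is a simplified, sample-based variant. The data values, the starting levels, and the function names are hypothetical.

```python
def lloyd_max(samples, levels, iterations=50):
    """Iteratively refine quantization levels to reduce mean square distortion."""
    levels = sorted(levels)
    for _ in range(iterations):
        # Decision thresholds sit halfway between neighbouring levels.
        thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        cells = [[] for _ in levels]
        for x in samples:
            index = sum(1 for t in thresholds if x > t)
            cells[index].append(x)
        # Each level moves to the centroid (mean) of the samples in its cell.
        updated = [sum(c) / len(c) if c else lv for c, lv in zip(cells, levels)]
        if updated == levels:
            break
        levels = updated
    return levels

def mean_square_distortion(samples, levels):
    return sum(min((x - lv) ** 2 for lv in levels) for x in samples) / len(samples)

# Hypothetical training data: many small amplitudes, a few large ones.
data = [-0.9, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.9]
start = [-0.75, -0.25, 0.25, 0.75]  # uniform 2-bit quantizer
tuned = lloyd_max(data, start)
print("levels:", tuned)
print("distortion:", mean_square_distortion(data, start), "->",
      mean_square_distortion(data, tuned))
```

The tuned levels crowd toward the small amplitudes, where most of the samples lie, which is the behaviour a nonuniform quantizer is meant to have.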

• A simple and robust implementation: the logarithmic quantizer.

• Different companding techniques:
  • µ-law (U.S.)
  • A-law (Europe)
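A minimal sketch of µ-law companding followed by uniform quantization, using the standard compression curve F(x) = sgn(x) ln(1 + µ|x|) / ln(1 + µ). The value µ = 255 and the 8-bit mid-rise quantizer are assumptions for illustration. This is not the segmented encoding used in deployed PCM equipment.

```python
import math

MU = 255.0  # assumed value; commonly used with 8-bit North American PCM

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1], expanding small amplitudes."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def companded_quantize(x, bits=8, mu=MU):
    """Compress, quantize uniformly (mid-rise), then expand."""
    levels = 1 << bits
    step = 2.0 / levels
    y = mu_law_compress(x, mu)
    index = min(levels - 1, int(math.floor((y + 1.0) / step)))
    y_hat = -1.0 + (index + 0.5) * step
    return mu_law_expand(y_hat, mu)

for amplitude in (0.01, 0.1, 0.9):
    print(amplitude, companded_quantize(amplitude))
```

Small amplitudes are reproduced with a finer effective step than large ones, which suits the peaked amplitude distribution of speech.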

(3) Adaptive Quantization

• There is a distinction between the long-term and short-term pdf of speech waveforms, because speech is nonstationary. The dynamic range is usually 40 dB or more.

• A time-varying quantization technique is therefore useful: it varies the step size in accordance with the input signal power.

(4) Vector Quantization

• Shannon's Rate-Distortion Theorem: There exists a mapping from a source waveform to output code words such that for a given distortion D, R(D) bits per sample are sufficient to reconstruct the waveform with an average distortion arbitrarily close to D.

• R(D) is called the rate-distortion function and represents a fundamental limit on the achievable rate for a given distortion.

• The actual rate R has to be greater than R(D).
• Shannon predicted that better performance can be achieved by coding many samples at a time instead of one sample at a time.
• Vector quantization (VQ) is a delayed-decision coding technique which maps a group of input samples (typically a speech frame), called a vector, to a code book index.
  • A code book is set up consisting of a finite set of vectors covering the entire anticipated range of values.
  • In each quantizing interval, the code book is searched and the index of the entry that gives the best match to the input signal frame is selected.

• VQ can yield better performance even when the samples are independent of one another.

• The number of samples in a block (vector) is called the dimension L of the vector quantizer.

• The rate R of the vector quantizer is defined as:

$R = \frac{\log_2 n}{L}$ bits/sample

where n is the size of the VQ code book. R may take fractional values.

• Quantization vectors are used instead of quantization levels.
• Distortion is measured as the squared Euclidean distance between the quantization vector and the input vector.
• VQ is most efficient at very low bit rates (R = 0.5 bits/sample or less).

• But VQ is computationally intensive.
• It is not often used to code speech signals directly.
• It is usually used to quantize speech analysis parameters, such as:
  • Linear prediction coefficients
  • Spectral coefficients
  • Filter bank energies, etc.
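The code book search and the rate calculation can be sketched as below, assuming a full search with squared Euclidean distance and the rate R = log2(n) / L. The four-entry code book is hypothetical and far smaller than anything practical.

```python
import math

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def vq_encode(vector, codebook):
    """Full search: return the index of the closest code vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))

def vq_decode(index, codebook):
    return codebook[index]

def vq_rate(codebook_size, dimension):
    """Bits per sample: log2(n) / L."""
    return math.log2(codebook_size) / dimension

# Hypothetical 4-entry code book of dimension L = 4 (illustrative values only).
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [-0.5, -0.5, -0.5, -0.5],
]
index = vq_encode([0.4, -0.6, 0.55, -0.3], CODEBOOK)
print("index:", index, "decoded:", vq_decode(index, CODEBOOK))
print("rate:", vq_rate(len(CODEBOOK), 4), "bits/sample")
```

The search cost grows with the code book size n, which is why the slides call VQ computationally intensive.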

8.4 Adaptive Differential Pulse Code Modulation (ADPCM)

• A more efficient coding scheme.

• Exploits the redundancies present in the speech signal between adjacent samples.

• The difference between adjacent samples is transmitted.

• Allows speech to be encoded at a bit rate of 32 kbps. The CCITT standard G.721 ADPCM algorithm for 32 kbps speech coding is used in cordless telephone systems like CT2 and DECT.

• Signal prediction techniques are used.
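A toy sketch of the idea: quantize the difference between each sample and a prediction with 4 bits, and adapt the step size from the transmitted codes so that the decoder can follow without side information. The previous-sample predictor, the adaptation factors, and the starting step are all assumptions chosen for illustration. This is not the G.721 algorithm. At 8000 samples per second, 4 bits per sample gives the 32 kbps quoted above.

```python
import math

def _adapt(step, code):
    """Grow the step after large codes, shrink it after small ones (assumed rule)."""
    factor = 1.6 if abs(code) >= 4 else 0.9
    return min(max(step * factor, 1e-4), 1.0)

def adpcm_encode(samples, step=0.02):
    """4-bit codes for the difference between each sample and its prediction."""
    codes, prediction = [], 0.0
    for x in samples:
        code = max(-8, min(7, round((x - prediction) / step)))
        codes.append(code)
        prediction += code * step  # same reconstruction as the decoder
        step = _adapt(step, code)
    return codes

def adpcm_decode(codes, step=0.02):
    output, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        output.append(prediction)
        step = _adapt(step, code)
    return output

# Hypothetical input: a sinusoid standing in for a voiced speech segment.
signal = [math.sin(2 * math.pi * n / 40) for n in range(200)]
codes = adpcm_encode(signal)
decoded = adpcm_decode(codes)
max_error = max(abs(x - y) for x, y in zip(signal, decoded))
print("bit rate at 8 kHz:", 8000 * 4, "bps; max error:", round(max_error, 4))
```

Because both sides update the step from the codes alone, encoder and decoder stay synchronized as long as no codes are corrupted in transmission.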

8.5 Frequency Domain Coding of Speech

• The speech signal is divided into a set of frequency components which are quantized and encoded separately.

• Different frequency bands can be preferentially encoded according to some perceptual criteria for each band.

• The quantization noise can be contained within bands and prevented from creating harmonic distortions outside the band.

Advantage: The number of bits used to encode each frequency component can be dynamically varied and shared among the different bands.

(1) Sub-band Coding

• The human ear does not detect the quantization distortion at all frequencies equally well.

• It is therefore possible to achieve substantial improvement in quality by coding the signal in narrower bands.

• In a sub-band coder, speech is typically divided into four or eight sub-bands by a bank of filters, and each sub-band is sampled at a bandpass Nyquist rate and encoded with different accuracy in accordance with a perceptual criterion.

Ways of band-splitting:

• Divide the entire speech band into unequal sub-bands that contribute equally to the articulation index.

Method suggested by Crochiere:

Sub-band Number    Frequency Range
1                  200-700 Hz
2                  700-1310 Hz
3                  1310-2020 Hz
4                  2020-3200 Hz

• Divide the band into equal sub-bands and assign to each sub-band a number of bits proportional to its perceptual significance.

Octave band splitting is often employed instead of equal splitting. As the human ear has an exponentially decreasing sensitivity to frequency, this kind of splitting is more in tune with the perception process.

Method for processing the sub-band signals: make a low pass translation of the sub-band signal to zero frequency by a modulation process equivalent to single sideband modulation.

• The low pass translation technique is straightforward and takes advantage of a bank of nonoverlapping bandpass filters.

• Perceptible aliasing effects exist unless sophisticated bandpass filters are used. Some techniques have been developed to deal with this.

Sub-band coding is useful for bit rates in the range 9.6 to 32 kbps, especially when the bit rate is below 16 kbps.

The CD-900 cellular telephone system uses sub-band coding for speech compression.

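(2) Adaptive Transform Coding

• Transform windowed input segments of the speech waveform. Each segment is represented by a set of transform coefficients, which are separately quantized and transmitted.

• More complex.

• Successfully used to encode speech at bit rates in the range 9.6 kbps to 20 kbps.

• The discrete cosine transform (DCT) is usually used to implement the transform.

• The DCT of an N-point sequence x(n):

DCT: $X_c(k) = \sum_{n=0}^{N-1} x(n)\, g(k) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad k = 0, 1, \ldots, N-1$

where $g(0) = 1$ and $g(k) = \sqrt{2}$ for $k = 1, \ldots, N-1$.

IDCT: $x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X_c(k)\, g(k) \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad n = 0, 1, \ldots, N-1$

Fast algorithms have been developed to compute the DCT and IDCT.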
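A direct O(N²) sketch of the transform pair, assuming the normalization in which the forward DCT weights coefficient k by g(k), with g(0) = 1 and g(k) = √2 otherwise, and the inverse divides by N. Other texts use different scalings. The eight-sample frame is hypothetical.

```python
import math

def dct(x):
    """Forward DCT: X(k) = sum over n of x(n) * g(k) * cos((2n+1) k pi / 2N)."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        g = 1.0 if k == 0 else math.sqrt(2.0)
        out.append(sum(x[n] * g * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
                       for n in range(n_points)))
    return out

def idct(coeffs):
    """Inverse DCT: x(n) = (1/N) * sum over k of X(k) * g(k) * cos((2n+1) k pi / 2N)."""
    n_points = len(coeffs)
    out = []
    for n in range(n_points):
        total = 0.0
        for k in range(n_points):
            g = 1.0 if k == 0 else math.sqrt(2.0)
            total += coeffs[k] * g * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
        out.append(total / n_points)
    return out

segment = [1.0, 0.9, 0.7, 0.4, 0.2, 0.1, 0.0, -0.1]  # hypothetical smooth frame
coeffs = dct(segment)
restored = idct(coeffs)
print([round(c, 3) for c in coeffs])
```

For a smooth frame most of the energy lands in the first few coefficients, so an adaptive transform coder can spend its bits there.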

8.6 Vocoders

• Vocoders are a class of speech coding systems that analyze the voice signal at the transmitter, transmit parameters derived from the analysis, and then synthesize the voice at the receiver using those parameters.

All vocoder systems attempt to model the speech generation process as a dynamic system and try to quantify certain physical constraints of the system.

• Characteristics:
  • Much more complex than waveform coders
  • Achieve very high economy in transmission bit rate
  • Less robust
  • Tend to be talker dependent

• Types:
  • Linear predictive coder (LPC)
  • Channel vocoder
  • Formant vocoder
  • Cepstrum vocoder
  • Voice excited vocoder

• All vocoding systems are based on a speech generation model.
• The sound generating mechanism forms the source and is linearly separated from the intelligence modulating vocal tract filter which forms the system.

8.7 Linear Predictive Coders (LPCs)

8.7.1 LPC Vocoders

• Belong to the time domain class of vocoders.
• Attempt to extract the significant features of speech from the time waveform.
• Computationally intensive, but the most popular among the class of low bit rate vocoders.
• Transmit good quality voice at 4.8 kbps and poorer quality voice at even lower rates.
• Model the vocal tract as an all-pole linear filter.

• The excitation to the filter is either a pulse at the pitch frequency or random white noise, depending on whether the speech segment is voiced or unvoiced.

• The coefficients of the all-pole filter are obtained in the time domain using linear prediction techniques.

• The prediction principles are similar to those in ADPCM, but the LPC coder transmits only selected characteristics of the error signal, including:
  • Gain factor (G)
  • Pitch information
  • Voiced/unvoiced decision information

• At the receiver, the received information about the error signal is used to determine the appropriate excitation for the synthesis filter.

That is, the error signal is the excitation to the decoder.

• The synthesis filter is designed at the receiver using the received predictor coefficients.

• Various LPC schemes differ in the way they recreate the error signal (excitation) at the receiver. There are three alternatives.

• The first one is the most popular.
  • It uses two sources at the receiver, one of white noise and the other with a series of pulses at the current pitch rate.
  • The selection of either of these excitation methods is based on the voiced/unvoiced decision made at the transmitter and communicated to the receiver along with the other information.

• This technique requires that the transmitter extract pitch frequency information, which is often very difficult.

• Moreover, the phase coherence between the harmonic components of the excitation pulse tends to produce a buzzy twang in the synthesized speech.

These problems are mitigated in the other two approaches: multi-pulse excited LPC and stochastic or code-excited LPC.
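The analysis and synthesis steps of an LPC vocoder can be sketched as follows, assuming the autocorrelation method solved with the Levinson-Durbin recursion, which is one common way to obtain the all-pole coefficients (the slides do not name a method). The filter coefficients, pitch period, and gain are hypothetical.

```python
def autocorrelation(x, max_lag):
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations; returns (a, error) with x[n] ~ sum a[i] * x[n-1-i]."""
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / error  # reflection coefficient
        previous = a[:]
        a[i] = k
        for j in range(i):
            a[j] = previous[j] - k * previous[i - 1 - j]
        error *= (1.0 - k * k)
    return a, error

def synthesize(a, excitation, gain=1.0):
    """All-pole synthesis filter driven by the chosen excitation."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a[i] * out[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        out.append(gain * e + past)
    return out

# Hypothetical "vocal tract": a stable two-pole filter, observed via its impulse response.
true_a = [1.2, -0.6]
frame = synthesize(true_a, [1.0] + [0.0] * 299)
est_a, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)

# Voiced synthesis at the receiver: pulse train at an assumed pitch period of 40 samples.
pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]
voiced = synthesize(est_a, pulses, gain=0.5)
print("estimated coefficients:", [round(c, 4) for c in est_a])
```

For an unvoiced frame the receiver would drive the same filter with white noise instead of the pulse train.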

8.7.2 Multi-pulse Excited LPC

No matter how well the pulse is positioned, excitation by a single pulse per pitch period produces audible distortion.

• Atal suggested using more than one pulse, typically eight per period, and adjusting the individual pulse positions and amplitudes sequentially to minimize a spectrally weighted mean square error.

• This technique is called multi-pulse excited LPC (MPE-LPC).
• It can result in better speech quality, because:
  • The prediction residual is better approximated by several pulses per pitch period.
  • The multi-pulse algorithm does not require pitch detection.

8.7.3 Code-Excited LPC

• In this method, the coder and decoder have a predetermined code book of stochastic (zero-mean white Gaussian) excitation signals.
• For each speech signal the transmitter searches through its code book of stochastic signals for the one that gives the best perceptual match to the sound when used as an excitation to the LPC filter.

• The index of the code book where the best match was found is then transmitted.

• The receiver uses this index to pick the correct excitation signal for its synthesizer filter.

• Extremely complex, but can provide high quality even when the excitation is coded at only 0.25 bits per sample.

Advances in DSP and VLSI technology have made real-time implementation of CELP codecs possible.

Example: the CDMA digital cellular standard (IS-95) uses a variable rate CELP codec at 1.2 to 14.4 kbps, and QCELP13 at 13.4 kbps.
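A simplified sketch of the analysis-by-synthesis search: every code book entry is passed through the LPC synthesis filter and the entry (with its least-squares gain) that best matches the target frame is chosen. It uses plain squared error instead of a perceptually weighted error, and the code book size, frame length, and filter coefficients are hypothetical. With a 40-sample frame, a 1024-entry code book would cost 10 / 40 = 0.25 bits per sample for the index.

```python
import random

def synthesize(a, excitation):
    """All-pole LPC synthesis filter: y[n] = e[n] + sum a[i] * y[n-1-i]."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a[i] * out[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        out.append(e + past)
    return out

def celp_search(target, codebook, a):
    """Return (index, gain) of the entry whose synthesized output best matches target."""
    best = (None, 0.0, float("inf"))
    for index, entry in enumerate(codebook):
        y = synthesize(a, entry)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best[2]:
            best = (index, gain, err)
    return best[0], best[1]

rng = random.Random(0)
FRAME = 40
# Small stochastic code book of zero-mean Gaussian sequences (16 entries for speed).
CODEBOOK = [[rng.gauss(0.0, 1.0) for _ in range(FRAME)] for _ in range(16)]
lpc = [1.2, -0.6]
# Target built from entry 5 so the search has a known right answer.
target = [0.5 * v for v in synthesize(lpc, CODEBOOK[5])]
index, gain = celp_search(target, CODEBOOK, lpc)
print("chosen index:", index, "gain:", round(gain, 3))
```

Only the index and the gain need to be sent, because the receiver holds the same code book. The cost is one filtering pass per entry for every frame, which is where the complexity comes from.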

8.7.4 Residual Excited LPC

• In this class of LPC coders, after estimating the model parameters (LP coefficients or related parameters) and excitation parameters (voiced/unvoiced decision, pitch, gain) from a speech frame, the speech is synthesized at the transmitter and subtracted from the original speech signal to form a residual signal.

• The residual signal is quantized, coded, and transmitted to the receiver along with the LPC model parameters.

• At the receiver the residual error signal is added to the signal generated using the model parameters to synthesize an approximation of the original speech signal.

• The quality of the synthesized speech is improved due to the addition of the residual error.

8.8 Choosing Speech Codecs for Mobile Communications

• Choosing the right speech codec is an important step in the design of a digital mobile communication system.

• A balance must be struck between the perceived quality of the speech resulting from this compression and the overall system cost and capacity.

• Other criteria include:
  • The end-to-end encoding delay
  • The algorithmic complexity of the coder
  • The DC power requirements
  • Compatibility with existing standards
  • Robustness of the encoded speech to transmission errors

Different speech coders show varying degrees of immunity to transmission errors.

• The choice of the speech coder will also depend on the cell size used.

Cordless telephone systems:
• The cell size is sufficiently small that high spectral efficiency is achieved through frequency reuse, so a simple high rate speech codec is enough.
• In CT2 and DECT, which use very small cells (microcells), 32 kbps ADPCM coders are used to achieve acceptable performance even without channel coding and equalization.

Cellular systems:
• Poorer channel conditions make error correction coding necessary, which requires the speech codecs to operate at lower bit rates.

Mobile satellite communications:
• Cell sizes are very large and the available bandwidth is very small.
• The speech rate must be of the order of 3 kbps, requiring the use of vocoder techniques.

• The type of multiple access technique used, being an important factor in determining the spectral efficiency of the system, strongly influences the choice of speech codec.

• The type of modulation employed also has considerable impact on the choice of speech codec.

8.9 The GSM Codec

The speech coder used in the pan-European digital cellular standard GSM was chosen after conducting exhaustive subjective tests on various competing codecs.

• The name is rather grandiose: regular pulse excited long-term prediction (RPE-LTP) codec.

• Net bit rate: 13 kbps.
• RPE-LTP is a combination of two proposed codecs:
  • Baseband RELP codec, proposed by France.
    • Advantage: provides good quality speech at low complexity.
    • Drawback: affected by channel errors.
  • Multi-pulse excited long-term prediction (MPE-LTP) codec, proposed by Germany.
    • Advantage: produces excellent speech quality, not much affected by bit errors in the channel.
    • Drawback: high complexity.

• By modifying the RELP codec to incorporate certain features of the MPE-LTP codec, the net bit rate was reduced from 14.77 kbps to 13.0 kbps without loss of quality.

The most important modification was the addition of a long-term prediction loop.

• The GSM codec is relatively complex and power hungry.

STP: short-term prediction; LTP: long-term prediction; RPE: regular pulse excitation

8.10 The USDC Codec (Skip)

8.11 Performance Evaluation of Speech Coders

There are two approaches to evaluating the performance of a speech coder in terms of its ability to preserve the signal quality:

(1) Objective measures

• Mean square error (MSE) distortion
• Frequency weighted MSE
• Segmented SNR
• Articulation index, etc.

• These have the general nature of a signal-to-noise ratio and provide a quantitative value of how well the reconstructed speech approximates the original speech.

• Useful in initial design and simulation of coding systems.
• Do not necessarily give an indication of speech quality as perceived by the human ear, and it is the listener who is the ultimate judge of the signal quality.
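Two of the objective measures listed above can be sketched as follows: a conventional SNR over the whole signal and a segmented SNR that averages per-frame values in dB. The frame length of 160 samples (20 ms at 8 kHz) and the test signal are assumptions for illustration.

```python
import math

def snr_db(original, decoded):
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    return 10 * math.log10(signal / noise)

def segmental_snr_db(original, decoded, frame_length=160):
    """Average of per-frame SNRs in dB; frames with no signal or no error are skipped."""
    values = []
    for start in range(0, len(original) - frame_length + 1, frame_length):
        x = original[start:start + frame_length]
        y = decoded[start:start + frame_length]
        signal = sum(v * v for v in x)
        noise = sum((u - v) ** 2 for u, v in zip(x, y))
        if signal > 0 and noise > 0:
            values.append(10 * math.log10(signal / noise))
    return sum(values) / len(values) if values else float("nan")

# Hypothetical signal: one loud frame and one quiet frame with the same coding error.
original = [1.0] * 160 + [0.1] * 160
decoded = [x + 0.01 for x in original]
overall = snr_db(original, decoded)
segmented = segmental_snr_db(original, decoded)
print("overall SNR:", round(overall, 2), "dB; segmented SNR:", round(segmented, 2), "dB")
```

The overall SNR is dominated by the loud frame, while the segmented measure also reflects how poorly the quiet frame is reproduced.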


(2) Subjective listening tests

• Play the sample to a number of listeners and ask them to judge the quality of the speech.

Speech coders are highly speaker dependent in that the quality varies with the age and gender of the speaker, the speed at which the speaker speaks, and other factors.

• Carried out in different environments to simulate real life conditions, such as noisy surroundings, multiple speakers, etc.
• Terms used to describe the results:
  • Overall quality
  • Listening effort
  • Intelligibility, which measures the listener's ability to identify the spoken word
  • Naturalness

• The results of these kinds of tests are difficult to rank and hence require a reference system.

• The most popular ranking system is the mean opinion score (MOS) ranking.
  • A five-point quality ranking scale
  • Each point is associated with a standardized description.

• In general, the MOS rating of a speech codec decreases with decreasing bit rate.
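A small sketch of how a MOS might be computed from listener ratings. The labels are the ones commonly attached to the five-point scale and are an assumption here, since the table itself is not reproduced in these notes. The ratings are made up for illustration and are not measured data.

```python
from statistics import mean

# Labels commonly attached to the five-point scale (assumed; not taken from the slides).
MOS_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

def mean_opinion_score(ratings):
    if any(r not in MOS_LABELS for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)

# Hypothetical listener ratings for two coders (not measured data).
ratings = {"coder A": [4, 4, 5, 3, 4], "coder B": [3, 2, 3, 3, 4]}
ranking = sorted(ratings, key=lambda name: mean_opinion_score(ratings[name]), reverse=True)
for name in ranking:
    print(name, mean_opinion_score(ratings[name]))
```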
