11
Vector Quantization of Mel-Cepstral Coefficients Using Distortion Measure for Spectral Analysis Toru Takahashi, 1 Keiichi Tokuda, 1 Takao Kobayashi, 2 and Tadashi Kitamura 1 1 Department of Computer Science, Nagoya Institute of Technology, Nagoya, 466-8555 Japan 2 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, 226-8502 Japan SUMMARY This paper proposes a method for vector quantization of mel-cepstral coefficients based on a measure for spectral analysis. It is a method in which the match between the input signal within the frame and the mel-cepstral coeffi- cients stored in the codebook is evaluated by using a statis- tical measure for spectral analysis, and the quantized mel-cepstral coefficients are derived. The codebook train- ing algorithm for the proposed method is also presented. The proposed method is further applied to multistage vector quantization. It is shown by an experiment using a speech database that the proposed method with a codebook size of 9 bits can achieve the same quantization distortion as the conventional 12-bit method (vector quantization based on Euclidean distance between mel-cepstral coefficients). © 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(13): 104–114, 2007; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.10432 Key words: multistage vector quantization; vector quantization; statistical measure; mel-cepstral analysis. 1. Introduction Applications based on speech coding techniques, such as portable phones and Internet telephony, have re- cently come into widespread use as important aspects of the social infrastructure. It is considered very important to encode speech signals efficiently in order to reduce the communication cost and the cost of speech information storage in these applications. Consequently, various speech coding methods have been proposed. However, most of them are based on some spectral analysis procedure. Thus, a crucial factor that determines the system characteristics is the choice of a representation of the spectrum and of the method used to quantize and transmit the parameters of the spectral analysis. The general approach in vector quantization of the parameters of spectral analysis [1] has been to convert the time-series signal that is the target of analysis to quantized parameters in two stages, the analysis and quantization stages. In such an approach, there are different criteria in the two stages. On the other hand, it seems possible that the parameters may be quantized more efficiently by handling the time-series signal that is the target of analysis by means of a unified criterion throughout the analysis and quantiza- tion. Consequently, this paper proposes a vector quantiza- tion method in which a unified measure common to the analysis and quantization stages is applied to a spectral model with exponential function representation (exponen- tial spectral model) using mel-cepstral analysis [2]. A code- book training algorithm suited to the proposed quantization process is presented. A similar idea is also seen in LPC-VQ [3, 4]. How- ever, the proposed method has the following advantages which do not exist in LPC-VQ. © 2007 Wiley Periodicals, Inc. Systems and Computers in Japan, Vol. 38, No. 13, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J85-D-II, No. 8, August 2002, pp. 1273–1283 104

Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

Embed Size (px)

Citation preview

Page 1: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

Vector Quantization of Mel-Cepstral Coefficients UsingDistortion Measure for Spectral Analysis

Toru Takahashi,1 Keiichi Tokuda,1 Takao Kobayashi,2 and Tadashi Kitamura1

1Department of Computer Science, Nagoya Institute of Technology, Nagoya, 466-8555 Japan

2Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, Yokohama, 226-8502 Japan

SUMMARY

This paper proposes a method for vector quantizationof mel-cepstral coefficients based on a measure for spectralanalysis. It is a method in which the match between theinput signal within the frame and the mel-cepstral coeffi-cients stored in the codebook is evaluated by using a statis-tical measure for spectral analysis, and the quantizedmel-cepstral coefficients are derived. The codebook train-ing algorithm for the proposed method is also presented.The proposed method is further applied to multistage vectorquantization. It is shown by an experiment using a speechdatabase that the proposed method with a codebook size of9 bits can achieve the same quantization distortion as theconventional 12-bit method (vector quantization based onEuclidean distance between mel-cepstral coefficients).© 2007 Wiley Periodicals, Inc. Syst Comp Jpn, 38(13):104–114, 2007; Published online in Wiley InterScience(www.interscience.wiley.com). DOI 10.1002/scj.10432

Key words: multistage vector quantization; vectorquantization; statistical measure; mel-cepstral analysis.

1. Introduction

Applications based on speech coding techniques,such as portable phones and Internet telephony, have re-cently come into widespread use as important aspects of thesocial infrastructure. It is considered very important to

encode speech signals efficiently in order to reduce thecommunication cost and the cost of speech informationstorage in these applications. Consequently, various speechcoding methods have been proposed. However, most ofthem are based on some spectral analysis procedure. Thus,a crucial factor that determines the system characteristics isthe choice of a representation of the spectrum and of themethod used to quantize and transmit the parameters of thespectral analysis.

The general approach in vector quantization of theparameters of spectral analysis [1] has been to convert thetime-series signal that is the target of analysis to quantizedparameters in two stages, the analysis and quantizationstages. In such an approach, there are different criteria inthe two stages. On the other hand, it seems possible that theparameters may be quantized more efficiently by handlingthe time-series signal that is the target of analysis by meansof a unified criterion throughout the analysis and quantiza-tion.

Consequently, this paper proposes a vector quantiza-tion method in which a unified measure common to theanalysis and quantization stages is applied to a spectralmodel with exponential function representation (exponen-tial spectral model) using mel-cepstral analysis [2]. A code-book training algorithm suited to the proposed quantizationprocess is presented.

A similar idea is also seen in LPC-VQ [3, 4]. How-ever, the proposed method has the following advantageswhich do not exist in LPC-VQ.

© 2007 Wiley Periodicals, Inc.

Systems and Computers in Japan, Vol. 38, No. 13, 2007Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J85-D-II, No. 8, August 2002, pp. 1273–1283

104

Page 2: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

• In LPC-VQ, an all-pole spectral model with linearprediction coefficient representation is used torepresent the spectral envelope. Thus, it is difficultto represent the valleys of the spectral envelope,which correspond to the zeros. In the proposedmethod based on the exponential spectral model,on the other hand, the poles and zeros are handledon an equal basis, and the peaks and valleys of thespectral envelope can be represented with thesame precision.

• In LPC-VQ, the codeword is represented by thelinear prediction coefficients. As a result, the fre-quency resolution of the spectral envelope repre-sented by the linear prediction coefficientsremains constant on the frequency axis. In theproposed method, on the other hand, the codewordis represented by the mel-cepstral coefficients. Asa result, a spectrum matched to the human hearingcharacteristics, in which the resolution is higher atlow frequencies, can be represented efficiently.

• Multistage vector quantization, which is difficultin LPC-VQ, can be easily handled.

In this paper, an experiment using a speech databaseis presented. The proposed method, LPC-VQ, and the con-ventional method (the conventional method of vector quan-tization in which the mel-cepstral analysis is applied in theanalysis stage and quantization is performed on the basis ofthe Euclidean distance between coefficients in the quanti-zation stage) are evaluated quantitatively from the view-point of the average distortion.

Below, Section 2 describes the mel-cepstral analysis.Section 3 describes vector quantization by the proposedmethod, together with the codebook training algorithm.Section 4 extends the proposed method given in Section 3to multistage vector quantization. Section 5 presents theresults of a quantization experiment with three differentquantization procedures, namely, the methods proposed inSections 3 and 4, LPC-VQ, and the conventional vectorquantization. Section 6 summarizes the paper and identifiesproblems to be addressed in the future.

2. Mel-Cepstral Analysis

The synthesis filter H(z) used to represent the spec-trum envelope is represented as follows by using the mel-cepstral coefficients:

where α is a constant related to frequency warping, satisfy-ing

When α = 0, the c(m) are equivalent to the cepstral coeffi-cients [5]. When α < 0, a spectral representation with highresolution in the high-frequency range is obtained. When α> 0, a spectral representation with high resolution in thelow-frequency range is obtained. By setting α to an appro-priate value, a representation matched to the human hearingcharacteristics, in which the resolution is high at low fre-quencies, is obtained.

When a frame of the input signal x = [x(0), . . . , x(N– 1)]T is given, the mel-cepstral coefficients c = [c(0), . . . ,c(M)]T are determined by minimizing the spectrum evalu-ation function [5]:

with respect to c [2].Here,

w(n) is a window function with a window length of N. SinceE(x, c) is a convex function of c, the mel-cepstral coeffi-cients can be derived quickly by using an iterative algorithm(such as the Newton–Raphson method).

Assuming that x(n) is a Gaussian process with zeromean, the logarithmic likelihood log P(x|c) of c with re-spect to x can be approximated as follows:

Thus, it can be seen that the minimization of E(x, c) withrespect to c is equivalent to the maximization of log P(x|c)with respect to c (see Appendix). It is also evident that thespectrum evaluation function used in mel-cepstral analysisis represented in the same form as in linear predictionanalysis [6]. Furthermore, by factoring out the gain term gfrom H(e

jω), it can be seen that the minimization of E(x, c)with respect to c is equivalent to the minimization of the

(3)

(4)

(5)

(6)

(7)

(8)

(1)

(2)

105

Page 3: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

residual prediction energy or the maximization of the pre-diction gain.

3. Vector Quantization of Mel-CepstralCoefficients Using a Measure for Spectral

Analysis

This section presents a method for vector quantiza-tion of speech signals based on a unified statistical measure,as in LPC-VQ. In LPC-VQ, the speech signal is representedby an all-pole model, but in the proposed method the speechsignal is represented by a pole-zero spectral model with anexponential function representation. For use in the quanti-zation, a codebook suited to the proposed method can beobtained by using a training speech signal, based on theunified statistical measure. Such a codebook training algo-rithm is also presented in this section.

3.1. LPC-VQ

LPC-VQ is represented by a linear prediction model.In ordinary linear prediction analysis without quantization,the synthesis filter G(z) used to represent the spectral enve-lope is expressed as follows by using linear predictioncoefficients [6]:

where σ represents the gain term.When a frame of the input signal x = [x(0), . . . , x(N

– 1)]T is given, a = [σ, a(1), . . . , a(M)]T is determined byminimizing the following spectrum evaluation functionwith respect to a:

Here,

and w(n) is a window function with a window length of N.This minimization problem can be solved analytically.

By defining a in this way, the linear prediction coef-ficients can be considered optimal if spectrum evaluationfunction (11) takes a small value. Thus, when the inputframe x is given, it is quantized on the basis of the samecriterion as in linear prediction analysis, by selecting acodeword from the codebook so that spectrum evaluationfunction (11) is minimized.

Then the quantization index for the input frame x isexpressed as follows:

where aj is the j-th codeword and I is the codebook size(number of codewords).

The average distortion of the input signal X = {x1, x2,. . . , xT} for a given codebook A = {a1, a2, . . . , aI} isexpressed as follows:

where

3.2. Vector quantization of mel-cepstralcoefficients

The conventional vector quantization can be consid-ered as quantization of a vector composed of the mel-cep-stral coefficients in two stages, the analysis andquantization stages. The vector quantization in the conven-tional method is composed as shown in Fig. 1.

For the input speech frame x ∈ RN, the correspondingmel-cepstral coefficients c ∈ RM+1 are analyzed as follows,with spectrum evaluation function (4) as the criterion:

Then the quantization index is defined as follows, based onsome distance measure E′(c, c), which is defined before-hand on RM+1:

where cj is the j-th codeword.In such a quantization in two stages, there is a prob-

lem that different measures (E and E′) are produced for the

(12)

(13)

(14)

(11)

(15)

(9)

(10)

(16)

(17)

(20)

(21)

(18)

(19)

106

Page 4: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

analysis and quantization stages, and thus that the quanti-zation is not based on a unified criterion.

In Section 2, the optimal mel-cepstral coefficientswere defined as the coefficients minimizing spectrumevaluation function (4). In other words, it was consideredthat the mel-cepstral coefficients are close to optimal whenthe spectrum evaluation function (4) takes a small value, asin linear prediction analysis. Consequently, an input framex can be quantized on the basis of the same criterion as inthe mel-cepstral analysis by selecting the codeword fromthe codebook so that spectrum evaluation function (4) isminimized.

The vector quantization by the proposed method isformed as in Fig. 2. The quantization index for input framex is represented as follows:

where cj is the j-th codeword and I is the codebook size(number of codewords). In the above quantization proce-dure, the input frame is quantized by directly applying thespectrum evaluation function used in the mel-cepstralanalysis. Consequently, the problem of unification of the

evaluation measures between analysis and quantizationstages that occurs in the conventional method is eliminated.

3.3. Codebook training algorithm

This section discusses the codebook training algo-rithm in the conventional vector quantization and the opti-mal codebook training algorithm to be used in vectorquantization based on the statistical measure.

In the conventional vector quantization, the problemarises that the same criterion is not used in the codebooktraining stage and the analysis stage. Letting the set ofvectors for codebook training be X = {x1, x2, . . . , xT}, thetraining vectors are converted to the set of vectors C′ ={c1

g , c2g , . . . , cT

g } composed of the cepstral coefficients, inaccordance with Eq. (20), based on spectrum evaluationfunction (4):

where

The average distortion of codebook C = {c1, c2, . . . , cI},and also the cluster distortion and the quantization index ofthe i-th cluster, are represented as follows:

Then, the optimal codebook C* = {c1∗, c2

∗, . . . , cI∗ } is repre-

sented as

In the conventional vector quantization, the codebook train-ing algorithm is formed as shown in Fig. 3. This trainingalgorithm can be considered as a procedure in which the setof training vectors is converted to the set of cepstral vectorsby using spectrum evaluation function (4), and the code-book is trained for the obtained set with some distancemeasure E′ as the criterion. That is, the procedure is not acodebook training algorithm based on a unified criterion.

In vector quantization based on the statistical meas-ure, on the other hand, the codebook is trained on the basis

(22)

(24)

(29)

(25)

(26)

(27)

(28)

Fig. 1. The block diagram of conventional vectorquantization.

Fig. 2. The block diagram of proposed vectorquantization.

(23)

107

Page 5: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

of the statistical criterion. For the set of codebook trainingvectors X = {x1, x2, . . . , xT}, the average distortion of thecodebook C = {c1, c2, . . . , cI}, and also the cluster distortionand quantization index of the i-th cluster, are represented asfollows:

Then, the optimal codebook C∗ = {c1∗, c2

∗, . . . , cI∗ } is repre-

sented as

In the proposed vector quantization, the codebook trainingalgorithm is formed as shown in Fig. 4. Figure 5 shows thecodebook training algorithm of the k-means version basedon the average distortion D(X, C). In step 0, an appropriateinitial codebook C0 is given to the algorithm and we set m= 0. Then the updating of the codebook is continued untilsome termination condition (such as the condition that thechange of the average distortion is sufficiently small) isreached.

In step 1 for reestimation of the codebook, the set oftraining vectors is decomposed into clusters by using code-book Cm and Eq. (33). In step 2, the optimal codebookCm+1 is recalculated for the cluster decomposition deter-mined in step 1. The i-th updated value ci

g is determined by

minimizing the cluster distortion given by Eq. (32) withrespect to ci: that is, it is set as follows:

Here,

(34)

Fig. 4. The block diagram of codebook trainingalgorithm for proposed vector quantization.

Fig. 5. Codebook training algorithm (k-means).

(35)

(36)

(37)

(38)

(39)

Fig. 3. The block diagram of codebook trainingalgorithm for conventional vector quantization.

(30)

(31)

(32)

(33)

108

Page 6: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

w(n) is a window function of window length N, and Ti isthe total number of training vectors quantized as the i-thcluster.

Equation (37) is of the same form as Eq. (4). Conse-quently, the minimization problem for Eq. (37) can besolved quickly by using an iterative algorithm (such as theNewton–Raphson method), as in Eq. (4).

Finally, in step 3, the termination condition for thealgorithm is investigated. If the termination condition is notsatisfied, we set m + 1 → m. Then the procedure goes backto step 1 and the codebook is reestimated. If the terminationcondition is satisfied, Cm+1 gives the codebook to be de-rived.

As in the conventional k-means algorithm, it is shownthat the average distortion decreases monotonically withrepetition of steps 1, 2, and 3. The codebook trainingalgorithm of the k-means version is shown above, but thealgorithm can easily be extended to the algorithm of theLBG version [1]. In the experiment described in Section 5,the LBG-version algorithm with codebook decompositionis used to train the codebook.

4. Multistage Vector Quantization

In the vector quantization, it is possible to reduce thequantization error by increasing the number of quantizationbits B. When B is increased, however, the amount of com-putation for quantization, decoding, and codebook trainingincreases in proportion to 2B. A structural vector quantiza-tion technique such as multistage vector quantization [1] isapplied in order to deal with this situation, when it isrequired to reduce the quantization error while maintaininga realistic computation time.

However, LPC-VQ has the problem that the applica-tion of multistage vector quantization is difficult. On theother hand, the proposed method is based on the exponen-tial spectral model and can easily be applied to multistagevector quantization. In the rest of this section, the multi-stage vector quantization in the proposed method is dis-cussed, and the training algorithm is presented.

4.1. Multistage vector quantization ofmel-cepstral coefficients

The K-stage vector quantization of the input frame xis performed as follows. Letting the codebook at the k-thstage be C

(k) = {c 1(k), c 2

(k), . . . , c I(k)}, the quantization index

i (k) ∈ {1, 2, . . . , I} in each stage is formulated as follows:

where

The multistage vector quantization in LPC-VQ can beformulated formally in the same way as in the proposedmethod. However, it is not easy to achieve training so thatthe codebook at each stage is optimized on the basis of thequantization criterion. Furthermore, in LPC-VQ the sum ofthe quantization results of the stages must represent a stablesystem function. This condition makes the optimizationeven more difficult.

On the other hand, the situation in the proposedmethod is as follows. The system function H(z) of thesynthesis filter reconstructed by K-stage vector quantiza-tion is obtained as follows. It follows from the definition ofthe multistage vector quantization that

Consequently,

In other words,

Thus, the input to the quantizer of the k-th stage is consid-ered as the signal x(1) obtained by inverse filtering of x,using the system function H

(1)(e jω) ⋅ H

(2)(e jω) ⋅ ⋅ ⋅

H (k−1)(e

jω) up to the (k – 1)-th stage. Above we set x(1) = x.

(40)

(41)

(42)

(43)

(44)

(46)

(47)

(48)

(45)

(49)

109

Page 7: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

The configuration of multistage vector quantization by theproposed method using inverse filtering is shown in Fig. 6.

4.2. Codebook training algorithm

In K-stage vector quantization, let the set of codebooktraining vectors be

and let the input at the k-th stage be

Then the average distortion for the set of training vectors,and also the cluster distortion and the quantization index forthe i-th cluster, are represented as follows, in the same wayas in Eqs. (31), (32), and (33):

The optimal codebook at the k-th stage is

Since Eq. (56) takes the same form as Eq. (34), the algo-rithm given in Section 3.3 can be directly applied as thecodebook training algorithm in each stage.

5. Experiments

In order to evaluate the proposed method quantita-tively, three codebooks (CB1, CB2, and CB3) were con-structed and the average distortions were compared. Themethod obtained by unifying the criteria of linear predictionanalysis and vector quantization is called LPC-VQ. Simi-larly, the proposed method in which the criteria of mel-cep-stral analysis and vector quantization are unified is calledMel-Cepstrum-VQ. In the description of the experiment,Mel-Cepstrum-VQ is abbreviated as MC-VQ.

Three codebooks were used in training. The methodof training the codebook using Eq. (23), and also Eq. (29)with E′ as the Euclidean distance function, is defined as theconventional method. CB1 is the codebook trained by theconventional method, CB2 is the codebook trained by thetraining algorithm proposed for MC-VQ, and CB3 is thecodebook trained by the training algorithm for LPC-VQ.

Fig. 6. The block diagram of multistage vectorquantization.

(50)

(52)

(53)

(54)

(55)

(56)

Table 1. Experimental conditions

(51)

110

Page 8: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

In CB1 and CB2, the order of the mel-cepstral analy-sis is set as 10. The codeword is an 11-dimensional vector,including 0th-order coefficients, which represents the gain.In LPC-VQ, the order of the linear prediction analysis is setas 10 and the codeword is an 11-dimensional vector includ-ing the gain term. As the data for codebook training, 216words uttered by 40 subjects (20 males and 20 females) inthe ATR Japanese speech database C-set were used. As thetest data, 520 words uttered by 40 subjects (20 males and20 females) other than those used for training were used.Table 1 shows the detailed conditions of the experiment.

In the comparison of the average distortion, the aver-age distortions for three different trained codebooks werederived and evaluated by four methods. Table 2 shows thecodebook training criteria, the criteria for decoding, theapplied spectral model, and the codebook investigated foreach of Baseline, MC-VQ′, LPC-VQ, and MC-VQ.

For Baseline, which is characterized as the conven-tional method, the average distortion was derived by Eqs.(26), (27), and (21), using codebook CB1. The averagedistortion for MC-VQ′ was derived by Eqs. (31), (32), and(33), as in the proposed method. For Baseline and MC-VQ′,the average distortion was derived using the same code-book, but the derivation of the quantization index wasdifferent. By comparing the two results, the effect of thederivation of the quantization index on the average distor-tion can be determined.

For MC-VQ, which is characterized as the proposedmethod, the average distortion was derived by Eqs. (31),(32), and (33) using codebook CB2. The average distortionfor LPC-VQ was derived by using Eqs. (17), (18), and (19)using codebook CB3. In methods MC-VQ and LPC-VQ,the quantization procedure uses the unified measure basedon the spectral evaluation function. The spectral modeldiffers in MC-VQ and LPC-VQ, and the effect of modelingon the average distortion can be determined by comparingthe two results.

In both MC-VQ′ and MC-VQ, the quantization indexis derived by the proposed method. However, the trainingalgorithm for the codebook is different. Consequently, theimprovement obtained by using the codebook training al-

gorithm for the proposed quantization method can be de-termined by comparing the two results.

5.1. Quantization experiment

Figure 7 shows the average distortion D(⋅,⋅) for thetraining data. It is evident that when the number of quanti-zation bits is the same, the average distortion for the trainingdata decreases in the order of Baseline, MC-VQ′, LPC-VQ,and MC-VQ. Comparing Baseline and MC-VQ′, the effec-tiveness of quantization index derivation by the proposedmethod is clear. By comparing MC-VQ′ and MC-VQ, it isshown that the average distortion can be further reduced byapplying the codebook training algorithm of the proposedmethod. As can be seen from Fig. 7, however, although thequantization distortion in MC-VQ is smaller than that inLPC-VQ, the difference is small, and it is concluded thatalmost the same performance is achieved as regards thequantization distortion for the training data.

Figure 8 shows the average distortion D(⋅,⋅) for thetest data set. It is evident that when the number of quanti-zation bits is the same, that is, 5 bits or less, the averagedistortion decreases in the order of Baseline, MC-VQ′,LPC-VQ, and MC-VQ. When the number of quantizationbits is 6 or more, the average distortion decreases in theorder of Baseline, LPC-VQ, MC-VQ′, and MC-VQ. Theorder of the average distortion is reversed between LPC-VQand MC-VQ′. For the three different procedures Baseline,MC-VQ′, and MC-VQ, however, the order is retained. Thatis, the effectiveness of quantization index derivation by theproposed method and the effectiveness of the codebooktraining algorithm of the proposed method are also shownfor the test data.

Table 2. Training criterion and decoding criterion

Fig. 7. Average distortion for training data set.

111

Page 9: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

It should also be noted that the quantization distortionfor the training data in MC-VQ is almost the same as thatin LPC-VQ. For the test data, on the other hand, almost thesame quantization distortion in 12-bit quantization byBaseline and LPC-VQ is achieved by 9-bit and 11-bitquantization, respectively, by MC-VQ. Thus, it is con-cluded that the effectiveness of MC-VQ has been welldemonstrated.

5.2. Two-stage quantization experiment

Figures 9 and 10 show experimental results for two-stage vector quantization. Since multistage vector quanti-zation cannot be applied to LPC-VQ, these experimentalresults do not include LPC-VQ.

Figure 9 shows the average distortion D(X, C) for thetraining data set. It is clear that when the number of quan-tization bits is the same, the average distortion for thetraining data set decreases in the order of Baseline, MC-VQ′, and MC-VQ. This result gives the same tendency asin the result of the quantization experiment described in theprevious section (Fig. 7). Thus, it is verified that the pro-posed method remains better than the conventional methodeven if the multistage quantization is applied.

Figure 10 shows the average distortion D(X, C) forthe test data set. It is evident that when the number ofquantization bits is the same, the average distortion for thetest data set decreases in the order of Baseline, MC-VQ′,and MC-VQ. It is also clear that the average distortion of24-bit (12+12) Baseline is achieved by 22-bit (11+11)MC-VQ, indicating the effectiveness of the proposedmethod. This result gives the same tendency as in the resultof average distortion for the training data (Fig. 9). Thus, it

is indicated that the proposed method is better regardless ofthe speaker or the vocabulary.

In Fig. 8, the average distortions for 12-bit LPC-VQand 12-bit Baseline are 1.14 and 1.23, respectively. In Fig.10, the average distortions for 10-bit (5+5) MC-VQ and12-bit (6+6) MC-VQ are 1.23 and 1.15, respectively. Thus,it is indicated that although the proposed method has atwo-stage configuration with the same number of quantiza-tion bits, it achieves the same average distortion as in 12-bitquantization in the single-stage LPC-VQ. In addition, it isshown that although the proposed method has a two-stage

Fig. 8. Average distortion for testing data set. Fig. 9. Average distortion for training data set (2-stage VQ).

Fig. 10. Average distortion for testing data set (2-stage VQ).

112

Page 10: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

configuration with two fewer quantization bits, it achievesthe same average distortion as single-stage 12-bit quantiza-tion in the conventional method.

Thus, the proposed method can easily deal with anincrease in the number of quantization bits by applyingmultistage vector quantization. It should also be noted thatquantization or codebook training in a realistic computationtime is impossible in LPC-VQ, since the multistage con-figuration is difficult. But it is possible in the proposedmethod to minimize the quantization distortion with arealistic amount of computation, by applying multistagevector quantization.

The above experiments show that the proposedmethod can achieve smaller distortion than the conven-tional vector quantization of the mel-cepstral coefficients.The effectiveness of the proposed multistage vector quan-tization is verified. Furthermore, it was verified through anexperiment with 80 speakers that the quantization distortioncan be reduced more by the proposed method than byLPC-VQ.

6. Conclusions

This paper has proposed a method of vector quanti-zation of the mel-cepstral coefficients based on a statisticalmeasure, and has presented a codebook training algorithmsuited to the proposed method. It was then shown that theproposed method can be applied to multistage vector quan-tization. It was shown by quantitative evaluation experi-ments that the proposed method can achieve smallerquantization distortion than LPC-VC or the conventionalvector quantization of mel-cepstral coefficients, using asmaller number of quantization bits.

Equations (17) and (31) used in the evaluation in thispaper are measures which are directly related to factorsconducive to performance improvement in speech coding,such as minimization of the energy of the residual predic-tion error. In order to improve performance in real applica-tions, however, other measures will be needed forevaluation. In the future we plan to apply the proposedmethod to a highly efficient coding system and to evaluatethe speech quality, thus demonstrating the effectiveness ofthe method in application.

REFERENCES

1. Gersho A, Gray RM. Vector quantization and signalcompression. Kluwer Academic; 1992.

2. Tokuda K, Kobayashi T, Fukada T, Saito T, Imai S.Speech spectrum estimation with mel-cepstrum asparameter. Trans IEICE 1991;J74-A:1240–1248.

3. Buzo A, Gray AH Jr, Gray RM, Markel JD. Speechcoding based upon vector quantization. IEEE TransAcoust Speech Signal Process 1980;28:562–574.

4. Juang BH, Wong DY, Gray AH Jr. Distortion per-formance of vector quantization for LPC voice cod-ing. IEEE Trans Acoust Speech Signal Process1982;30:294–303.

5. Imai S, Furui C. Unbiased estimation of logarithmicspectrum. Trans IEICE 1987;J70-A:471–480.

6. Itakura F, Saito S. Estimation of speech spectraldensity and format frequency by statistical technique.Trans IEICE 1970;53-A:35–42.

APPENDIX

Equivalence of Minimization of SpectralEstimation Function and Maximizationof Logarithmic Likelihood Function

The logarithmic likelihood function log P(x|c) can berepresented as follows, using the spectral evaluation func-tion E(x, c) = 1/2π ∫−π

π {R(ω) − logR(ω) − 1}dω and c:

where

Since A and B are constants independent of c, the maximi-zation of log P(x|c) with respect to c is equivalent to theminimization of E(x, c) with respect to c.

(A.2)

(A.3)

(A.1)

113

Page 11: Vector quantization of mel-cepstral coefficients using distortion measure for spectral analysis

AUTHORS (from left to right)

Toru Takahashi (student member) received a B.S. degree from the Department of Intelligent Information Systems,Nagoya Institute of Technology, in 1996, completed the first half of the doctoral program in electric information engineeringin 1998, and is now in the second half. He is engaged in research on speech analysis. He is a member of the Acoustic Societyof Japan and the Information Processing Society.

Keiichi Tokuda (member) received a B.S. degree from the Department of Electronic Engineering, Nagoya Institute ofTechnology, in 1984, completed the doctoral program at Tokyo Institute of Technology in 1989, and became a research associatethere. He has been an associate professor in the Department of Intelligent Information Systems, Nagoya Institute of Technology,since 1996. He holds a D.Eng. degree. He is engaged in research on speech analysis, synthesis, coding, and recognition, digitalsignal processing, and multimodal interface. He received an Electronic Communications Foundation Prize in 2001, and anIEICE Paper Award and Inose Prize in 2001. He is a member of the Acoustic Society of Japan, the Society for ArtificialIntelligence, the Information Processing Society, IEEE, and ISCA.

Takao Kobayashi (member) received a B.S. degree from the Department of Electrical Engineering, Tokyo Institute ofTechnology, in 1977, completed the doctoral program in 1982, and became a research associate at the Institute of PrecisionEngineering, Tokyo Institute of Technology. He was subsequently appointed an associate professor. Since 1998, he has been aprofessor of physical information system creation at the Graduate School of Science and Engineering. He holds a D.Eng. degree.He is engaged in research on digital filters, speech analysis, synthesis, coding, and recognition, and multimodal interface. Hereceived an Electronic Communications Foundation Prize in 2001, and an IEICE Paper Award and Inose Prize in 2001. He isa member of the Acoustic Society of Japan, the Information Processing Society, IEEE, and ISCA.

Tadashi Kitamura (member) received a B.S. degree from the Department of Electronic Engineering, Nagoya Instituteof Technology, in 1973, completed the doctoral program at Tokyo Institute of Technology in 1978, and served as a researchassociate at the Institute of Precision Engineering. He became a lecturer in 1983 in the Department of Electronic Engineering,Nagoya Institute of Technology, and was appointed an associate professor in 1984. Since 1995 he has been a professor in theDepartment of Intelligent Information Systems, Nagoya Institute of Technology. He holds a D.Eng. degree. He is engaged inresearch on speech information processing and multimedia information processing. He is a member of the Acoustic Society ofJapan, the Information Processing Society, IEEE, and ISCA.

114