Tandem Connections of Wideband and Narrowband Speech Communication Systems Part 2-Wideband-to-Narrowband Link

Copyr ight i.η> 1977 Amer ican Te lephone and Te legraph C o m p a n y T H E B R I A S Y S T E M T E C H N I C A L J O U R N A L

Vol . 56, N o . 9, November 1977 Printed in U.S.A.

Tandem Connections of Wideband and Narrowband Speech Communication Systems

Part 2—Wideband-to-Narrowband Link

By L. R. RABINER, M. R. SAMBUR, R. E. CROCHIERE, and D. J. GOODMAN

(Manuscript received March 25, 1977)

In this paper the tandem link of a 16 kbfs Continuously Variable Slope Delta modulator (CVSD) waveform coder and a 2.4 kb/s Linear Predictive Coding (LPC) vocoder is studied. Of prime concern are the effects of the CVSD coder on the LPC vocoder analyzer. In particular the problems involved in making a reliable voiced-unvoiced decision, estimating pitch period, and estimating LPC coefficients from the coder output are studied. It is shown that LPC coefficient estimation from the CVSD output is highly inaccurate. An analytical distortion measure (an LPC distance) is used to show the magnitude of the distortion introduced by the coder as a function of the signal gain into the CVSD coder. Although the remainder of the LPC analysis (i.e., pitch detection, voiced-unvoiced decision, and gain calculation) can be performed reasonably accurately, the magnitude of the distortions in estimating the LPC coefficients is sufficiently large to make the vocoded speech barely intelligible and of poor quality.

I. OVERVIEW OF THE TANDEM LINK OF CVSD TO LPC

In the first pa r t of th is paper we discussed the effects of the narrowb a n d system ( the L P C vocoder operat ing a t 2400 b/s) on the wideband system ( the C V S D waveform coder) . 1 The re it was shown tha t one of the major issues was tailoring the signal characteristics of the vocoded speech to reduce the peak factor, thereby reducing the amount of slope overload noise generated in the C V S D . When we consider the tandem link of C V S D

and L P C , far more serious problems are encountered since we m u s t est ima te t h e basic speech product ion pa ramete r s (i.e., pi tch, voiced-unvoiced, L P C coefficients) from a severely degraded signal. Since speech p a r a m e t e r es t imat ion is an imperfect process, even on high-quali ty speech, t h e effects of the C V S D coder, which include quant iza t ion noise

1723

ttn) 10 kHz INTERPOLATOR 16 kHz BANDPASS

10-16 kHz FILTER

G 1/G

^ _ J V \ _ C V S D - / Q \ R ^ ^ £ y ^ CODER ^ v i y ' H

200-3200 Hz

l(n) LPC 10 kHz DECIMATOR 16 kHz BANDPASS VOCODER 16-10kHz FILTER

200-3200 Hz

Fig. 1—Block d i a g r a m of s ignal process ing opera t ions in t a n d e m link of a CVSD coder a n d a n LPC vocoder .

as well as slope overload noise, could potent ia l ly make the t a n d e m link total ly unacceptable .

In th is paper we discuss several aspects of a t a n d e m link consisting of a C V S D waveform coder, and an L P C vocoder. Our purpose is to demons t ra t e the range of signal levels over which the L P C can opera te reasonably well in t a n d e m with the C V S D coder. Figure 1 shows a block diagram of the signal processing used in implement ing and test ing a C V S D - L P C t andem link. T h e speech signal s(n) is assumed to be sampled a t a 10-kHz rate . T h u s t h e first block in Fig. 1 is an interpolator to raise the sampl ing ra te of t h e signal to 16 kHz. T h e in terpola tor described in P a r t 1 of this paper was used here . 1 T h e 16-kHz signal was then sharply bandpass-f i l tered from 200 Hz to 3200 Hz using the 8th-order elliptic bandpass filter described in P a r t 1 of this paper . 1 T o simulate variations in overall signal level into the C V S D coder, a variable gain G was applied to the filtered 16-kHz signal. T h e gain G was varied from 0.009375 to 2.5 in the simulations which gave about a 50-dB variation in signal level over which t h e system was s tudied . T o compensa te for the inpu t scaling, a gain of 1 /G was used a t the ou tpu t of the C V S D coder. T h e ou tpu t of the coder was again sharply bandpass-filtered from 200 to 3200 Hz to remove the wideband quant iza t ion noise genera ted in the C V S D coder. For compat ibi l i ty with t h e L P C sys tem the signal was then decimated to a 10-kHz sampl ing ra te using the decimator described in P a r t 1 of this paper .

Figure 2 shows a block diagram of the processing required for the L P C vocoder. T h e L P C analyzer es t imates the following control pa rameters :

(i) P i t ch period (ii) Voiced-unvoiced decision (iii) Signal gain (iv) L P C pa rame te r s

1724 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1977

PITCH DETECTOR

V O I C E -U N V O I C E D DECISION

LPC PARAMETER ESTIMATION

GAIN COMPUTATION

LPC A N A L Y Z E R

PITCH PULSE

G E N E R A T O R

NOISE G E N E R A T O R

1/ +

P"> ORDER LINEAR SYSTEM

Σ »» y In - * )

LPC SYNTHESIZER

Fig. 2—Block diagram of I . P C analyzer and synthesizer.

T h e L P C synthesizer uses the es t imated paramete rs to recreate the speech in the manne r shown in Fig. 2. T h e detai ls of the analysis and synthesis me thods are described in P a r t 1 of this paper.

Based on our knowledge of bo th the techniques used in L P C analysis and t h e degradat ions in t roduced by the C V S D coder, it was ant ic ipated t h a t the voiced-unvoiced decision and the L P C pa ramete r est imation algorithms would be most affected by the CVSD coder. Thus , in the next two sections we discuss the specific algorithms used for voiced-unvoiced detect ion (along with pi tch detect ion) and show results on how the algor i thms performed in the t andem link as a function of the signal level into the C V S D coder. In Section IV we present results on the accuracy with which the L P C pa ramete rs were es t imated from the coder ou tput . For a measure of similarity between coder inpu t and ou tput , t he L P C dis tance measure proposed by I takura is used. Finally, in Section V we discuss t h e interact ions between the C V S D coder and the L P C vocoder and suggest some possible ways to improve the performance of a tandem link of a wideband and a nar rowband system.

II. PITCH DETECTOR AND VOICED-UNVOICED DETECTOR USED IN THE TANDEM LINK

As discussed in the preceding section, the choice of an appropr ia te pi tch detec tor and voiced-unvoiced detector is critical to the proper operation of the L P C vocoder. Based on a series of intensive investigations in to bo th objective and subjective rankings of a variety of pitch detecto r s , 2 - 3 it was shown t h a t s imple waveform pitch detectors would be in-

WIDEBAND-TO-NARROWBAND LINK 1725

g S! »

a < o > 3

L J O . Û. UJ —

o. ce r-

< LU ,

< I-uj m

S o | m - ι E

> ° Û uj *~ Ζ _1

ζ Ο

CL 3 Œ "J

? · * Ο

ζ ζ Ο Ο

I I - I -Ο < <

< CC û-

Ο Ο u ο

ο. <

se Ο < Ι Ι LU Η (Λ α. ^ LU m * OC ne - ι x

<<·-α. Ζ UJ S O U Ο Lφ ζ

ο LU

>-Ο

2ft 5 Ο

Ζ Ο LU Χ

LU Χ Μ 11- ~


adequa t e for a severely degraded waveform such as obtained a t the o u t p u t of a CVSD coder. T h u s ei ther a sophist icated correlat ion-type p i tch detector , or a spectra l - type pi tch detector is required for this ap plication. F rom this class of pi tch detec tors bo th t h e AMDF4 and AUTOC5 p i tch detectors were found to be moderate ly fast, and sufficiently robus t over a wide variety of t ransmission condit ions and pi tch range of the speaker. Because of the familiarity of the au thors with the AUTOC p i tch detector , th is me thod was finally selected.

Before the method of operation of this pitch detector is reviewed, some commen t s m u s t be m a d e abou t the selection of the voiced-unvoiced detector . Ideally one would prefer to make a voiced-unvoiced decision prior to, and independent of, the pitch detection. In this manner the role of the pi tch detector is strictly to make the best est imate of pitch period, given a priori t h a t the segment is accurately classified as voiced. For unvoiced segments, the pitch detector is not used a t all. There have been a t least three proposed methods for making a voiced-unvoiced decision prior to a n d independen t of any pi tch d e t e c t i o n . 6 8 However, all three me thods suffer from the necessity of having a t ra ining set of da t a t h a t character izes t h e signal classes. For CVSD coding, t h e variabili ty of the signals due to var ia t ions in gain is exceedingly large—i.e., a 40-dB varia t ion in i npu t level can change the signal from one with a large amoun t of granular noise to one with a large a m o u n t of slope overload noise. Therefore , making a voiced-unvoiced decision accurately wi thout a periodicity measurement (pitch detector) to aid the decision is extremely difficult. Thus , the voiced-unvoiced decision is combined with the pitch detec t ion in the AUTOC method.

A block diagram of the AUTOC pi tch detector is given in Fig. 3. T h e method requires t h a t the speech be lowpass-filtered to 900 Hz. T h u s a 99-point linear phase, FDR digital filter is used here. 9 The lowpass-filtered speech is sectioned into overlapping 30-msec (300 samples a t 10 kHz) sections for processing. Since the pitch period computat ion for all pitch detectors is performed 100 times/second—i.e., every 10 msec—adjacent sections overlap by 20 msec or 200 samples.

T h e first stage of processing is the computat ion of a clipping level CL for the current 30-ms section of speech. T h e clipping level is set at a value which is 64 percen t of the smaller of peak absolute sample values in the first and last 10-ms port ions of the section. Following the determinat ion of the clipping level, t he 30-ms section of speech is center clipped, and t hen infinite-peak-clipped, result ing in a signal which assumes one of th ree possible values, 1 if the sample exceeds the positive clipping level, —1 if the sample falls below the negative clipping level, and 0 otherwise.

Following clipping, the autocorrelation function for the 30-ms section is computed over a range of lags from 20 samples to 200 samples (i.e.,


SEMIAUTOMATIC PITCH DETECTOR

SAPD

REFERENCE PITCH CONTOUR

ι CA NONLINEAR / I SMOOTHER [ I

ΡNONLINEAR I Γ

SMOOTHER | \ 'V ERROR

ANALYSIS

AUTOCORRELATION PITCH DETECTOR

AUTOC

Fig. 4—-Block diagram of system used to compare pitch contours from two pitch detectors and to perform an appropriate error analysis.

2-msec to 20-msec period). Additionally, t h e autocorrelat ion a t 0 delay is computed for appropriate normalization purposes. T h e autocorrelation function is t hen searched for its max imum (normalized) value. If t he m a x i m u m (normalized value) exceeds 0.25, the section is classified as voiced and the location of t h e max imum is the pi tch period. Otherwise, t he section is classified as unvoiced.

In addi t ion to t h e voiced-unvoiced classification based on the autocorrelat ion function, a prel iminary tes t is carried ou t on each section of speech to de te rmine if t he peak signal ampl i tude within the section is sufficiently large to war ran t the pi tch computa t ion . If t he peak signal level within t h e section is below a threshold computed from the background noise level, t he section is classified as unvoiced (silence) and no pi tch computa t ions are made .

III. EFFECTS OF CVSD CODING ON PITCH DETECTION

T o investigate the effects of C V S D coding on pi tch detect ion, two sentences were used whose pi tch contours were known extremely accurately. 9 Figure 4 shows a block diagram of the experimental arrangement used to show pitch detection errors in the tandem link. The speech, s(n), is analyzed by the S A P D m e t h o d 9 to give t h e reference pi tch contour, pr(m), m = 1, 2 , . . . , M, where M is the n u m b e r of 10-msec frames in t h e u t te rance , and p r ( m ) = 0 if t he frame is classified as unvoiced. Otherwise p r ( m ) is the es t imated pi tch period. Extensive tes ts have shown the S A P D me thod to be a reliable and robus t procedure for obta in ing the reference pi tch contour . 9

T h e tes t pi tch contours are obtained by sending the speech ei ther directly to the pi tch detector, or first through the C V S D coder where the

1728 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1 9 7 7

signal level is determined by the gain G. We denote the test pitch contour as Pt(m), m - 1 , 2 , . . . , Λί. The error analysis compares pr(m) and pt (m) over the u t t e rance and makes the following measurements :

(i) Average pi tch period error dur ing voiced regions, P, defined as

Ρ = ΤΓ Σ \Pr(m) - pt(m)] (1) /V t , m = l

pr{m)xO

Pt(m);*0

| p i ( m ) - p r ( m ) | « I 0

where Íő is the number of voiced regions satisfying the conditions tha t the reference p i tch contour indicates a voiced region {pr(m) τ* 0), the tes t pi tch contour indicates a voiced region ( p t ( m ) j± 0), and the difference in es t imated pi tch period is less t han or equal to 10 samples (\pt(m) - pr(m)\ « 10).

(ii) S t anda rd deviation of the pitch period during voiced regions, σ ρ , defined as

σρ = \ΤΓ Σ ΙρΛτη) - p,(m))* - Ρ*] LAit. m = l J

1/2 (2)

p , (m ) *0 P / ( m ) * 0 |pi(m)-pr(m)|«10

(iii) N u m b e r of voiced-to-unvoiced errors, Nvu, defined as

Nuu = Σ g(Pr(m), pt(m)) (3) m = l

where

g{x,y) = 1 if ÷ > Oand y = 0

= 0 otherwise (4)

(iv) N u m b e r of unvoiced-to-voiced errors, Nuv, defined as

M

NUu= Σ g(pc(m), pr(m)) (5)

(u) N u m b e r of gross pitch period errors, Í a, defined as Λί

Nc:= Σ f(pAm),P,(m)) (6) m = l


Table I — Error analysis for utterance "Every salt breeze comes from the sea"

(a) Analysis on raw pi tch d a t a Signal Ρ σρ Nm Nuv No

Original speech 0.142 0.786 8 7 1 CVSD-G = 0.009375 1.154 1.925 69 73 29 CVSD-G = 0.0395 0.221 0.901 22 18 6 CVSD-G = 0.158 0.252 0.874 7 8 4 CVSD-G = 0.316 0.288 0.961 6 8 5 CVSD-G = 0.632 0.294 0.952 3 12 4 CVSD-G = 1.264 0.397 1.037 5 23 4 CVSD-G = 2.528 0.397 1.159 4 37 10

(b) Analysis on nonl inear ly s m o o t h e d pi tch d a t a Signal Ń "ρ N „ Nu» NG


where

fipAm), pt(m)) = 1 if ρΛτη) * 0,pt{m) * 0 , \Pr(m) - Pt(m ) | > i o

= 0 otherwise ( 7 )

Since many of the errors m a d e in pi tch detect ion are easily corrected by a nonl inear median- type s m o o t h e r , 1 0 t h e tes t a r rangement in Fig. 4 also shows the capabil i ty of passing both the reference and tes t p i tch contours t h rough such a smoother prior to the error analysis. Resul ts will be presented on both t h e raw da ta and the smoothed da ta .

Resul t s obta ined on two different sentences are presented in Tables I and II , and some of the key results are summarized in Figs. 5 - 8 . Ut terance 1 was the sentence "Every salt breeze comes from the sea" spoken by a low-pitched male and recorded off a conventional te lephone line. T h e u t t e r ance had 2 5 6 frames (i.e., it was 2 . 5 6 seconds long), of which 1 0 8 were unvoiced and 1 4 8 were voiced. Tab le I shows values of Ń, óń, N„u, Nuu, and Nq as a function of t h e gain G , for bo th the raw da ta and t h e nonl inear ly smoothed pi tch contours . Figure 5 shows plots of Nvu

versus G (plot ted in d B on a normalized scale) for bo th t h e raw and smoothed da ta , and Fig. 6 shows plots of JVU ٢ versus G . Results obtained on t h e original u t t e rance (uncoded) are also presented as a means of comparison. _

As seen in Tab le I, values of Ń for the coded speech were abou t 2 to 3 t imes larger t han for the original speech (except for G = 0 . 0 0 9 3 7 5 ) .


Table II — Error analysis for utterance "I know when my lawyer is due"

(a) Analysis on raw pi tch d a t a Signal Ρ "ρ •V«, N u u Ne


(b) Analysis on nonl inear ly smoo thed p i tch d a t a Signal Ρ «Ρ Ν „ Nu, Ne


However, values of Ń were all less t han 0.5 samples (except for G = 0.009375) indicating t h a t the average pi tch period errors, due to the coder, were still relatively insignificant. For a gain of G = 0.009375 (large a m o u n t s of granular noise) the pi tch detect ion process broke down en-

100

9 0

BO

70

60

S? 50

4 0

3 0

20

10

0 - 3 0 - 20 - 1 0 0 10 20 30

S IGNAL LEVEL G IN OECIBELS

Fig. 5—Plo t of n u m b e r of voiced-to-unvoiced er rors versus CVSD signal level for u t te r ance " E v e r y salt breeze comes from t h e sea."

W I D E B A N D - T O - N A R R O W B A N D LINK 1731

90

- 3 0 - 2 0 - 1 0 0 10 20 3 0 S IGNAL LEVEL C IN OECIBELS

Fig. 6—Plot of n u m b e r of unvoiced-to-voiced er rors versus C V S D signal level for u t t e r ance " E v e r y sal t breeze comes from t h e sea . "

t irely. T h u s , a t this ex t reme the L P C vocoder cannot possibly operate . However, as was shown previously, for this value of gain the C V S D coder produced unintelligible speech; hence we need not be concerned with this result .

Values for óń for the coded speech were essentially identical to those obtained for the original utterance. Also the number of gross pitch period errors was small for all values of G except G = 2.528 and G = 0.009375,

901 1

B0 - SENTENCE: I KNOW W H E N MY LAWYER IS DUE

70 -/V„ = 162, Nu = 13

6 0 -

- 3 0 - 2 0 - 1 0 0 10 2 0 3 0

S IGNAL LEVEL G IN DECIBELS

Fig. 7 — P l o t of n u m b e r of voiced-to-unvoiced er rors versus C V S D signal level for u t t e r ance "I know when my lawyer is d u e . "


90

80

70

60

50

30

20

10

0

- 3 0 - 20 - 1 0 0 10 20 30 SIGNAL LEVEL G IN DECIBELS

Fig. 8—Plo t of n u m b e r of unvoiced-to-voiced er rors versus CVSD signal level for u t t e r a n c e " I know when my lawyer is d u e . "

and all these errors were correctable by the nonlinear smoother, as shown in T a b l e lb . T h u s , one can conclude t h a t for cases in which bo th t h e reference and test p i tch contours were classified as voiced, the coder did not impede accurate determination of the pitch period—i.e., pitch is well preserved in t h e CVSD ou tpu t .

Now the major question is how well the voiced-unvoiced decision could be made on the coder ou tput . An examination of Table I and Figs. 5 and 6 shows tha t , for several values of G, a subs tan t ia l n u m b e r of unvoiced-to-voiced errors occurred. However mos t of these errors were easily correctable by the nonlinear smoother since the es t imated pitch per iods (when such errors occur) are essentially random, and are au tomatical ly " s m o o t h e d " to zero (i.e., unvoiced). Also some of the voiced-to-unvoiced errors are corrected by t h e smoother .

For th is sentence it is concluded t h a t over a fairly large variation in coder i npu t gain, the deter iora t ion of the signal is not so large so as to make pi tch detect ion unreliable.

A second set of resul ts is given for the u t t e rance " I know when my lawyer is d u e " spoken by another male speaker over a high-qual i ty microphone. This sentence had 175 frames (1.75 seconds) of which only 13 were unvoiced and 162 were voiced. T h u s this u t terance was essentially all voiced. Resul ts obta ined on this u t te rance are given in Tab le II and Figs. 7 and 8. Again it is seen tha t , except for G = 0.009375, values of P, a ń and Í a (smoothed) are essentially the same for the coder o u t p u t as for the original. Since there were very few unvoiced frames, the number of unvoiced-to-voiced errors is also the same for the coded speech as for the original. However, t h e number of voiced-to-unvoiced errors for the

SENTENCE: I KNOW WHEN MY LAWYER IS DUE

, RAW PITCH DATA

• NONLINEARLY SMOOTHED PITCH DATA

ORIGINAL SPEECH


coded speech is much larger t h a n for t h e original speech. Mos t of these errors occur in t h e region of t h e / z / in "is d u e , " and as such are not correctable by t h e nonl inear smoother . However, the errors in this low-intensi ty region are no t very precept ible and therefore such errors are no t overly crucial.

In s u m m a r y we have shown t h a t t h e CVSD coder preserves t h e pi tch of t h e speech over a reasonably large signal range and t h a t t h e voice-unvoiced decision can also be reliably m a d e over a fairly large dynamic range of coder inputs .

IV. EFFECTS OF CVSD CODING OF ESTIMATION OF LPC COEFFICIENTS

T h e nex t issue to consider is the effects of t h e CVSD coder on the est ima t ion of t h e LPC pa ramete r s . T h e LPC coefficients model t h e combined t ransfer function of t h e vocal t rac t , glottal source, and radia t ion load. Incorrect es t imates of t h e coefficients can seriously pe r tu rb t h e frequency spect rum of the modeled speech signal and, hence, affect the intelligibility of the synthesized s o u n d . 1 1

4.1 Distance measure

T o evaluate objectively the spectral distortion introduced by the CVSD coder, an LPC d i s tance measure proposed by I t akura was e m p l o y e d . 1 2

T h e LPC d i s tance measure is defined as

a„ = LPC coefficient vector (1, a i , . . . , a p ) measured in the n t h frame of t h e original uncoded speech signal.

bn = LPC coefficient vector measured in the n t h frame of the CVSD coded speech signal

and V is t he speech correlat ion mat r ix with e lements V i ; defined as

where x(n) is t h e speech signal and Í is t he n u m b e r of samples in the frame.

Figure 9 shows examples which i l lustrate how the measured dn is useful in measur ing t h e degree of spectral deviat ion of t h e coded sound from t h a t of t h e original.* Al though the measure dn is not the only possible indicator of spectral d i s t o r t i on , 1 3 it has been shown to closely

* T h e quantitative significance of d„ is discussed in detail in Ref. 14.

1734 THE BELL SYSTEM TECHNICAL JOURNAL, NOVEMBER 1 9 7 7

(8)

where

(9)

FREQUENCY, Hz

Fig. 9—Plo t s of typical spec t ra and t h e resul t ing values of d„ for t h r e e examples .

correspond to perceptual judgments . 1 4 In addition, the measure has been effectively appl ied in problems of speech recogni t ion , 1 2 speaker recogn i t ion , 1 5 and variable frame rate synthesis . 1 6 Before discussing the results of t h e L P C dis tance evaluat ion of the C V S D coder, it is impor tan t to emphasize t h a t dn is not a perfect measure of perceptual changes in the charac ter of t h e s o u n d . 1 1 ' 1 7 However, it is a good measure of spectral deviat ions, which is a useful indicator of intelligibility lo s s . 1 4

4.2 Evaluation

T h e two sentences utilized in the investigation of pi tch detect ion accuracy were also employed in the evaluat ion of the effects of C V S D distort ion on the est imation of the L P C coefficients. For each sentence, the L P C coefficients for the uncoded, original speech are first calculated. T h e L P C pa ramete r s are calculated 50 t imes per second a t a uniform rate using the autocorrelat ion m e t h o d 1 8 with a 30-msec H a m m i n g window. T h e speech is preemphasized using a first order digital network with t ransfer function

H(z) = 1 - 0 . 9 5 ε - 1 (10)

prior to L P C analysis in order to minimize the effects of performing t h e L P C analysis a t a uniform ra te (i.e., pi tch asychronous ly) . 1 9 T h e resul ts of this analysis provide the reference L P C coefficients (the a„ 's) for each 20-msec frame.


SENTENCE: EVERY SALT BREEZE COMES FROM THE SEA

d„

d„

G - 2.52B

1 256 FRAME NUMBER

"PERCEPTUALLY" SIGNIFICANT BOUNOARY

Fig. 10—Values of dn versus frame n u m b e r as a function of CVSD s ignal level for u t t e r a n c e " E v e r y sa l t breeze comes from t h e sea . "

A similar L P C analysis is performed for each of the various C V S D coded versions of the original sentences. These analyses provide the 6„ 's for use in the calculation of distance (d„) between the original sentence and

1736 THE BELL SYSTEM TECHNICAL JOURNAL. NOVEMBER 1977

SENTENCE I KNOW W H E N MY LAWYER IS DUE

" P E R C E P T U A L L Y " S I G N I F I C A N T B O U N D A R Y

G ' 0 .009375

G - 0.632

G - 2.52B

FRAME NUMBER

Fig. 11—Values of d„ versus frame n u m b e r as a function of C V S D s ignal level for u t t e r ance " I know when my lawyer is d u e . "

t h e par t icular C V S D - c o d e d sentence. Figures 10 and 11 show t h e frame-by-frame L P C dis tance measured for each CVSD-coded version of t h e two original sentences. T h e dashed line in the figures refers to a suggested threshold of d„ = 0.9 for a just-perceptible difference. 1 4 Figure


G A I N IC I

Fig. 12—Plots of average LPC dis tance (d) as a function of CVSD signal level (G) for both t e s t sen tences .

12 shows the average L P C d is tance as a function of gain G. T h e average d is tance is defined as

- 1 Ai d = - Z < f „ (11)

where M is t h e n u m b e r of frames in the sentence. T h e resul ts of t h e L P C d is tance analyses are str iking in t h a t the dis

tance uniformly decreases as the gain G increases. This result is in direct opposit ion to the S N R findings discussed in the first par t of this paper . 1

According to the L P C d is tance measure , t he CVSD -coded sentence is improving in qual i ty (i.e., closer in d is tance to t h e original) as the gain increases. However, according to the S N R measurements , the similarity between t h e original and the CVSD -coded sentence is decreasing as the gain increases beyond G = 0.158. Although the dissimilarity between the waveforms of the original and the CVSD-coded version with G = 1.264 is apparent from Fig. 13, it is interesting to note tha t informal perceptual exper iments indicate t h a t t h e quali ty of the C V S D coder is actually improving as the gain G increases. Since the L P C d is tance measure is sensitive to spectral distortions, it is (in this case) a better measure of quality than S N R . T h e use of the LPC distance measure as an indication of speech qual i ty has been suggested by other a u t h o r s . 1 4

V. COMPATIBILITY OF CVSD WITH LPC

As a final check on the performance of the entire system, an informal perceptua l evaluat ion of t h e C V S D - L P C t a n d e m link depicted in Fig. 1 was performed. T h e L P C vocoder was efficiently designed for a bi t ra te


( a ) O R I G I N A L W A V E F O R M

( b ) C V S D - C O D E O W A V E F O R M [G = 1.264)

Fig. 13—Waveform p lo ts of one section of a n u t t e r a n c e and the resul t ing o u t p u t of the C V S D coder for G = 1.264.

of 2.4 k b / s 2 0 and the C V S D was designed for 16 kb/s operat ion using the various gains G. For the smallest gain, G = 0.009375, the speech was unintell igible. For the higher gains, t he ou tpu t speech was intelligible, b u t the qual i ty was significantly worse than the quali ty of the 2.4 kb/s L P C synthesis . T h e qual i ty of the t andem link appeared to sa tura te (or even become slightly worse due to the poorer estimates of pitch and gain) for G > 0.158. Even for the best-quality output , the combination of C V S D noise and the paramet r ic dis tort ions of the L P C vocoder rendered the t a n d e m a marginal communica t ions link.

VI. SUMMARY

In the t a n d e m link of a wideband and nar rowband speech communication system in which the wideband system was a 16 kb/s C V S D coder and the narrowband system was a 2.4 kb/s L P C vocoder, the C V S D coder was shown to be the weak link. T h e major dis tort ion introduced by the


C V S D coder was spectral dis tort ion as measured using an appropr ia te L P C d is tance measure . This distort ion was sufficiently severe to make the L P C ou tpu t , a l though intelligible, of poor quali ty. I t was further shown tha t the waveform distortion in the C V S D coder was not so severe so as to make pi tch detect ion unreliable, and even a reliable voiced-unvoiced decision could be made on the C V S D - c o d e d speech.

T h e major conclusion from this s tudy is tha t alternative 16-kb/s coders be considered as the wideband communicat ion system for such communica t ion links. Possible al ternat ives include A D P C M s y s t ems , 2 1 sub-band c o d e r s , 2 2 and t ransform code r s . 2 3

REFERENCES

1. R. E. Crochiere , D . J. G o o d m a n , L. R. Rabiner , a n d M. R. S a m b u r , " T a n d e m Connec t ions of W i d e b a n d a n d N a r r o w b a n d Speech Communica t i ons Systeme: P a r t 1—Nar rowband- to -Wideband L ink , " B.S.T.J . , th i s issue, p p . 0000-0000.

2. L. R. Rab ine r , M. J . Cheng , A. E. Rosenberg , and C. A. McGonegal , "A Compara t ive Performance S tudy of Several Pi tch-Detect ion Algori thms," I E E E T rans . Acoustics, Speech , Signal Proc . , ASSP-24, No . 5 (October 1976), p p . 399-418.

3. C. A. McGonega l , L. R. Rab ine r , a n d A. E . Rosenberg , "A Subject ive Eva lua t ion of P i tch Detect ion M e t h o d s Using L P C Synthesized Speech ," I E E E T r a n s . Acoustics, Speech , Signal Proc . , ASSP-25, No . 3 ( June 1977), pp . 221-229.

4. M. J . Ross e t al., "Average M a g n i t u d e Difference Func t ion P i t ch Ex t r ac to r , " I E E E T r a n s . Acoust ics , Speech , Signal Proc . , ASSP-22 (October 1974), p p . 353-362.

5. J . J . Dubnowski, R. W. Schδfer, and L. R. Rabiner , "Real -Time Digital Hardware Pi tch Detec tor , " I E E E T r a n s . Acoustics, Speech Signal Proc. , ASSP-24 (February 1976), p p . 2 - 8 .

6. B . S. Ata l a n d L. R. Rabiner , "A P a t t e r n Recogni t ion Approach to Voiced-Un-voiced-SUence Classification with Applications to Speech Recognition," I E E E T r ans . Acoust ics , Speech , and Signal Processing, ASSP-24, No . 3 ( J u n e 1976), p p . 2 0 1 -121.

7. R. J . McAulay , " O p t i m u m Classification of Voiced Speech, Unvoiced Speech and Si lence in t h e Presence of Noise and In te r fe rence , " Lincoln Labora to ry Technica l N o t e 1976-7, J u n e 1976.

8. L. R. R a b i n e r and M. R. S a m b u r , "Appl ica t ion of an L P C Dis tance Met r i c to t h e Voiced-Unvoiced-Si lence Detec t ion P r o b l e m , " s u b m i t t e d for publ ica t ion.

9. C. A. McGonegal , L. R. Rabiner , a n d A. E. Rosenberg , "A Semi -Automat i c P i t ch Detector ," I E E E T r ans . Acoustics, Speech, Signal Proc., ASSP-23 (December 1975), p p . 570-574.

10. L. R. Rab ine r , M. R. S a m b u r , a n d C. E. Schmid t , "Appl ica t ions of a Nonl inear Smoothing Algorithm to Speech Processing," I E E E T rans . Acoustics, Speech, Signal Processing, ASSP-23, No . 6 (December 1975), p p . 552-557.

11 . M. R. S a m b u r a n d N . S. J a y a n t , "Speech Enc ryp t ion by Man ipu la t ions of L P C P a r a m e t e r s , " B.S.T.J . , 55 , No . 9 (November 1976), p p . 1373-1388.

12. F . I takura, " M i n i m u m Predict ion Residual Principle Applied to Speech Recognition," I E E E T r ans . Acoustics, Speech, Signal Processing, ASSP-23, No. 1 (February 1975), p p . 67 -72 .

13. A. H . Gray J r . and J . D. Market , "Di s t ance Measu res for Speech Process ing ," I E E E T r a n s . Acoust ics , Speech , Signal Proc . , ASSP-24 (October 1976), p p . 380-391 .

14. M. R. S a m b u r and N. S. J a y a n t , " L P C Ana lys i s /Synthes i s F r o m Speech I n p u t s Containing Noise or Additive Whi te Noise," L E E E T r ans . Acoustics, Speech, Signal Proc., ASSP-24 (December 1976), No. 6.

15. H . Wak i t a , " O n t h e Use of L inea r Pred ic t ion E r r o r Energy for Speech and Speake r Recogni t ion ," J . Acoust . Soc. Amer. , 57, S u p p l e m e n t No. 1, Spr ing 1975 (A).

16. D. T . Magil l , " A d a p t i v e Speech Compress ion for P a c k e t Communica t ion S y s t e m s , " T e l e c o m m u n i c a t i o n s Conference Record , I E E E Pub l . 73, CH0805-2, 29D 1-5.

17. J . R. M a k h o u l , L. V i swana than , L. Cosel, and W. Russel , " N a t u r a l Communica t i on wi th C o m p u t e r s : Speech Compress ion Research a t BBN," BBN R e p o r t No. 2976, / / , Bol t Be ranek and N e w m a n , Inc. , Cambr idge , Massachuse t t s , December 1974.


18. J . D. Marke l a n d A. H . Grav Jr. , Linear Prediction of Speech, Spr inger Verlag, 1976.

19. L. R. Rab ine r , B . S. Atal , a n d M. R. S a m b u r , " L P C P red ic t ion Er ror—Analys i s of its Var ia t ion wi th t h e Pos i t ion of the Analysis F r a m e , " s u b m i t t e d for publ ica t ion .

20. M. R. S a m b u r , " A n Efficient L inear Pred ic t ion Vocoder , " B.S.T.J . , 54, No. 10 (Dec e m b e r 1975), p p . 1693-1723.

2 1 . P . Cummiskey , N . S. J a y a n t , and J . L. F lanagan , " A d a p t i v e Quan t i za t ion in Different ia l P C M Coding of Speech , " B.S.T.J . , 52, No. 7 (Sep t embe r 1973), p p . 1105-1118.

22. R. E. Crochiere , S. A. Webber , and J . L. F lanagan , "Dig i ta l Coding of Speech in S u b b a n d s , " B .S .T .J . , 55 , No . 8 (October 1976), p p . 1069-1086.

23. R. Zelinski a n d P . Noll , " A d a p t i v e T r a n s f o r m Coding of Speech S igna ls , " t o a p p e a r in IEEE T r a n s . Acoustics, Speech, Signal Proc. , ASSP-25, No . 4 (August 1977), pp. 299-309.


Documents

Tandem Connections of Wideband and Narrowband Speech Communication Systems Part 2-Wideband-to-Narrowband Link