Digital Representations of Speech Signals


PROCEEDINGS OF THE IEEE, VOL. 63, NO. 4, APRIL 1975

RONALD W. SCHAFER, SENIOR MEMBER, IEEE, AND LAWRENCE R. RABINER, MEMBER, IEEE

    Invited Paper

Abstract - This paper presents several digital signal processing methods for representing speech. Included among the representations are simple waveform coding methods; time domain techniques; frequency domain representations; nonlinear or homomorphic methods; and finally linear predictive coding techniques. The advantages and disadvantages of each of these representations for various speech processing applications are discussed.

I. INTRODUCTION

THE NOTION of a representation of a speech signal is central to almost every area of speech communication research. Often the form of representation of the speech signal is not singled out for special attention or concern, and yet it is implicit in the formulation of a problem or in the design of a system. A good example of this situation is in telephony, where speech is, in fact, represented by fluctuations in electrical current for purposes of long distance transmission. In other situations, however, we must often pay strict attention to the choice and method of implementation of the representation of the speech signal. This is true, for example, in such diverse areas as speech transmission, computer storage of speech and computer voice response, speech synthesis, speech aids for the handicapped, speaker verification and identification, and speech recognition. In all of these areas, digital representations, i.e., representations as sequences of numbers, are becoming increasingly dominant. There are two basic reasons for this. First, through the use of small general-purpose digital computers, speech researchers have been able to apply a wide variety of digital signal processing techniques to speech communication problems. These techniques cover a range of complexity and sophistication that is impossible to match with analog methods. Second, the recent and predicted future developments in integrated circuit technology make it possible to realize digital speech processing schemes economically as hardware devices having the same sophistication and flexibility as a computer program implementation.

The purpose of this paper is to survey the important and most useful methods for obtaining digital representations of speech signals. This is a formidable task since the number and variety of such methods is great. Thus we must begin by disclaiming any pretentions to completeness; we shall only try to point out the methods that in our view are the most useful in the technical and research areas of speech communication.

The organization of this paper is as follows. In Section II, we briefly review the speech production process and show how it can be modeled with a simple digital representation. We then discuss a class of waveform coding methods for representing speech in Section III.

Manuscript received September 18, 1974; revised November 25, 1974. R. W. Schafer was with Bell Laboratories, Inc., Murray Hill, N.J.

Included in this class are linear pulse-code modulation (PCM), delta modulation (DM), differential PCM (DPCM), adaptive delta modulation, and finally adaptive differential PCM. It is shown at the end of this section that if an adaptive predictor is incorporated in these models, the waveform coding technique becomes quite similar to the linear predictive coding method to be discussed in Section VII. In Section IV, we discuss various time-domain representations of speech. Included in this section are the concepts of zero crossing analysis, autocorrelation functions, peak-to-peak type estimations, and the use of energy functions. In Section V, we discuss frequency domain representations of speech, for which the concept of short-time spectrum analysis is dominant. Several examples of systems based on short-time spectrum analysis are given in this section. In Section VI, we discuss the topic of homomorphic analysis of speech. In this section the concept of the cepstrum is introduced. Finally, in Section VII, we discuss the two basic methods of linear prediction analysis, explain their similarities and differences, and discuss the basic concepts which are derivable from them, including the spectrum, cepstrum, and autocorrelation function.

II. A DIGITAL MODEL FOR PRODUCTION OF THE SPEECH SIGNAL [1]-[3]

A schematic diagram of the human vocal apparatus is shown in Fig. 1. The vocal tract is an acoustic tube that is terminated at one end by the vocal cords and at the other end by the lips.

An ancillary tube, the nasal tract, can be connected or disconnected by the movement of the velum. The shape of the vocal tract is determined by the position of the lips, jaw, tongue, and velum.

Sound is generated in this system in three ways. Voiced sounds are produced by exciting the vocal tract with quasi-periodic pulses of air pressure caused by vibration of the vocal cords. Fricative sounds are produced by forming a constriction somewhere in the vocal tract and forcing air through the constriction, thereby creating turbulence which produces a source of noise to excite the vocal tract. Plosive sounds are created by completely closing off the vocal tract, building up pressure, and then quickly releasing it. All these sources create a wide-band excitation of the vocal tract, which in turn acts as a linear time-

varying filter which imposes its transmission properties on the frequency spectra of the sources. The vocal tract can be characterized by its natural frequencies (or formants), which correspond to resonances in the sound transmission characteristics of the vocal tract.

A typical speech waveform is shown in Fig. 2, which illustrates some of the basic properties of the speech signal.



Fig. 1. Schematic diagram of mechanism of speech production. (After Flanagan et al. [2].)


Fig. 2. An illustration of a speech waveform, corresponding to the utterance "Should we chase."

We see, for example, that the signal contains segments whose properties remain rather constant. Such segments are demarked in Fig. 2 below the waveform. These sample segments have the appearance either of a low-level random (unvoiced) signal (as in /ʃ/ or /tʃ/ in Fig. 2) or of a high-level quasi-periodic (voiced) signal (as in /w/ or /i/), with each period displaying the exponentially decaying response properties of an acoustic transmission system. We note that the dynamic range of the waveform is large; i.e., the peak amplitude of a voiced segment is much larger than the peak amplitude of an unvoiced segment.

Because the sound sources and vocal tract shape are relatively independent, a reasonable approximation is to model them separately, as shown in Fig. 3. In this digital model, samples of the speech waveform are assumed to be the output of a time-varying digital filter that approximates the transmission properties of the vocal tract and the spectral properties of the glottal pulse shape. Since, as is clear from Fig. 2, the vocal tract changes shape rather slowly in continuous speech (likewise its sound transmission properties), it is reasonable to assume that the digital filter in Fig. 3 has fixed characteristics over a time interval on the order of 10 ms. Thus the digital filter may be characterized in each such interval by an impulse response or a set of coefficients for a digital filter. For voiced speech, the digital filter is excited by an impulse train generator that creates a quasi-periodic impulse train in which the spacing between impulses corresponds to the fundamental period of the glottal excitation. For unvoiced speech, the filter is excited by a random number generator that produces flat-spectrum noise. In both cases, an amplitude control regulates the intensity of the input to the digital filter.
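This source-filter model is simple enough to state directly in code. The sketch below (Python; the filter coefficients, gain, and pitch period are illustrative assumptions, not values from the paper) drives an all-pole digital filter with either a quasi-periodic impulse train or flat-spectrum noise, per Fig. 3:

```python
import numpy as np

def synthesize_frame(a, gain, n, pitch_period=None):
    """One fixed-parameter frame of the Fig. 3 model (~10 ms in practice)."""
    if pitch_period:                          # voiced: quasi-periodic impulse train
        e = np.zeros(n)
        e[::pitch_period] = 1.0
    else:                                     # unvoiced: flat-spectrum random noise
        e = np.random.randn(n)
    x = np.zeros(n)
    for m in range(n):                        # x(m) = sum_k a_k x(m-k) + gain * e(m)
        acc = gain * e[m]
        for k, ak in enumerate(a, start=1):
            if m - k >= 0:
                acc += ak * x[m - k]
        x[m] = acc
    return x

# a 100-sample period at a 10-kHz rate corresponds to a 100-Hz fundamental
voiced = synthesize_frame(a=[1.3, -0.6], gain=0.1, n=400, pitch_period=100)
unvoiced = synthesize_frame(a=[1.3, -0.6], gain=0.1, n=400)
```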


Fig. 3. Digital processing model for production of speech signals.

This model is the basis of a wide variety of representations of speech signals. These are conveniently classified as either waveform representations or parametric representations, depending upon whether the speech waveform is represented directly or whether the representation is in terms of time-varying parameters of the basic speech model. These representations range from samples of the waveform taken periodically in time to estimates of the parameters of the model in Fig. 3. The choice of the digital representation is governed by three major considerations: processing complexity, information (bit) rate, and flexibility. By complexity, we mean the amount of processing required to obtain the chosen representation. In many cases processing complexity is a measure of the cost of implementation of the system in hardware. A simple representation can generally be implemented more economically than a complex representation. Thus complexity is often the overriding consideration in some transmission applications where low terminal cost is crucial. Information (bit) rate is a measure of the redundancy in the speech signal which has been removed by the processing. A low bit rate means that the digital representation of the speech signal can be transmitted over a low capacity channel, or stored efficiently in digital memory. Finally, flexibility is a measure of how the speech can be manipulated or altered for applications other than transmission, e.g., voice response, speech recognition, or speaker verification. In general, greater complexity is the price paid to lower the bit rate and increase the flexibility. However, tradeoffs can generally be made among these three factors. In transmission and voice response applications the quality and intelligibility of the reconstituted speech are also prime considerations. Most of the techniques we will discuss are capable of producing good quality, highly intelligible speech, although some of the techniques are primarily analysis methods and as such are limited to applications where the speech signal need not be reconstructed.

In the remainder of this paper, we will discuss a number of digital representations that span the spectrum of possibilities in each of the above areas of concern. We shall begin with the simplest, least efficient, and least flexible representation of speech and progress to more complex ones which have the greatest flexibility and lowest bit rate.

III. DIGITAL WAVEFORM CODING

Conceptually, the simplest digital representations of speech are concerned with direct representation of the speech waveform. Such schemes as PCM, DM, and DPCM are all based on Shannon's sampling theorem, which says that any bandlimited signal can be exactly reconstructed from samples taken periodically in time if the sampling rate is at least twice the highest frequency of the signal. We begin with a discussion of the simplest waveform coding technique, i.e., PCM.

A. PCM

In applying the sampling theorem to a digital representation of speech there are two main concerns. These are depicted in Fig. 4. If the signal bandwidth is W hertz, then the sampling rate must be at least 2W samples/s.


    Fig. 4. Sampling and quantizing of an analog signal.

If we represent the samples as B-bit binary words, then the bit rate is 2BW bits/s. The value of W required for speech signals depends on the ultimate use of the samples. We know from measurements and theoretical studies that speech sounds such as fricatives have rather wide bandwidths (on the order of 10 kHz). On the other hand, much of the information required for speech intelligibility is contained in the variation of the first three formant frequencies of voiced speech, and these are typically below 3 kHz. Thus, a sampling rate between 6 kHz and 20 kHz is generally used. No matter what the sampling rate is, the speech signal must be suitably low-pass filtered prior to the sampling process to eliminate undesired high frequencies of the speech and high frequency noise.

The choice of the number of bits per sample B is also dependent upon the intended use of the samples. If our purpose is transmission or computer storage followed by conversion back to an analog signal, we are only concerned that the resulting analog signal be perceptually acceptable. Also, the sampling process just described is generally the first step in any digital speech analysis technique. Since errors incurred in the sampling process will propagate to more refined digital representations, we are often justified in a very generous allotment of bits and sampling rate if the sampled speech wave is to undergo further processing. However, it should be noted that the amount of processing required to implement most systems is proportional to sampling rate. Thus we should try to keep the sampling rate as low as possible, consistent with other objectives.

One objective measure of the fidelity of the PCM representation is the ratio of the signal power to the quantization noise power. If we define the quantization noise in Fig. 4 as

$$e(nT) = x(nT) - Q[x(nT)] = x(nT) - y(nT)$$

then it can be shown [4], [7] that about 11 bits are required in order that the signal-to-noise ratio (SNR) be 60 dB. (This is often referred to as "toll quality.") It is easily shown that the addition of one bit changes the SNR by 6 dB.

The preceding discussion can be summarized by stating that an adequate PCM representation for most purposes requires from 66 000 bits/s (11 bits X 6 kHz) to 220 000 bits/s (11 bits X 20 kHz). This is a very significant consideration in transmission or storage for processing on a computer.
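The 6-dB-per-bit behavior is easy to check numerically. A minimal sketch, assuming a uniform mid-riser quantizer and a full-range uniformly distributed test signal (real speech, with its high peak factor, measures lower, which is why 11 bits rather than 10 are quoted for 60 dB):

```python
import numpy as np

def uniform_quantize(x, bits, xmax=1.0):
    # mid-riser quantizer: 2**bits levels with step delta spanning [-xmax, xmax]
    delta = 2 * xmax / 2 ** bits
    y = delta * (np.floor(x / delta) + 0.5)
    return np.clip(y, -xmax + delta / 2, xmax - delta / 2)

x = np.random.uniform(-1, 1, 100_000)          # stand-in signal using the full range
for bits in (9, 10, 11):
    e = x - uniform_quantize(x, bits)          # quantization noise e(nT)
    snr = 10 * np.log10(np.mean(x ** 2) / np.mean(e ** 2))
    print(bits, round(snr, 1))                 # SNR rises by about 6 dB per bit
```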

Since we generally have little flexibility in lowering the sampling rate, as this is governed by other considerations, the only way to lower the bit rate is to reduce the number of bits per sample. In doing so we encounter a basic property of speech, namely, that speech has a wide dynamic range. We see from Fig. 4 that if B is fixed, then the step size Δ must be chosen so that Δ·2^B spans the maximum peak-to-peak range of the signal. Thus the quantizer step size is determined by the amplitude of the voiced segments of speech, whereas a good representation of unvoiced segments requires a much smaller step size.

One solution to this problem is to use a nonlinear quantizer characteristic which distributes the quantization levels more densely at low amplitudes than at high amplitudes. Based on empirical determinations of the amplitude distribution of speech signals, a logarithmic quantizer characteristic has been found to be nearly optimum [9]. Using a logarithmic quantizer, 7 bits/sample are sufficient to obtain toll quality. An alternative approach is the use of a time-varying step size [5]-[7], i.e., an adaptive quantizer. When the signal level is low, a small step size is used; and when the signal amplitude is large, an appropriately large step size is used. The adjustment of the step size may be done by logical operations on the sequence of coded output samples.
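The paper does not commit to a particular logarithmic characteristic; the sketch below uses the familiar μ-law curve (μ = 255) as one standard instance, quantizing uniformly in the compressed domain so that low-amplitude (unvoiced) samples receive finer effective steps:

```python
import numpy as np

MU = 255.0   # mu-law constant; an assumption, not a value given in the paper

def compress(x):                      # x normalized to [-1, 1]
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def log_pcm(x, bits=7):               # 7 bits/sample suffice for toll quality
    delta = 2.0 / 2 ** bits
    y = compress(x)
    yq = delta * (np.floor(y / delta) + 0.5)
    return expand(np.clip(yq, -1 + delta / 2, 1 - delta / 2))
```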


    Fig. 5. General differential quantization scheme.


Fig. 6. Illustration of delta modulation. (a) Fixed step size. (b) Adaptive step size.

A more general predictor polynomial is of the form

$$P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}. \quad (3)$$

The basic principle of linear prediction is applied in more generality in Section VII. In using differential quantization, we are free to choose the sampling rate, the quantizer, and the predictor so as to reduce the bit rate. If the sampling rate is much higher than the Nyquist rate, the correlation between adjacent samples is very close to one, and it is possible to use a 1-bit quantizer to obtain a good approximation to the input samples. This case, illustrated in Fig. 6(a), is called DM. In Fig. 6(a), we have illustrated how x(n) and its staircase approximation vary with time. (We have shown the case where the predictor coefficient α₁ = 1.) The quantized difference signal has the form

$$\hat{\delta}(n) = \Delta \cdot c(n)$$

where

$$c(n) = \begin{cases} +1, & \text{if } \delta(n) \ge 0 \\ -1, & \text{if } \delta(n) < 0 \end{cases}$$

and Δ is the fixed step size. Fig. 6(a) shows the two types of errors that are inherent in differential quantization schemes. On the left of the figure, the slope of the waveform is greater than the maximum rate of increase of the staircase approximation; this type of error is called slope overload. In the slowly


varying parts of the waveform there is a tendency to oscillate up and down about the waveform. This is called granular distortion. In such regions we would like to have a smaller step size to reduce the magnitude of the quantization error.

One solution to this dilemma is to let the step size vary so that Δ becomes large during slope overload and small during granular distortion. This can be done by searching for patterns in the code word sequence c(n). For example, a run of +1's or -1's means slope overload, while an alternating pattern means granularity. A simple logic for varying the step size is [6]

$$\Delta(n) = \begin{cases} P\,\Delta(n-1), & \text{if } c(n) = c(n-1) \\ Q\,\Delta(n-1), & \text{if } c(n) \ne c(n-1). \end{cases}$$

The quantized difference signal is now

$$\hat{\delta}(n) = \Delta(n) \cdot c(n).$$

An optimum choice of the parameters is [6]

$$P = 1.5, \quad Q = 1/P.$$

This scheme is illustrated by Fig. 6(b). (Here, for simplicity, we have assumed P = 2 and α₁ = 1.) It can be seen that this adaptive delta modulator (ADM) is able to follow rapid increases in slope, and also it is able to use a smaller step size in regions of granularity. In practice, limits are placed on the step size variation so that Δ_min ≤ Δ(n) ≤ Δ_max.
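A minimal sketch of such an adaptive delta modulator using the step-size logic above; the test signal, the step-size limits, and the choice P = 2 are illustrative assumptions:

```python
import numpy as np

def adm(x, delta_min=0.01, delta_max=1.0, P=2.0):
    Q = 1.0 / P
    xhat, delta, prev_c = 0.0, delta_min, 1
    codes, staircase = [], []
    for sample in x:
        c = 1 if sample - xhat >= 0 else -1            # 1-bit quantizer c(n)
        # repeated code -> slope overload -> grow step; alternation -> shrink it
        delta = min(max(delta * (P if c == prev_c else Q), delta_min), delta_max)
        xhat += c * delta                              # staircase approximation
        codes.append(c)
        staircase.append(xhat)
        prev_c = c
    return np.array(codes), np.array(staircase)

t = np.arange(800) / 8000.0
codes, approx = adm(np.sin(2 * np.pi * 50 * t))
```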


IV. TIME-DOMAIN METHODS

If we select a voiced segment of speech, the signal is characterized by its fundamental period and the amplitude of each basic period. On the other hand, we may select an unvoiced segment, where the signal is characterized by the lack of periodicity and by the amplitude of the waveform. Since these properties vary from segment to segment, it is common to analyze speech on a time-varying basis by carrying out an analysis on short segments of speech selected at uniformly spaced time intervals.

A. Peak Measurements

It is only necessary to glance at Fig. 2 to see that during voiced intervals the speech signal is characterized by a sequence of peaks that occur periodically at the fundamental frequency of the speech signal. In contrast, during unvoiced intervals the peaks are relatively smaller and do not occur in any discernible pattern. Thus the maximum peak amplitude during an analysis interval can serve as a simple indication of the amplitude of the signal and as an aid in distinguishing between voiced and unvoiced speech segments.

The time between corresponding peaks is, of course, equal to the fundamental period for voiced speech. This principle has been used in a number of schemes for determining the fundamental period or pitch period. A difficulty with this approach is that even over a short analysis interval, the speech signal is not exactly periodic. Since each period has a number of peaks, it is possible to make several different estimates of the period. A method for logically combining the results of several simple measurements of this kind to improve accuracy has been discussed by Gold and Rabiner [12], [13]. By careful choice of the basic measurements and careful design of the logic, the accuracy of the combined results is much greater than the accuracy of any of the individual estimates.

B. Energy Measurements

One of the simplest representations of a signal is its energy. In the case of a real discrete-time signal x(n), the energy is defined in general as

$$E = \sum_{n=-\infty}^{\infty} x^2(n). \quad (4)$$

For nonstationary signals such as speech, it is often more appropriate to consider a time-varying energy calculation such as the following:

$$E(n) = \sum_{m=0}^{N-1} \left[w(m)\,x(n-m)\right]^2 \quad (5)$$

where w(m) is a weighting sequence or window which selects a segment of x(n), and N is the number of samples in the window. For the simple case of w(m) = 1, E(n) is the sum of the squares of the N most recent values of x(n). Fig. 7(a) shows how the energy measurement of (5) can be viewed in terms of filtering the sequence x²(n) by a finite impulse response (FIR) filter with impulse response w²(n). It is to be expected that the function E(n) would display the time-varying amplitude properties of the speech signal. However, the definition of (5) requires careful interpretation. First there is the choice of window. The purpose of the window is to attach lower weight to speech samples which occurred farther back in time.


Fig. 7. (a) Implementation of short-time energy calculation using a finite impulse response digital filter. (b) An alternative definition of energy.

Fig. 8. Normalized energy for the word /six/.

A second difficulty involves the choice of the measurement interval N. If N is too small, i.e., less than a pitch period, E(n) of (5) will fluctuate very rapidly depending on exact details of the waveform. If N is too large, i.e., several pitch periods, E(n) will have very little variation and will not reflect the changing properties of the speech signal. A suitable practical choice of N is on the order of 100-200 for a 10-kHz sampling rate (i.e., 10-20 ms of speech).

The major significance of E(n) is that it provides a good measure for separating voiced speech segments from unvoiced speech segments: E(n) for unvoiced segments is much smaller than for voiced segments. Also, the smaller the value of N, the less smearing there is in locating the exact instant at which unvoiced speech becomes voiced and vice versa. Furthermore, for very high quality speech, the energy can be used to separate unvoiced speech from silence. One difficulty with energy measurements is that they are very sensitive to large signal levels (because they enter the computation as a square), thereby emphasizing large sample-to-sample variations in E(n). One relatively simple way of alleviating this problem is to use, as a measure of energy, the function

$$\bar{E}(n) = \sum_{m=0}^{N-1} w(m)\,|x(n-m)| \quad (6)$$

where the sum of absolute values is computed instead of the sum of squares. Fig. 7(b) shows an interpretation of (6) as a linear filtering operation on |x(n)|. Fig. 8 shows the energy function for the word /six/ for a 10-ms rectangular window. It is easy to see the low energy fricative regions at the beginning and end of /six/, and the stop gap region during the /k/, for which the energy is quite low.
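Both (5) and (6) are sliding sums and can be computed by convolution, as Fig. 7 indicates. A sketch assuming a rectangular window (w(m) = 1) and N = 150 at a 10-kHz rate, within the 100-200 range suggested above:

```python
import numpy as np

def short_time_energy(x, N=150):
    # E(n): sum of squares of the N most recent samples (eq. (5) with w(m) = 1)
    return np.convolve(x ** 2, np.ones(N))[: len(x)]

def short_time_magnitude(x, N=150):
    # eq. (6): sum of |x|, less dominated by occasional large samples
    return np.convolve(np.abs(x), np.ones(N))[: len(x)]
```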


C. Zero Crossing Measurements

Another very simple time domain analysis method is based on zero crossing measurements. In the context of a digital implementation, a zero crossing can be said to occur between sampling instants n and n - 1 if

$$\text{sign}[x(n)] \ne \text{sign}[x(n-1)]. \quad (7)$$

This measurement is trivial to implement and is often used as a gross estimate of the frequency content of a speech signal. Its use is motivated by the observation that if the signal is a sinusoid of frequency f₀, then the average number of zero crossings is

$$n_r = 2 f_0 \ \text{crossings/s}. \quad (8)$$

However, the interpretation of zero crossing measurements for speech is much less precise because of the broad frequency spectrum of most speech sounds. Nevertheless, very crude estimates of spectrum properties such as this may often suffice. For example, it is well known that the energy of voiced speech tends to be concentrated below 3 kHz, whereas the energy of fricatives generally is concentrated above 3 kHz. Thus, zero crossing measurements (along with energy information) are often used in making a decision about whether a particular segment of speech is voiced or unvoiced. If the zero crossing rate is high, the implication is unvoiced; if the zero crossing rate is low, the segment is most likely to be voiced. Zero crossing measurements, coupled with a pitch detection scheme, provide a useful approach to estimation of excitation parameters [34]. Zero crossing measurements have also been useful as representations of speech signals for speech recognition [14].

In implementing zero crossing measurements digitally, there are a number of important considerations. Although the basic algorithm requires only a comparison of the signs of two successive samples, special care must be taken in the sampling process. Noise, dc offset, and 60-Hz hum have disastrous effects on zero crossing measurements. Thus, for zero crossing measurements a bandpass filter rather than a low-pass filter may be necessary prior to sampling to avoid these difficulties. Also, the sampling period T determines the time resolution of the zero crossing measurements; thus fine resolution requires a high sampling rate. However, very crude quantization (1 bit in fact) is all that is necessary to preserve the zero crossing information.
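A minimal zero-crossing-rate sketch; the sinusoid test case confirms the 2f₀ rate of (8):

```python
import numpy as np

def zero_crossing_rate(x, fs):
    # count sign changes between successive samples (eq. (7)), scaled to crossings/s
    s = np.sign(x)
    s[s == 0] = 1                       # treat exact zeros as positive
    return np.sum(s[1:] != s[:-1]) * fs / (len(x) - 1)

fs = 10_000
t = np.arange(fs) / fs
print(zero_crossing_rate(np.sin(2 * np.pi * 1000 * t), fs))   # approximately 2000
```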

D. Short-Time Autocorrelation Analysis

The autocorrelation function of a discrete-time signal x(n) is defined as

$$\varphi(m) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n)\,x(n+m).$$

The autocorrelation function is useful for displaying structure in any waveform, speech being no exception. For example, if a signal is periodic with period P, i.e., x(n + P) = x(n) for all n, then it is easily shown that

$$\varphi(m) = \varphi(m+P). \quad (9)$$


Conversely, a rapid decay of φ(m) as m increases indicates a lack of predictable structure in the signal. As we have observed, speech is not a stationary signal. However, the properties of the speech signal remain fixed over relatively long time intervals. As we have already seen, this leads to the notion of short-time analysis techniques that operate on short segments of the speech signal. For example, consider a segment of N samples of the signal

$$x_l(n) = x(n+l), \quad 0 \le n \le N-1 \quad (10)$$

where l denotes the beginning of the segment. Then the short-time autocorrelation function can be defined as

$$\varphi_l(m) = \frac{1}{N} \sum_{n=0}^{N'-1} x_l(n)\,x_l(n+m), \quad 0 \le m \le M_0 - 1 \quad (11)$$

where M₀ denotes the maximum lag that is of interest. For example, if we wish to observe periodicity in a waveform, then we would require M₀ > P. The integer N' is for the moment unspecified.

We can interpret (11) as the autocorrelation of a segment of the speech signal of length N samples beginning at sample l. If N' = N, then data from outside the segment l ≤ n ≤ N + l - 1 is used in the computation. If N' = N - m, then only data from that interval is required. In this case, the segment is often weighted by a window function that smoothly tapers the ends of the segment to zero. In using the autocorrelation function to detect periodicity in speech, either choice is satisfactory; however, we shall see in Section VII that the distinction is important in analysis methods based on linear prediction. In either case, the direct computation of φ_l(m) for 0 ≤ m ≤ M₀ - 1 requires computational effort proportional to M₀·N. This can be a significant overhead factor.

Short-time analysis methods typically are applied to estimate parameters of the speech model discussed in Section II. The normal assumption is that although a sampling rate ranging from 6 kHz to 20 kHz may be necessary to preserve the essential features of the speech signal in a PCM representation, much lower sampling rates suffice for the slowly varying parameters of the model (50 to 100 Hz is typical). Suppose, for example, that the sampling rate of the speech signal is 10 kHz and the short-time autocorrelation is to be computed 100 times/s. The estimate of the autocorrelation is generally based upon from 20- to 40-ms segments of the speech signal. (For estimates of periodicity, the window must be long enough to encompass at least two periods of the speech signal.) Thus, for a 10-kHz sampling rate, each segment contains from 200 to 400 samples.
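A direct implementation of (11) with the N' = N - m choice, so only data inside the segment is used; the 30-ms segment and the test signal are illustrative:

```python
import numpy as np

def short_time_autocorr(x, l, N, max_lag):
    # phi_l(m) = (1/N) sum_n x_l(n) x_l(n+m), using only the N-sample segment at l
    seg = x[l : l + N]
    return np.array([np.dot(seg[: N - m], seg[m:]) / N for m in range(max_lag)])

fs = 10_000
x = np.sin(2 * np.pi * 100 * np.arange(3000) / fs)   # 100-Hz "voiced" stand-in
phi = short_time_autocorr(x, l=0, N=300, max_lag=200)
pitch_period = 20 + int(np.argmax(phi[20:]))         # strongest peak beyond small lags
```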


    Fig. 9. Illustration of center clipping.


Fig. 10. Sequence of autocorrelation functions for center-clipped speech.

Fig. 10 shows a sequence of autocorrelation functions, each estimated from 30-ms segments of center-clipped speech selected at intervals of 15 ms (a 66 2/3-Hz analysis rate). From a set of correlation functions of this type it is possible to estimate the pitch period simply by locating the strong peak that is in evidence during voiced intervals. Sondhi [15] gives a detailed discussion of this method of pitch detection.
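A sketch of a center clipper in the spirit of Fig. 9. The clipping level, here a fixed fraction of the segment's peak amplitude, is an assumption; the idea is to flatten formant-related structure so that the autocorrelation peak at the pitch period stands out:

```python
import numpy as np

def center_clip(x, fraction=0.3):
    C = fraction * np.max(np.abs(x))   # clipping level (assumed rule of thumb)
    y = np.zeros_like(x)
    y[x > C] = x[x > C] - C            # samples within +/- C of zero are removed
    y[x < -C] = x[x < -C] + C
    return y
```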


V. SHORT-TIME SPECTRUM ANALYSIS

Short-time spectrum analysis has traditionally been one of the most important speech processing techniques. As we have previously stated, the fundamental assumption underlying any short-time analysis method is that over a long time interval speech is nonstationary, but that over a sufficiently short time interval it can be considered stationary. Thus, the Fourier transform of a short segment of speech should give a good spectral representation of the speech during that time interval. Measurement of the short-time spectrum is the basic operation in the channel vocoder [19], [26], the phase vocoder

[18], spectrogram displays [21], [23], and some speech recognition systems [20]. Two methods are commonly used for implementing short-time Fourier analysis. The first uses a bank of bandpass filters. This method was originally used with analog filters, and it can be implemented with even greater precision and flexibility with digital filters. The second method uses a fast Fourier transform (FFT) algorithm. This method is fundamentally digital and has no analog counterpart. When implemented on a computer, the FFT method is generally computationally superior to the bank-of-filters model.

A. Filter Banks for Short-Time Spectrum Analysis

Fig. 11 shows a simple way of implementing a short-time spectrum analyzer using a bank of bandpass filters. If the filter passbands are chosen to cover the speech band, then, roughly speaking, the outputs can be thought of as a Fourier representation of the input speech signal. If the filters are carefully designed, the sum of all the filter outputs will be a good approximation to the original speech signal [24]. This is the basis for communication systems such as the channel vocoder and the phase vocoder. Based on some fundamental ideas of spectrum analysis, the discrete short-time spectrum of x(n) is defined as

$$X_l(\omega) = \sum_{n=-\infty}^{\infty} x(n)\,h(l-n)\,e^{-j\omega n} \quad (12a)$$

$$= |X_l(\omega)|\,e^{j\theta_l(\omega)} \quad (12b)$$

$$= a_l(\omega) - j\,b_l(\omega). \quad (12c)$$

Equation (12) can be interpreted in a number of ways. As shown in Fig. 12, one interpretation is that X_l(ω) is the Fourier transform of the sequence x(n) weighted by a window h(l - n). Thus the short-time Fourier transform is a function of both frequency ω and the discrete time index l. A second interpretation follows if we assume that h(n) is the impulse response of a low-pass digital filter. Assume that we wish to evaluate the short-time transform at frequency ω. Then X_n(ω) is seen to be the output of the low-pass filter with input x(n)e^{-jωn}. This is depicted in Fig. 13(a). To avoid complex arithmetic, the system of Fig. 13(a) is generally implemented as shown in Fig. 13(b), where the output parameters are a_n(ω) and b_n(ω), the real and imaginary parts of the spectrum. The bandwidth of the low-pass filter determines the frequency resolution. Typically, this bandwidth is on the order of 50 Hz. Thus the spectrum signals can be sampled at a much lower rate (about 100 Hz) than the speech signal itself. Using digital filters, it has been shown [24], [25] that the complete set of spectral signals can be represented with a total bit rate on the order of the bit rate required for a comparable PCM representation.
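One channel of the analyzer of Fig. 13 can be sketched as follows; the low-pass filter here is a crude moving average standing in for a properly designed filter with roughly 50-Hz bandwidth:

```python
import numpy as np

def channel_signals(x, omega, h):
    # heterodyne x(n) by exp(-j omega n), then low-pass filter with h(n);
    # the filter output is the short-time transform at frequency omega
    n = np.arange(len(x))
    out = np.convolve(x * np.exp(-1j * omega * n), h)[: len(x)]
    return out.real, -out.imag          # a_n(omega), b_n(omega), per (12c)

h = np.ones(400) / 400                  # placeholder low-pass at a 10-kHz rate
a, b = channel_signals(np.random.randn(5000), 2 * np.pi * 1000 / 10_000, h)
```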


Fig. 11. A bank of bandpass filters.


Fig. 12. Illustration of computation of the short-time Fourier transform.

Fig. 13. Short-time Fourier analysis and synthesis for one channel centered at ω.

However, the resulting representation of the speech signal permits greater flexibility in the sense that the spectral parameters a_n(ω) and b_n(ω) provide information about the parameters of the speech model in a convenient and useful form. For example, the time and frequency dimensions of a speech signal can be independently manipulated through simple manipulations of the spectral parameters [18].

B. Use of the FFT for Short-Time Spectrum Analysis

The FFT is a set of highly efficient algorithms for evaluating the discrete Fourier transform (DFT) expressions


Fig. 14. (a) Log magnitude of the short-time transform. (b) Corresponding windowed speech segment. (N = 500.)

$$X(k) = \sum_{n=0}^{M-1} x(n)\,e^{-j(2\pi/M)kn}, \quad k = 0, 1, \ldots, M-1 \quad (13)$$

and

$$x(n) = \frac{1}{M} \sum_{k=0}^{M-1} X(k)\,e^{j(2\pi/M)kn}, \quad n = 0, 1, \ldots, M-1. \quad (14)$$

In using these expressions, it is convenient to define the short-time transform as

$$X_l(\omega) = \sum_{n=0}^{N-1} x_l(n)\,w(n)\,e^{-j\omega n} \quad (15a)$$

where

$$x_l(n) = x(n+l), \quad n = 0, 1, \ldots, N-1; \quad l = 0, L, 2L, \ldots \quad (15b)$$

As in the case of the short-time autocorrelation function, we interpret (15a) as the Fourier transform of a segment of speech N samples long (weighted by a window w(n)), beginning at sample l. The frequency resolution of the spectrum measurement is inversely proportional to the window length N. This is illustrated in Fig. 14. Fig. 14(a) shows the short-time transform and Fig. 14(b) shows the corresponding windowed segment of speech data. A Hamming window [17] of length 50 ms was used (N = 500 samples at a 10-kHz sampling rate). Note that the individual harmonics of the pitch period are resolved in the short-time transform. Figs. 15(a) and (b) show the short-time transform and the windowed speech for N = 50 samples. (The speech segment is the first 50 samples of the segment shown in Fig. 14(b).) In this case the frequency resolution is much less than in Fig. 14. We note that the spectrum of Fig. 14 could be considered comparable to a conventional narrow-band spectrogram measurement, while Fig. 15 is comparable to a conventional wide-band spectrogram analysis. In particular, Figs. 14 and 15 show typical spectral characteristics of voiced speech.
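A sketch of the FFT-based measurement of (15), assuming a Hamming window and zero-padding to M = 1024 points (the device formalized in (16) below):

```python
import numpy as np

def short_time_spectrum_db(x, l, N, M=1024):
    seg = x[l : l + N] * np.hamming(N)            # x_l(n) w(n), eq. (15a)
    X = np.fft.fft(np.concatenate([seg, np.zeros(M - N)]))
    return 20 * np.log10(np.abs(X[: M // 2]) + 1e-10)

x = np.random.randn(6000)    # stand-in; with real speech at 10 kHz,
narrow = short_time_spectrum_db(x, l=1000, N=500)   # resolves pitch harmonics (Fig. 14)
wide = short_time_spectrum_db(x, l=1000, N=50)      # formant envelope only (Fig. 15)
```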


Fig. 15. (a) Log magnitude of the short-time transform. (b) Corresponding windowed speech segment. (N = 50.)

An FFT algorithm can be used to compute (15) at equally spaced frequencies ω_k = 2πk/M, for k = 0, 1, ..., M - 1. If M ≥ N, then the sequence x_l(n)w(n) must be augmented with M - N zero-valued samples to form a sequence of length M. In this case we can compute

$$X_l\!\left(\frac{2\pi k}{M}\right) = \sum_{n=0}^{M-1} x_l(n)\,w(n)\,e^{-j(2\pi/M)kn}, \quad k = 0, 1, \ldots, M-1 \quad (16)$$

using an FFT algorithm. On the other hand, if M < N, the windowed sequence must first be reduced to length M by time-aliasing before the M-point DFT is computed.



Digital spectral analysis can also produce displays comparable to conventional spectrograms [21]-[23], [28]. As we have pointed out, there is great flexibility for computer spectral analysis in window length and shape, or equivalently frequency resolution. Also, the spectrum can be shaped in a manner to enhance it for display, and it is possible to precisely correlate the speech waveform with the spectrographic display. Such schemes have been implemented in a variety of ways, but most of them use (15) to compute a set of short-time spectra at equally spaced time intervals. This set of spectra can be thought of as samples of the two-dimensional function X_l(ω), which can be plotted as a frequency-time-intensity plot on an oscilloscope or television monitor. Using such techniques it has been possible to produce on-line spectrogram displays that are equal in quality to conventional spectrograms and far surpass them in flexibility and innovation. An example of one approach is shown in Fig. 16 [23].

Fig. 17. Homomorphic processing of speech. (a) Basic operations. (b) Analysis for voiced speech. (c) Analysis for unvoiced speech.

VI. HOMOMORPHIC SPEECH PROCESSING

Homomorphic filtering is a class of nonlinear signal processing techniques that is based on a generalization of the principle of superposition that defines linear systems. Such techniques have been applied in separating signals that have been combined by multiplication and convolution [31]. The application of these techniques to speech processing is again based on the assumption that although speech production is a time-varying process, it can be viewed on a short-time basis as the convolution of an excitation function (either random noise or a quasi-periodic pulse train) with the vocal tract impulse response.

A. Fundamentals

A homomorphic system for speech analysis is shown in Fig. 17(a). We assume that the signal at A is the discrete convolution of the excitation and the vocal tract impulse response. Then the short-time Fourier transform (i.e., the spectrum of the windowed signal), computed using the FFT method of the previous section, is the product of the Fourier transforms of the excitation and the vocal tract impulse response. Taking the logarithm of the magnitude of the Fourier transform, we obtain at C the sum of the logarithms of the transforms of the excitation and vocal tract impulse response. Since the inverse discrete Fourier transform (IDFT) is a linear operation, the result at D (called the cepstrum of the input at A) is an additive combination of the cepstra of the excitation and vocal tract components. Thus, the effect of the operations (windowing, DFT, log magnitude, and IDFT) is to approximately transform convolution into addition. The value of this transformation can be seen from Fig. 17(b), which depicts the results of such an analysis for voiced speech. The curve labeled A is the input speech segment that has been multiplied by a Hamming window. The rapidly varying curve labeled C is the log magnitude of the short-time transform. It consists of a slowly varying component due to the vocal tract transmission, and a rapidly varying periodic component due to the periodic excitation. The slowly varying part of the log magnitude produces the low-time part of the cepstrum (D), and the rapidly varying periodic component of the log magnitude produces a strong peak in the cepstrum at a time equal to the fundamental period. If we assume that the vocal tract transfer function has


the form of an all-pole model

$$H(z) = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}} \quad (21)$$

then the cepstrum of the vocal tract component of the convolution can be shown [30], [47] to be

$$\hat{h}(n) = \begin{cases} 0, & n < 0 \\ \log A, & n = 0 \\ \displaystyle a_n + \sum_{k=1}^{n-1} \frac{k}{n}\,\hat{h}(k)\,a_{n-k}, & n > 0. \end{cases} \quad (22)$$

If we assume that the excitation component is a periodic train of impulses, then it can be shown [30] that the cepstrum of the excitation component will also be a train of impulses with the same spacing as the input impulse train. This is clearly reflected in the cepstrum for voiced speech in Fig. 17(b). The important point is that the cepstrum consists of an additive combination in which (due to the 1/n falloff) the vocal tract and excitation components essentially do not overlap. The situation for unvoiced speech, shown in Fig. 17(c), is much the same, with the exception that the random nature of the excitation component of the input speech segment (A) causes a rapidly varying random component in the log magnitude (C). Thus in the cepstrum (D), the low-time components correspond as before to the slowly varying vocal tract transfer function; however, since the rapid variations of the log magnitude are not, in this case, periodic, there is no strong peak as for the voiced speech segment. Thus, the cepstrum serves as an excellent basis for estimating the fundamental period of voiced speech and for determining whether a particular speech segment is voiced or unvoiced [29].

The vocal tract transfer function, often called the spectrum envelope, can be obtained by removing the rapidly varying components of the log magnitude spectrum by linear filtering. One approach to this filtering operation involves computing the IDFT of the log magnitude spectrum (to give the cepstrum), multiplying the cepstrum by an appropriate window that passes only the low-time components, and then computing the DFT of the resulting windowed cepstrum. This method corresponds to the fast convolution method [45]-[49], in this case being applied to filter a function of frequency rather than a function of time. The results for voiced and unvoiced speech segments are labeled E in Figs. 17(b) and (c), respectively.

The smoothed spectrum obtained by the above method is in many respects comparable to a short-time spectrum obtained by direct analysis using a short data window. The major difference, however, is that the cepstrum method is based upon the initial computation of a narrow-band spectrum, which involves a wide time window, while the wide-band spectrum is computed using a very narrow time window. The smoothing is done upon a narrow-band log-magnitude spectrum rather than upon the short-time Fourier transform itself, as is the case for wide-band analysis. Thus, for speech segments in which the basic parameters such as pitch period and formant frequencies are not changing, we should expect the cepstrum method to give a better representation of the spectrum envelope, while for rapidly changing speech the wide-band analysis may produce a better representation due to its shorter averaging time.
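The chain of Fig. 17(a) and the cepstral smoothing just described can be sketched as follows. The 2-ms lifter cutoff and the voicing threshold are assumptions chosen for illustration:

```python
import numpy as np

def cepstral_analysis(frame, fs, n_fft=1024, cutoff_ms=2.0):
    # A -> B -> C -> D of Fig. 17(a): window, DFT, log magnitude, inverse DFT
    w = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(w, n_fft)) + 1e-10)
    cep = np.fft.ifft(log_mag).real                    # cepstrum (D)

    cut = int(cutoff_ms * 1e-3 * fs)                   # low-time lifter
    liftered = np.zeros(n_fft)
    liftered[:cut] = cep[:cut]
    liftered[-cut + 1 :] = cep[-cut + 1 :]             # keep the symmetric part
    envelope = np.fft.fft(liftered).real               # smoothed log spectrum (E)

    peak = cep[cut : n_fft // 2]                       # search beyond the low-time part
    voiced = peak.max() > 0.1                          # threshold is an assumption
    pitch = (cut + int(np.argmax(peak))) / fs if voiced else None
    return envelope, pitch

env, pitch = cepstral_analysis(np.random.randn(400), fs=10_000)
```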

B. Estimation of Formant Frequencies and Pitch Period

The results depicted in Fig. 17 suggest algorithms for estimating basic speech parameters such as pitch period and formant frequencies. Specifically, voiced/unvoiced classification of the excitation is indicated by the presence or absence of a strong peak in the cepstrum [29]. The presence of a strong peak for voiced speech is dependent upon there being many harmonics present in the spectrum. In cases where this is not true, such as voiced stops, zero crossing measurements are helpful in distinguishing voiced from unvoiced speech [34]. If a strong peak is present, its location is a good indicator of the pitch period.

The smoothed spectrum retains peaks at the vocal tract resonances or formant frequencies. One approach to estimating the formants is to search the smooth spectra for peaks and then decide which peaks correspond to formants [34]. Another approach uses iterative methods to adjust the parameters of a model similar to (21) until a good match to the smooth spectrum is obtained [33].

An illustration of the use of homomorphic processing is given in Fig. 18. On the left are shown a sequence of cepstra computed at 20-ms intervals. The strong peak indicates that the speech is voiced during the entire interval. On the right are successive short-time spectra and homomorphically smoothed short-time spectra. The lines connecting the peaks of the smooth spectra show the formant frequencies automatically estimated from the spectrum peaks. The peak-picking approach is relatively simple except when two formants merge, as in the third and fourth frames from the top and the last 4 frames from the bottom. In this case it is useful to evaluate the vocal tract transfer function on a contour which passes closer to the poles, thereby sharpening the resonances [34]. Speech can be synthesized from formant and pitch data by using the estimated parameters to vary the parameters of the model of Fig. 3. With efficient coding of the parameters, speech is thus represented by about 1000 bits/s [2]. In addition to this high efficiency, the formant representation offers great flexibility in manipulating basic speech parameters. Also, since so much of the speech model is built into the representation, these parameters are very useful for other purposes such as speech recognition and speaker verification.

C. The Cepstrum as a Representation of Speech

The low-time samples of the cepstrum contain mostly information about the vocal tract transfer function H(z) of (21). It can be shown [31], [47] that the following recurrence formula relates the vocal tract impulse response h(n) to the cepstrum ĥ(n) of (22):

$$h(n) = \begin{cases} \displaystyle\sum_{k=1}^{n} \frac{k}{n}\,\hat{h}(k)\,h(n-k), & n > 0 \\ e^{\hat{h}(0)}, & n = 0. \end{cases} \quad (23)$$

Also, using (23) it is easily shown that the coefficients a_k in (21) are related to the cepstrum by

$$a_n = \hat{h}(n) - \sum_{k=1}^{n-1} \frac{k}{n}\,\hat{h}(k)\,a_{n-k}. \quad (24)$$

Since the cepstrum contains all of the information of the short-time spectrum, it can be viewed as still another representation of the speech signal. This principle has been applied in a speech analysis-synthesis scheme called the homomorphic vocoder [32]. In this system, the low-time cepstrum values and an estimate of pitch period serve as a representation of the speech signal from which an acoustic wave can be reconstructed.


Fig. 19. Digital model for speech production.

VII. LINEAR PREDICTIVE ANALYSIS

Among the most useful methods of speech analysis are those based upon the principle of linear prediction. These methods are important because of their accuracy and their speed of computation. In this section, we present a formulation of linear predictive analysis and discuss some of the issues which are involved in using it in practical speech applications.

The basic idea behind linear predictive coding (LPC) is that a sample of speech can be approximated as a linear combination of the past p speech samples. By minimizing the squared difference between the actual speech samples and the linearly predicted ones, one can determine the predictor coefficients, i.e., the weighting coefficients of the linear combination. The basic philosophy of this scheme is reminiscent of the differential quantization schemes of Section III, where an adaptive predictor can be applied in a differential quantization scheme to reduce the bit rate of the digital representation of the speech waveform. In this case, as in linear predictive analysis, the predictor coefficients must be adapted (i.e., updated regularly) to match the time-varying properties of the speech signal.

A. Fundamental Principles

The use of linear predictive analysis is suggested by the digital model of Section II. Assume that samples of the speech signal are produced by the model of Fig. 3, where over a short time interval the linear system has the transfer function

$$H(z) = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}. \quad (25)$$

Fig. 18. Cepstra and spectra for a region of voiced speech.

For voiced speech, the system is excited by an impulse train, and for unvoiced speech it is excited by random white noise, as depicted in Fig. 19. Linear prediction analysis is based on the observation that for such a system the speech samples x(n) are related to the excitation δ(n) by the following difference equation:

$$x(n) = \sum_{k=1}^{p} a_k\,x(n-k) + \delta(n). \quad (26)$$

Suppose that we process the speech signal with a linear predictor, i.e.,

$$\tilde{x}(n) = \sum_{k=1}^{p} \alpha_k\,x(n-k).$$

Then the prediction error is defined as

$$\epsilon(n) = x(n) - \tilde{x}(n) = x(n) - \sum_{k=1}^{p} \alpha_k\,x(n-k). \quad (27)$$

Note that in this case the prediction is based on the unquantized samples x(n), whereas in Section III the prediction was based on quantized samples. It can be seen by comparing (26) and (27) that if α_k = a_k, and if the speech signal really does obey the model of (26), then ε(n) = δ(n). Therefore, between the excitation impulses of voiced speech, the prediction error should be very small if the predictor coefficients α_k are equal to the parameters a_k of the vocal tract transfer function. Thus the predictor polynomial

$$1 - \sum_{k=1}^{p} \alpha_k z^{-k}$$



is a good approximation to the denominator of the vocal tract transfer function.

One approach for obtaining the predictor coefficients is based on minimizing the average squared prediction error over a short segment of the speech waveform. That is, we search for the values of α_k that minimize

$$E_l = \sum_{n=0}^{N-1} \left(x_l(n) - \tilde{x}_l(n)\right)^2 = \sum_{n=0}^{N-1} \left[x_l(n) - \sum_{k=1}^{p} \alpha_k\,x_l(n-k)\right]^2 \quad (28)$$

where x_l(n) is a segment of speech that has been selected in the vicinity of sample l, i.e.,

$$x_l(n) = x(n+l).$$

There are two basic ways of choosing x_l(n), each leading to procedures that are somewhat different in the details of their implementation and the results that are obtained. Leaving x_l(n) unspecified for now, we can find the values of α_k that minimize E_l in (28) by setting ∂E_l/∂α_i = 0, i = 1, 2, ..., p, thus obtaining the equations

$$\sum_{n=0}^{N-1} x_l(n)\,x_l(n-i) = \sum_{k=1}^{p} \alpha_k \sum_{n=0}^{N-1} x_l(n-k)\,x_l(n-i), \quad 1 \le i \le p. \quad (29)$$

If we define

$$\varphi_l(i, k) = \sum_{n=0}^{N-1} x_l(n-i)\,x_l(n-k) \quad (30)$$

then (29) can be written more compactly as

$$\sum_{k=1}^{p} \alpha_k\,\varphi_l(i, k) = \varphi_l(i, 0), \quad i = 1, 2, \ldots, p. \quad (31)$$

This set of p equations in p unknowns can be solved for the unknown predictor coefficients that minimize the average squared prediction error for the segment x_l(n). To do this, the quantities φ_l(i, k) must be computed for 1 ≤ i ≤ p and 1 ≤ k ≤ p. The details of this computation depend upon how x_l(n) is defined. By a simple substitution of variables, (30) can be written as

$$\varphi_l(i, k) = \sum_{n=-i}^{N-1-i} x_l(n)\,x_l(n+i-k) = \sum_{n=-k}^{N-1-k} x_l(n)\,x_l(n+k-i). \quad (32)$$

Clearly, φ_l(i, k) = φ_l(k, i). We observe from (32) that values of x_l(n) are required outside the interval 0 ≤ n ≤ N - 1.


However, if computational efficiency is important, as it usually is, some special properties of the matrix Φ can be exploited to reduce computation. In the case of (31) (the covariance method), Φ is symmetric and positive definite. Utilization of this fact leads to an efficient procedure for solving for the vector α of predictor coefficients that is based on matrix factorization. This method is called the square root method, or the Cholesky decomposition [37]. Similarly, for the autocorrelation method the matrix Φ is symmetric and positive definite and also has the property that the elements along any diagonal are equal. Such a matrix is called a Toeplitz matrix, and in this case an even more efficient method for solving the equations can be found [43]. This method is called the Levinson method.

Since computational efficiency is an important consideration in any practical speech analysis scheme, it is worthwhile comparing these two methods of linear prediction in this sense. The square root method for solving the covariance method formulation requires on the order of p³ operations (multiplications), whereas the Levinson method for solving the autocorrelation formulation requires on the order of p² operations. Thus the solution of the equations for the autocorrelation formulation is inherently faster computationally than for the covariance formulation. In particular, for p = 14, Makhoul and Wolf [39] note a ratio of computation time of 3.2 to 1 in favor of the autocorrelation method. However, this savings in computation is not significant when viewed in the total framework of the method, for two reasons. First, the time required to compute the matrix of correlations is significantly greater than the time to solve the matrix equations. For example, for N = 50, Makhoul and Wolf [39] note that it takes ten times longer to compute the matrix than to solve the matrix equations using the autocorrelation method. Thus the savings in computation of the Levinson method becomes much less significant. As a second consideration, the value of N required for both methods is not the same. For the autocorrelation method (for 10-kHz sampling) a value of N in the range 150 to 300 is generally required. For the covariance method a much smaller value of N can be used if care is taken to begin the section after a pitch pulse. In fact, Atal reports using values of N on the order of 30 with good results [36]. Thus there are many factors which determine computational efficiency.

Another difference between the two methods concerns the roots of the predictor polynomial, which are the poles of the digital filter that accounts for the vocal tract transmission properties. For stability of this system, the roots must be inside the unit circle of the z-plane. This is not guaranteed by the covariance method [36]; however, given sufficient computational accuracy, the autocorrelation method guarantees stability [39], [43].
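A sketch of the Levinson method for the autocorrelation formulation (the Levinson-Durbin recursion in its standard modern form); the segment length and p follow the ranges quoted in the text:

```python
import numpy as np

def autocorr(seg, p):
    # r(m), m = 0..p, from a windowed segment
    return np.array([np.dot(seg[: len(seg) - m], seg[m:]) for m in range(p + 1)])

def levinson_durbin(r, p):
    # solves the Toeplitz normal equations in O(p^2) operations;
    # in exact arithmetic the resulting synthesis filter is stable
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -sum(a[j] * r[i - j] for j in range(i)) / err   # reflection coefficient
        a[: i + 1] = a[: i + 1] + k * a[: i + 1][::-1]
        err *= 1.0 - k * k
    return -a[1:]                       # predictor coefficients alpha_1 .. alpha_p

seg = np.hamming(200) * np.random.randn(200)   # ~20 ms at 10 kHz (N in 150-300)
alpha = levinson_durbin(autocorr(seg, 12), 12) # p = 12, per the rule of thumb below
```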

Another consideration in using these two methods is the numerical stability of the matrix inversion. Wilkinson [44] has shown that the square-root method is very stable numerically; no such statement has been made for the Levinson method. Markel [43] has pointed out that when implemented with finite precision arithmetic, the Levinson method requires careful scaling, and it is beneficial if the speech spectrum has been equalized by a simple first-order network.

Until now we have dealt with considerations which can be treated analytically; several practical issues remain.


Fig. 20. Comparison of speech spectra. (a) Obtained by cepstrum smoothing. (b) Obtained by linear prediction.

These include the method of analog-to-digital (A/D) conversion; the effects of finite word length on the analysis; the desirability of various structures for implementing the system; and finally the ease of building the various alternatives in digital hardware. Markel [43] has provided some excellent insights into several of these issues, but most of them are as yet unresolved.

C. Uses of Linear Prediction Analysis

Once the predictor coefficients have been obtained, they can be used in various ways to represent the properties of the speech signal.

1) Spectrum Estimation: If the predictor polynomial is assumed to represent the denominator of the vocal tract transfer function, we can obtain the frequency response of the vocal tract (for a particular segment of the speech signal) as

$$H(e^{j\omega T}) = \frac{A}{1 - \sum_{k=1}^{p} \alpha_k\,e^{-j\omega kT}}. \quad (37)$$
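Evaluating (37) on a dense frequency grid is straightforward; a sketch in which the gain and the illustrative two-pole coefficient set are assumptions (in practice α would come from either LPC formulation):

```python
import numpy as np

def lpc_spectrum_db(alpha, gain=1.0, n_freq=512):
    # 20 log10 |H(e^{j w T})| for 0 <= w <= pi/T, per eq. (37)
    w = np.linspace(0, np.pi, n_freq)
    k = np.arange(1, len(alpha) + 1)
    denom = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(alpha)
    return 20 * np.log10(gain / np.abs(denom))

# a single resonance near w = pi/4 as a toy example
spectrum = lpc_spectrum_db([2 * 0.95 * np.cos(np.pi / 4), -0.95 ** 2])
```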

An example is shown in Fig. 20, where the spectrum obtained using (37), with the predictor coefficients estimated by the autocorrelation method, is compared to that obtained by cepstrum smoothing for the same segment of speech. The formant frequencies are clearly in evidence in both plots; however, Fig. 20(b) has fewer extraneous peaks. This is because p was chosen so that at most 6 (p = 12) resonance peaks could occur. To determine the appropriate value of p for a given sampling rate, a good rule of thumb is to allow one pair of poles to account for radiation and glottal effects, and one pair of poles for each formant frequency expected in the frequency range 0 ≤ ω ≤ π/T. Thus, for a 10-kHz sampling rate we expect not more than 5 formant frequencies, so p = 12 should give a good representation of the spectrum. For unvoiced speech it has been shown that a reasonably small prediction error can be obtained with a value of p on the order of 12 [36], [43].

Another point to notice is that the spectrum peaks in Fig. 20(a) are much broader than the peaks in Fig. 20(b). This is an inherent property of the homomorphic method, since Fig. 20(a) was obtained by smoothing the short-time log spectrum.

2) Formant Frequency Estimation: Smooth spectra such as Fig. 20(b) have been used in a peak-picking algorithm to estimate formant frequencies in much the same manner as spectra such as Fig. 20(a) were used [41].


Fig. 21. (a) Spectrogram of predictor polynomial. (b) Roots of predictor polynomial. (After Atal [36].)

It is evident from the spectrogram that the roots of the predictor polynomial are generally very good estimates of the formant frequencies. As with all formant analysis problems, the difficulty lies in assigning a particular formant label to a pole. Several reliable algorithms exist for doing this job [34], [41].

3) Pitch Detection: We recall that if we use the predictor coefficients as in our original formulation, then the prediction error

$$\epsilon(n) = x(n) - \sum_{k=1}^{p} \alpha_k\,x(n-k) \quad (38)$$

should appear very much like the excitation function δ(n) in Fig. 19. Thus it might be expected that the prediction error signal would be useful as a starting point for determining properties of the excitation, i.e., pitch period and voiced/unvoiced decision. Several procedures of this type have been suggested [40], [42].

4) Relation to the Cepstrum and Autocorrelation Function: In addition to the aforementioned, the basic linear prediction coefficients can be transformed directly into a number of other representations of the speech signal. For example, by solving (24) for ĥ(n), we obtain the recurrence formula

$$\hat{h}(n) = \alpha_n + \sum_{k=1}^{n-1} \frac{k}{n}\,\hat{h}(k)\,\alpha_{n-k}, \quad n > 0. \quad (39)$$

Similarly, the autocorrelation function of the vocal tract impulse response, defined as

$$r(m) = \sum_{n=0}^{\infty} h(n)\,h(n+m) \quad (40)$$

satisfies the recurrence formula

$$r(m) = \sum_{k=1}^{p} \alpha_k\,r(m-k), \quad m \ge 1. \quad (41)$$

5) Speech Synthesis: Finally, the predictor coefficients and excitation information can be used in the model of Fig. 19 to reconstruct a speech waveform [36]. In this case it is necessary to estimate the constant A in (25) as well as the parameters of the predictor polynomial. This can be done as part of the computation of the predictor coefficients [43], but in most cases A is simply chosen to match the energy of the synthetic speech to the energy of the original speech [36].

D. Discussion

D. Discussion

    The underlying assumption of linear prediction analysis is that over short sections of speech one can accurately predict the current speech sample from the preceding p samples. Although a wide variety of different formulations of this method have arisen, the inherent similarities between methods are much larger than the supposed differences. To make the decisions as to which particular method to use, what section duration to choose, etc., one must pay strict attention to the ultimate application of the method. Thus for most speech recognition applications, for example, the differences between formulations are not significant. For more stringent applications, such as analysis/synthesis, the differences may indeed be quite significant and may mean the difference between an acceptable and an unacceptable system.

VIII. SUMMARY

    In this paper, we have discussed a wide variety of digital representations of speech signals. These representations have varied in complexity, information rate, and flexibility, from simple waveform coding schemes to analysis schemes such as

homomorphic filtering and linear prediction analysis, which are directed toward the estimation of the parameters of a detailed model of speech production. We have focused our attention almost exclusively on analysis techniques that are of wide applicability. The results of most of these techniques can be applied in a variety of speech processing applications including speech recognition, speech synthesis, and speaker verification.

REFERENCES

General
[1] G. Fant, Acoustic Theory of Speech Production. The Hague, The Netherlands: Mouton, 1970.
[2] J. L. Flanagan, C. H. Coker, L. R. Rabiner, R. W. Schafer, and N. Umeda, "Synthetic voices for computers," IEEE Spectrum, vol. 7, pp. 22-45, Oct. 1970.
[3] J. L. Flanagan, Speech Analysis, Synthesis and Perception, 2nd ed. New York: Springer-Verlag, 1972.


Waveform Coding
[5] P. Cummiskey, N. S. Jayant, and J. L. Flanagan, "Adaptive quantization in differential PCM coding of speech," Bell Syst. Tech. J., pp. 1105-1118, Sept. 1973.
[6] N. S. Jayant, "Adaptive delta modulation with a one-bit memory," Bell Syst. Tech. J., pp. 321-342, Mar. 1970.
[7] N. S. Jayant, "Digital coding of speech waveforms," Proc. IEEE, vol. 62, pp. 611-632, May 1974.
[8] R. A. McDonald, "Signal-to-noise and idle channel performance of DPCM systems with particular application to voice signals," Bell Syst. Tech. J., pp. 1123-1151, 1966.
[9] J. Max, "Quantizing for minimum distortion," IRE Trans. Inform. Theory, vol. IT-6, pp. 7-12, Mar. 1960.
[10] L. H. Rosenthal, R. W. Schafer, and L. R. Rabiner, "An algorithm for locating the beginning and end of an utterance using ADPCM coded speech," Bell Syst. Tech. J., vol. 53, pp. 1127-1135, July-Aug. 1974.
[11] L. H. Rosenthal, L. R. Rabiner, R. W. Schafer, P. Cummiskey, and J. L. Flanagan, "A multiline computer voice response system utilizing ADPCM coded speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-22, pp. 339-352, Oct. 1974.

Time-Domain Methods
[12] B. Gold, "Note on buzz-hiss detection," J. Acoust. Soc. Amer., vol. 36, pp. 1659-1661, 1964.
[13] B. Gold and L. R. Rabiner, "Parallel processing techniques for estimating pitch periods of speech in the time domain," J. Acoust. Soc. Amer., vol. 46, no. 2, pp. 442-449, Aug. 1969.
[14] D. R. Reddy, "Computer recognition of connected speech," J. Acoust. Soc. Amer., vol. 42, no. 2, pp. 329-347, Aug. 1967.
[15] M. M. Sondhi, "New methods of pitch detection," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.

Short-Time Spectrum Analysis
[16] C. G. Bell, H. Fujisaki, J. M. Heinz, K. N. Stevens, and A. S. House, "Reduction of speech spectra by analysis-by-synthesis techniques," J. Acoust. Soc. Amer., vol. 33, pp. 1725-1736, Dec. 1961.
[17] R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra. New York: Dover, 1959.
[18] J. L. Flanagan and R. M. Golden, "Phase vocoder," Bell Syst. Tech. J., vol. 45, pp. 1493-1509, Nov. 1966.
[19] B. Gold and C. M. Rader, "Systems for compressing the bandwidth of speech," IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 131-135, Sept. 1967; and "The channel vocoder," IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 148-160, Dec. 1967.
[20] T. Martin, "Acoustic recognition of a limited vocabulary in continuous speech," Ph.D. dissertation, Univ. Pennsylvania, Philadelphia, 1970. (Available from Univ. Microfilms, Ann Arbor, Mich.)
[21] P. Mermelstein, "Computer generated spectrogram displays for on-line speech research," IEEE Trans. Audio Electroacoust., vol. AU-19, pp. 44-47, Mar. 1971.
[22] A. M. Noll, "Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and a maximum likelihood estimate," in Computer Processing in Communications Proceedings, J. Fox, Ed. New York: Polytechnic Press, 1969.
[23] A. V. Oppenheim, "Speech spectrograms using the fast Fourier transform," IEEE Spectrum, vol. 7, pp. 57-62, Aug. 1970.
[24] R. W. Schafer and L. R. Rabiner, "Design of digital filter banks for speech analysis," Bell Syst. Tech. J., vol. 50, no. 10, pp. 3097-3115, Dec. 1971.
[25] R. W. Schafer and L. R. Rabiner, "Design and simulation of a speech analysis-synthesis system based on short-time Fourier analysis," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 165-174, June 1973.
[26] M. R. Schroeder, "Vocoders: Analysis and synthesis of speech," Proc. IEEE, vol. 54, pp. 720-734, May 1966.
[27] M. R. Schroeder, "Period histogram and product spectrum: New methods for fundamental-frequency measurement," J. Acoust. Soc. Amer., vol. 43, no. 4, pp. 829-834, Apr. 1968.
[28] H. R. Silverman and N. R. Dixon, "A parametrically controlled spectral analysis system for speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-22, pp. 362-381, Oct. 1974.

Homomorphic Speech Analysis
[29] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Amer., vol. 41, pp. 293-309, Feb. 1967.
[30] A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 221-226, June 1968.
[31] A. V. Oppenheim, R. W. Schafer, and T. G. Stockham, Jr., "Nonlinear filtering of multiplied and convolved signals," Proc. IEEE, vol. 56, pp. 1264-1291, Aug. 1968.
[32] A. V. Oppenheim, "A speech analysis-synthesis system based on homomorphic filtering," J. Acoust. Soc. Amer., vol. 45, pp. 458-465, Feb. 1969.
[33] J. Olive, "Automatic formant tracking in a Newton-Raphson technique," J. Acoust. Soc. Amer., vol. 50, pt. 2, pp. 661-670, Aug. 1971.
[34] R. W. Schafer and L. R. Rabiner, "System for automatic formant analysis of voiced speech," J. Acoust. Soc. Amer., vol. 47, no. 2, pp. 634-648, Feb. 1970.

Linear Prediction Analysis
[35] B. S. Atal and M. R. Schroeder, "Adaptive predictive coding of speech signals," Bell Syst. Tech. J., vol. 49, 1970.
[36] B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Amer., vol. 50, pt. 2, pp. 637-655, Aug. 1971.
[37] D. K. Faddeev and V. N. Faddeeva, Computational Methods of Linear Algebra. San Francisco, Calif.: Freeman, 1963.
[38] F. Itakura and S. Saito, "An analysis-synthesis telephony system based on maximum likelihood method," Electronics Commun. Japan, vol. 53-A, pp. 36-43, 1970.
[39] J. I. Makhoul and J. J. Wolf, "Linear prediction and the spectral analysis of speech," Bolt Beranek and Newman Inc., Boston, Mass., BBN Rep. 2304, Aug. 31, 1972.
[40] J. N. Maksym, "Real-time pitch extraction by adaptive prediction of the speech waveform," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 149-153, June 1973.
[41] J. D. Markel, "Digital inverse filtering: A new tool for formant trajectory estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 129-137, June 1972.
[42] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, Dec. 1972.
[43] J. D. Markel, A. H. Gray, Jr., and H. Wakita, "Linear prediction of speech: theory and practice," Speech Communications Res. Lab., Santa Barbara, Calif., SCRL Monograph 10, Sept. 1973.
[44] J. H. Wilkinson, Rounding Errors in Algebraic Processes. Englewood Cliffs, N.J.: Prentice-Hall, 1963.

Digital Signal Processing
[45] B. Gold and C. M. Rader, Digital Processing of Signals. New York: McGraw-Hill, 1969.
[46] H. D. Helms, "Fast Fourier transform method of computing difference equations and simulating filters," IEEE Trans. Audio Electroacoust., vol. AU-15, no. 2, pp. 85-90, June 1967.
[47] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, N.J.: Prentice-Hall, 1975.
[48] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, N.J.: Prentice-Hall, 1975.
[49] T. G. Stockham, Jr., "High speed convolution and correlation," AFIPS Proc., pp. 229-233, 1966.
