Estimating Speech Parameters October, 2006 Oregon Health & Science University


Estimating Speech Parameters

October, 2006
Oregon Health & Science University

OGI School of Science & Engineering

John-Paul Hosom


Estimating and Detecting Speech Parameters

Topics for this lecture:

1. Computing Energy
2. Linear Predictive Coding (LPC) (and Estimating Formants)
3. Estimating Pitch
4. Detecting Glottalization
5. Detecting Bursts

• Except for energy, these parameters cannot be computed with 100% accuracy. Therefore, they are not often used in automatic speech recognition systems.

• However, reliable estimation/detection of these parameters can be useful for voice transformation, speech synthesis, and speech analysis.

• Pitch estimation methods are quite numerous. Four basic methods and one new method will be presented here. Methods for detecting glottalization and bursts are less prevalent. A few methods for each will be presented here.


Energy

“Energy” or “Intensity”:

Intensity is sound energy transmitted per second (power) through a unit area in a sound field. [Moore p. 9]

Intensity is proportional to the square of the pressure variation. [Moore p. 9]

normalized energy = $\frac{\sum_{n=t}^{t+N-1} x_n^2}{N}$ = intensity

x_n = signal amplitude x at time sample n
N = number of time samples

Energy

“Energy” or “Intensity”: the human auditory system is better suited to relative scales:

energy (bels) = $\log_{10}\left(\frac{I_1}{I_0}\right)$

energy (decibels, dB) = $10 \log_{10}\left(\frac{I_1}{I_0}\right)$

I0 is a reference intensity. If the signal becomes twice as powerful (I1/I0 = 2), then the energy level is 3 dB (3.0103 dB to be more precise).

Typical theoretical value for I0 is 20 µPa. (20 µPa is close to the average human absolute threshold for a 1000-Hz sinusoid.)

Typical practical value for I0 is 1.0
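As a rough illustration of the energy and decibel definitions, the sketch below computes the normalized energy of one frame and expresses it in dB relative to I0 = 1.0. The function name `frame_energy_db`, the small floor that guards against log of zero, and the test tone are assumptions made for this example, not part of the lecture.

```python
import math

def frame_energy_db(x, start, n_samples, i0=1.0):
    """Normalized energy of x[start:start+n_samples], in dB relative to i0."""
    frame = x[start:start + n_samples]
    energy = sum(v * v for v in frame) / len(frame)
    # The floor avoids log10(0) for an all-zero frame (an added safeguard).
    return 10.0 * math.log10(max(energy, 1e-12) / i0)

# A 100-Hz sine sampled at 8 kHz (160 samples = two full periods).
fs = 8000
sine = [math.sin(2 * math.pi * 100 * n / fs) for n in range(160)]
# Scaling the amplitude by sqrt(2) doubles the power.
louder = [math.sqrt(2.0) * v for v in sine]
gain_db = frame_energy_db(louder, 0, 160) - frame_energy_db(sine, 0, 160)
```

Here `gain_db` comes out at about 3.01 dB, matching the "twice as powerful" case above.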


Energy

What is a good value of N? Depends on information of interest:

N=1 msec

N=5 msec

N=20 msec

N=80 msec


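LPC: Background: Autocorrelation

Autocorrelation: measure of periodicity in signal

$\phi(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)$

$R_n(k) = \sum_{m=0}^{N-1-k} y_n(m)\, y_n(m+k), \quad 0 \le k \le K$

[Figure: waveform, amplitude vs. time]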
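The short-time autocorrelation over a finite frame can be sketched directly from its definition. The function name `short_time_autocorr` and the toy input are assumptions for illustration; the direct double loop is chosen for readability, not speed.

```python
def short_time_autocorr(y, max_lag):
    """R(k) = sum_{m=0}^{N-1-k} y[m] * y[m+k] for k = 0..max_lag."""
    n = len(y)
    return [sum(y[m] * y[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]

# A toy signal with a period of 4 samples: the autocorrelation has a
# positive peak at lag 4 and a negative peak at lag 2 (half a period).
r = short_time_autocorr([1, 0, -1, 0, 1, 0, -1, 0], 4)
```

Because fewer products are summed at larger lags, the peak at one full period (lag 4) is smaller than R(0) even for a perfectly periodic frame.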


LPC: Model

Linear Predictive Coding (LPC) provides
• low-dimension representation of speech signal at one frame
• representation of spectral envelope, not harmonics
• “analytically tractable” method
• some ability to identify formants

LPC models speech as approximate linear combination of previous p samples:

$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p)$

where a1, a2, … ap are constant for each frame of speech.

We can make the approximation exact by including a “difference” or “residual” term, e(n) = G u(n), which is the excitation of the signal if the LPC coefficients are a filter:

$s(n) = \sum_{k=1}^{p} a_k s(n-k) + e(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n)$

where u(n) is the excitation of the filter and G is a gain term.


LPC: Model

If we define the error over some range of values M1 to M2 as:

$E_n = \sum_{m=M_1}^{M_2} e_n^2(m) = \sum_{m=M_1}^{M_2} \left( s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right)^2$

Then we can find ak by setting ∂En/∂ak = 0 for k = 1,2,…p, obtaining p equations and p unknowns. After some derivation, the p equations can be related to the autocorrelation coefficients Rn(i):

$\sum_{k=1}^{p} a_k R_n(|i-k|) = R_n(i), \quad 1 \le i \le p$

and an exact solution to this system of linear equations can be obtained using an iterative process called “Durbin’s Solution”.
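A minimal sketch of the Levinson-Durbin recursion, which solves the autocorrelation equations above one order at a time, is given below. The function name, the return convention (index 0 of the coefficient list unused), and the toy autocorrelation values are assumptions for this example; a frame with R(0) = 0 is not handled.

```python
def levinson_durbin(r, order):
    """Solve sum_k a[k] * R(|i-k|) = R(i) for i = 1..order.

    r: autocorrelation values R(0)..R(order).
    Returns (a, err): a[1..order] are the predictor coefficients
    (a[0] is unused and left at 0.0), err is the prediction-error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        prev = a[:]
        a[i] = k
        # Update the lower-order coefficients.
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a, err

# Autocorrelation of a first-order process with R(k) = 0.5**k:
# the order-2 solution should be a1 = 0.5, a2 = 0.
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], 2)
```

The recursion needs on the order of p squared operations, compared with p cubed for a general linear solver, because the autocorrelation matrix is Toeplitz.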


LPC: Predictor Coefficients as Filter Coefficients

The error term e(n) can be written as

$e(n) = G u(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)$

Taking the z-transform of this equation (noting the time-shift property of the z-transform, $Z\{x(n-k)\} = z^{-k} X(z)$):

$E(z) = S(z) - \sum_{k=1}^{p} a_k z^{-k} S(z) = S(z) \left( 1 - \sum_{k=1}^{p} a_k z^{-k} \right) = S(z) A(z)$

where A(z) is a transfer function specified by the LPC coefficients. We can write the original signal in terms of the error signal E(z) and the transfer function A(z), and approximate E(z) by a constant gain term (since the error should have a flat spectrum):

$S(z) = \frac{E(z)}{A(z)} \approx \frac{G}{A(z)}$

The LPC coefficients are therefore an all-pole (IIR) filter that models the spectral shape (spectral envelope) of the input speech (formants and spectral tilt due to glottal source).
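The relationship between the signal, the residual, and the filter can be sketched as a pair of functions: applying A(z) to the signal gives the residual (inverse filtering), and applying 1/A(z) to the residual gives the signal back. The function names and the assumption that samples before n = 0 are zero are choices made for this example.

```python
def lpc_residual(s, a):
    """Inverse filter A(z): e(n) = s(n) - sum_{k=1}^{p} a_k * s(n-k).

    a holds a_1..a_p (a[0] is a_1); samples before n = 0 are taken as zero.
    """
    p = len(a)
    return [s[n] - sum(a[k - 1] * s[n - k]
                       for k in range(1, p + 1) if n - k >= 0)
            for n in range(len(s))]

def lpc_synthesize(e, a):
    """All-pole filter 1/A(z): s(n) = e(n) + sum_{k=1}^{p} a_k * s(n-k)."""
    p = len(a)
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a[k - 1] * s[n - k]
                            for k in range(1, p + 1) if n - k >= 0))
    return s

# An impulse-train excitation passed through a one-pole filter and back.
excitation = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
signal = lpc_synthesize(excitation, [0.5])
recovered = lpc_residual(signal, [0.5])
```

With the exact coefficients the excitation is recovered exactly; with coefficients estimated from real speech the residual is only approximately spectrally flat.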


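LPC: Spectral Representation

We can compute spectral envelope magnitude from LPC parameters by evaluating the transfer function S(z) for z = e^{jω}:

$S(e^{j\omega}) = \frac{G}{A(e^{j\omega})} = \frac{G}{1 - \sum_{k=1}^{p} a_k e^{-j\omega k}}$

Because $e^{-j\omega} = \cos(\omega) - j \sin(\omega)$, the log power spectrum at frequency index n of an N-point frequency grid (ω = 2πn/N) is:

$10 \log_{10} \left| S(e^{j\omega}) \right|^2 = 10 \log_{10} \frac{G^2}{\mathrm{Re}\{A\}^2 + \mathrm{Im}\{A\}^2}$

$\mathrm{Re}\{A\} = 1 - \sum_{k=1}^{p} a_k \cos\left(\frac{2\pi n k}{N}\right), \qquad \mathrm{Im}\{A\} = \sum_{k=1}^{p} a_k \sin\left(\frac{2\pi n k}{N}\right)$

Each formant (complex pole) in spectrum requires two LPC coefficients; each spectral slope factor (frequency=0 or Nyquist frequency) requires one LPC coefficient.

For 8 kHz speech, 4 formants ⇒ LPC order of 9 or 10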
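The spectral envelope can be evaluated numerically from the predictor coefficients using the real and imaginary parts of A. In this sketch the function name, the choice to return only indices 0 through N/2 (0 Hz to the Nyquist frequency), and the one-pole example are assumptions; a pole exactly on the unit circle would cause a division by zero and is not handled.

```python
import math

def lpc_log_power_spectrum(a, gain, n_points):
    """10*log10(G^2 / |A(e^{jw})|^2) at w = 2*pi*n/n_points, n = 0..n_points//2.

    a holds the predictor coefficients a_1..a_p (a[0] is a_1).
    """
    spectrum = []
    for n in range(n_points // 2 + 1):
        w = 2.0 * math.pi * n / n_points
        re = 1.0 - sum(ak * math.cos(w * k) for k, ak in enumerate(a, start=1))
        im = sum(ak * math.sin(w * k) for k, ak in enumerate(a, start=1))
        spectrum.append(10.0 * math.log10(gain ** 2 / (re * re + im * im)))
    return spectrum

# One real pole at z = 0.5 gives a low-pass spectral tilt.
envelope = lpc_log_power_spectrum([0.5], 1.0, 64)
```

For this example the envelope falls from about 6.02 dB at 0 Hz to about -3.52 dB at the Nyquist frequency, which is the kind of spectral slope a single real pole contributes.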


LPC: Estimating Formants

The transfer function

$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$

can be re-written as a product

$A(z) = \prod_{k=1}^{p} \left( 1 - z_k z^{-1} \right)$

where zk are the roots of the predictor polynomial.

Roots for (resonant) poles that aren’t at 0 Hz or the Nyquist frequency will occur in pairs, symmetric around the real (x) axis. If we solve for the roots zk, we can determine the frequencies and bandwidths of the poles.

[Figure: unit circle in the complex plane, real(z) vs. imaginary(z), showing a conjugate pole pair at radius r and angles θ and -θ]


LPC: Estimating Formants

We can express these complex roots (or poles of the filter) in terms of angle θ and radius r relative to the unit circle by converting from Cartesian coordinates in the complex plane to polar coordinates (r, θ).

Angle corresponds to frequency, and radius corresponds to bandwidth. So we can determine the pole (or resonant) frequencies and bandwidths (converting to Hz) as:

$\text{frequency} = \frac{F_s}{2\pi}\, \theta = \frac{F_s}{2\pi}\, \mathrm{arctan2}\left( \mathrm{Im}\{z_k\}, \mathrm{Re}\{z_k\} \right)$

$\text{bw} = -\frac{F_s}{\pi} \log(r) = -\frac{F_s}{\pi} \log(\mathrm{abs}(z_k)) = -\frac{F_s}{\pi} \log \sqrt{\mathrm{Re}\{z_k\}^2 + \mathrm{Im}\{z_k\}^2}$

Formants are typically the resonances with the smallest bandwidths.
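The conversion from a pole location to a frequency and bandwidth in Hz can be sketched as follows. The root finding itself is not shown (a polynomial root finder such as `numpy.roots` could supply the roots); the function names and the example pole are assumptions for illustration.

```python
import cmath
import math

def pole_to_formant(z, fs):
    """Convert one complex pole to (frequency_hz, bandwidth_hz)."""
    theta = math.atan2(z.imag, z.real)   # angle corresponds to frequency
    r = abs(z)                           # radius corresponds to bandwidth
    freq = theta * fs / (2.0 * math.pi)
    bw = -math.log(r) * fs / math.pi
    return freq, bw

def formant_candidates(roots, fs):
    """Keep one pole of each conjugate pair (imag > 0), sorted by frequency."""
    return sorted(pole_to_formant(z, fs) for z in roots if z.imag > 0)

# A conjugate pole pair at radius 0.95 and angle pi/4, plus one real pole.
fs = 8000
roots = [cmath.rect(0.95, math.pi / 4), cmath.rect(0.95, -math.pi / 4),
         complex(0.9, 0.0)]
candidates = formant_candidates(roots, fs)
```

The pole pair maps to a resonance at 1000 Hz with a bandwidth of roughly 131 Hz; the real pole contributes only spectral slope and is dropped. A further step, not shown, would keep the candidates with the smallest bandwidths as formants.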


Pitch Estimation: Autocorrelation Method

Autocorrelation of speech signals: (from Rabiner & Schafer, p. 143)



Autocorrelation (AC) can be used to determine F0, by finding the local maximum in the AC signal that is (a) within range of expected F0 values and (b) above a threshold:

However, the local maximum does not always correspond with the correct T0 value (F0 = 1/T0).

30 samples = 8000/30 ≈ 267 Hz = F0max

100 samples = 8000/100 = 80 Hz = F0min
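A minimal sketch of autocorrelation-based F0 estimation under the lag limits above (30 to 100 samples at 8 kHz) is shown below. The normalization by R(0), the threshold value of 0.3, and the function name are assumptions for this example; the lecture does not specify them.

```python
import math

def autocorr_f0(frame, fs, lag_min=30, lag_max=100, threshold=0.3):
    """Return F0 in Hz from the largest normalized AC value within the lag
    range, or None if no value exceeds the threshold (treated as unvoiced)."""
    n = len(frame)
    r0 = sum(v * v for v in frame)
    if r0 == 0.0:
        return None
    best_lag, best_val = None, threshold
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        val = sum(frame[m] * frame[m + lag] for m in range(n - lag)) / r0
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag if best_lag is not None else None

# A 125-Hz sine at 8 kHz has a period of 64 samples.
fs = 8000
tone = [math.sin(2 * math.pi * 125 * n / fs) for n in range(400)]
f0 = autocorr_f0(tone, fs)
```

For this clean tone the estimate lands at or very near 125 Hz. Real speech is less cooperative, which is the subject of the problems listed next.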



Problems:

1. A high F0 (e.g. 180 Hz) will have two peaks within range, e.g. 180 Hz and 90 Hz. This may cause “pitch halving” error

2. Formants will influence the strength of peaks. For example, if F0 is 120 Hz, but the strongest energy in the waveform is due to the first formant at 240 Hz (e.g. the vowel /i:/), the highest local maximum in the AC may be at 240 Hz (“pitch doubling” error).

Want an F0 estimation method that is not sensitive to formants.

[Figure: spectrum with harmonics at 120, 240, 360, 480, and 600 Hz]

Pitch Estimation: SIFT Method

SIFT Method:
• LPC analysis (order 5) of low-pass-filtered waveform (800 Hz)
• inverse filter: obtain signal without formants
• autocorrelation: measure periodicity
• decision based on height of autocorrelation peak

[Figure: time-domain and frequency-domain plots at each processing stage; bandwidths labeled 4000 Hz and 800 Hz]

Pitch Estimation: SIFT Method

Problems with SIFT:

1. For some sounds (e.g. nasals, vowel-to-silence transitions) the signal is dominated by a single harmonic (close to sine wave). LPC analysis and inverse filtering can remove the formant information and the glottal-source information, leaving only white noise.

2. Still have the problem that a high F0 (e.g. 180 Hz) will have two AC peaks within range, e.g. 180 Hz and 90 Hz. This may cause “pitch halving” error.

Rather than use autocorrelation, we can measure F0 from information in the spectrum, by identifying harmonics.

Pitch Estimation: Harmonic Sieve Method

F0 estimation in the spectral domain often uses the “harmonic sieve” method, that relies on the fact that F0 harmonics must occur at multiples of the fundamental frequency. If we sum the power-spectrum energy values at multiples of a given frequency, then the frequency value that yields the largest energy sum should be F0.

test F0 = 100 Hz
Normalized sum of energies at 100, 200, …, 4000 Hz is maximum

test F0 = 110 Hz
Normalized sum of energies at 110, 220, …, 3960 Hz is relatively small
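The sieve idea can be sketched by evaluating the power spectrum directly at each multiple of a candidate F0 and averaging. The function names, the direct (non-FFT) evaluation of the spectrum, the candidate grid, and the synthetic test frame are all assumptions for this example.

```python
import math

def power_at(frame, freq, fs):
    """Power of the frame's Fourier transform evaluated at freq (Hz)."""
    w = 2.0 * math.pi * freq / fs
    re = sum(v * math.cos(w * n) for n, v in enumerate(frame))
    im = sum(v * math.sin(w * n) for n, v in enumerate(frame))
    return re * re + im * im

def harmonic_sieve_f0(frame, fs, candidates):
    """Return the candidate F0 with the largest normalized harmonic energy."""
    best_f0, best_score = None, -1.0
    for f0 in candidates:
        # All multiples of f0 up to the Nyquist frequency.
        harmonics = [h * f0 for h in range(1, int((fs / 2) // f0) + 1)]
        score = sum(power_at(frame, f, fs) for f in harmonics) / len(harmonics)
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0

# 40 msec of a signal with five equal harmonics of 100 Hz, sampled at 8 kHz.
fs = 8000
frame = [sum(math.sin(2 * math.pi * 100 * h * n / fs) for h in range(1, 6))
         for n in range(320)]
estimate = harmonic_sieve_f0(frame, fs, list(range(80, 201, 10)))
```

In this example the 100-Hz candidate wins, but the 200-Hz candidate is not far behind (2 of its 20 harmonics are present versus 5 of 40), which hints at how easily the normalized sum can tip toward a doubled F0.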


Pitch Estimation: Harmonic Sieve Method

Problems with harmonic sieve method:

1. In order to resolve harmonics, need high frequency resolution in power spectrum. This requires a large number of waveform samples at each frame (e.g. 256 samples (32 msec) or more). F0 may change quickly within 30 or 40 msec, and these quick changes in F0 can not be reliably identified.

2. This method is susceptible to pitch-doubling errors, because (a) normalization by number of harmonics will reduce energy in low-F0 case to be approximately equal to energy of doubled F0, and (b) if normalization is not performed, there is a bias toward lower F0 values, which include more harmonics.

Pitch Estimation: Cepstral Method

cepstrum: treat spectrum as signal subject to frequency analysis…
1. Compute log power spectrum
2. Compute FFT of log power spectrum

Cepstral method: F0 information is encoded in higher cepstral coefficients

[Figure: waveform (amplitude vs. time), log power spectrum (energy in dB vs. frequency), and cepstrum (amplitude vs. quefrency); the cepstral peak is marked “this peak indicates F0”]


F0 estimation in the cepstral domain:
(1) compute the cepstrum
(2) determine if there are values within normal F0 ranges that are above some pre-defined threshold
(3) if there are such values, find the maximum value
(4) F0 is computed from the inverse of the index of this maximum

[Figure: cepstrum with the F0max and F0min search limits marked]
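The four steps can be sketched with a naive DFT so that the example needs only the standard library. The threshold value of 0.1, the default F0 range, the small constant added before the logarithm, and the function names are assumptions; a practical implementation would use an FFT and a window function.

```python
import cmath
import math

def real_cepstrum(frame):
    """Inverse DFT of the log power spectrum (naive O(N^2) DFT for clarity)."""
    n = len(frame)
    log_power = []
    for k in range(n):
        xk = sum(v * cmath.exp(-2j * math.pi * k * m / n)
                 for m, v in enumerate(frame))
        # The small constant avoids log(0) at empty spectral bins.
        log_power.append(math.log(abs(xk) ** 2 + 1e-10))
    return [sum(lp * cmath.exp(2j * math.pi * k * q / n)
                for k, lp in enumerate(log_power)).real / n
            for q in range(n)]

def cepstral_f0(frame, fs, f0_min=80.0, f0_max=400.0, threshold=0.1):
    """Steps (1)-(4): returns F0 in Hz, or None if no cepstral value in the
    expected range reaches the threshold."""
    cep = real_cepstrum(frame)                              # step (1)
    q_min = int(fs / f0_max)
    q_max = min(int(fs / f0_min), len(frame) // 2)
    peak_q = max(range(q_min, q_max + 1), key=lambda q: cep[q])  # step (3)
    if cep[peak_q] < threshold:                             # step (2)
        return None
    return fs / peak_q                                      # step (4)
```

As a usage example, an idealized impulse train with a period of 64 samples at 8 kHz, `[1.0 if n % 64 == 0 else 0.0 for n in range(256)]`, yields 125.0 Hz, while an all-zero frame yields `None`.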



Problems with the cepstral method:

1. Often there are only a few harmonics (e.g. nasals, /w/) that identify F0. In this case, the height of the cepstral peak will be very low, leading to peak-location (F0) estimation errors

2. Humans can identify pitch from a small number of harmonics, and some of the harmonics may be “missing”. Cepstral method is not robust in these cases

[Figure: three harmonic spectra, each on a frequency axis from 200 Hz to 2.6 kHz in 200-Hz steps:
(1) all harmonics present: perceived pitch = 200 Hz
(2) missing fundamental: perceived pitch = ??? Hz
(3) most harmonics missing: perceived pitch = ??? Hz]

Pitch Estimation: Dynamic Programming

Many errors in F0 estimation occur briefly (especially at the beginning and end of voiced sounds) and have a large difference from (a) the correct F0 value and (b) neighboring F0 values:

One method of obtaining a smooth F0 contour is to apply a Viterbi search to a number of F0 estimates at each frame. In this case, the transition probability is constrained so that large changes in F0 from one frame to the next are prohibited.

[Figure: AC (T0) values at frames t = 1 2 3 4 5; the transition from the previous frame is limited to a range of neighboring T0 values; annotation: “maximum value at t=13 is here”]
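The constrained search can be sketched as a small dynamic-programming routine over per-frame scores (for example, AC values indexed by T0). The function name, the additive score, and the hard jump limit `max_jump` are assumptions for this example rather than the lecture's exact formulation.

```python
def viterbi_f0_path(scores, max_jump=2):
    """scores[t][i] is the score of candidate index i (e.g. a T0 value) at
    frame t. Returns the index path with the largest summed score, subject
    to |i_t - i_(t-1)| <= max_jump between consecutive frames."""
    n_states = len(scores[0])
    best = list(scores[0])
    backptr = []
    for frame in scores[1:]:
        new_best, ptr = [], []
        for i in range(n_states):
            lo, hi = max(0, i - max_jump), min(n_states - 1, i + max_jump)
            # Best allowed predecessor for state i.
            j = max(range(lo, hi + 1), key=lambda s: best[s])
            new_best.append(best[j] + frame[i])
            ptr.append(j)
        best = new_best
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Frame 2 has a spurious maximum at index 5; the smooth path ignores it.
scores = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.6, 0.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
path = viterbi_f0_path(scores, max_jump=1)
```

Picking the per-frame maximum would give indices 1, 5, 1; the constrained search returns 1, 1, 1 because the jump to index 5 is not allowed.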


Pitch Estimation: Band-Pass Method (new)

If the expected F0 range is known in advance, constraints (e.g. limiting F0max and F0min) can be applied to reduce these errors. For F0 estimation for both children and adults, the expected F0 range can be too large (e.g. 50 to 400 Hz) for such constraints to be effective.

The “band-pass” algorithm is based on an interpretation of Moore’s summarization of pitch identification in humans. In the band-pass method, information from 32 band-pass filters is combined at every frame, and then a Viterbi search provides an F0 contour estimate.

This method does not require constraints on the range of F0, as one set of parameters can be used for adult and children’s speech. It does not utilize autocorrelation, LPC, harmonic sieve, or cepstrum, and so it does not make assumptions about the correlation between pitch periods, nature of the glottal source, or existence of a large number of harmonics. It is robust to different formant frequencies.


STEP (1): FILTERING

The speech signal is passed through 32 9-tap IIR filters, with (narrow) filter bandwidths determined from the Equivalent Rectangular Bandwidth (ERB) scale:

ERB(f) = 0.108f + 24.7

where f is the center frequency, in Hz. As a result, in most cases, no more than one harmonic occupies one filter’s frequency range.

The first filter is centered at 100 Hz, and each subsequent filter has a center frequency approximately one-half bandwidth higher than the previous filter’s center frequency.

These filter outputs are approximately sine functions, with frequency equal to the harmonic within (or closest to) the filter’s bandwidth. In cases where multiple harmonics are within one band, the outputs are no longer simple sine functions, and are discarded.
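The filter layout described above can be sketched by stepping the center frequency up by half an ERB at a time. The function names are assumptions, and because the slides say the spacing is only approximately one-half bandwidth, the band edges produced here drift away from the published filter table at higher indices.

```python
def erb(f):
    """Equivalent Rectangular Bandwidth in Hz, as given above."""
    return 0.108 * f + 24.7

def filter_bank(n_filters=32, first_center=100.0):
    """Return a list of (low_hz, high_hz, bandwidth_hz), one per filter."""
    bands = []
    center = first_center
    for _ in range(n_filters):
        bw = erb(center)
        bands.append((center - bw / 2.0, center + bw / 2.0, bw))
        # The next filter sits half a bandwidth higher (an approximation).
        center += bw / 2.0
    return bands

bands = filter_bank()
```

The first band comes out at 82.25 to 117.75 Hz with a 35.5 Hz bandwidth. Since harmonics of a 100-Hz voice are 100 Hz apart and the low-frequency bands are under 50 Hz wide, at most one harmonic falls in each of those bands.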


Pitch Estimation: Band-Pass Method (new)
STEP (1): FILTERING

Filters (frequencies in Hz):

From   To     Bandwidth   Idx
82     118    36          1
99     137    38          2
117    156    39          3
135    177    42          4
155    199    44          5
177    223    46          6
197    245    48          7
221    272    51          8
242    296    54          9
…      …      …
902    1031   129         26
957    1092   135         27
1024   1167   143         28
1085   1235   150         29
1159   1318   159         30
1226   1393   167         31
1309   1484   175         32


STEP (2): FIND PERIODICITY

The period-to-period maxima of the (sine-wave) filter outputs are located in order to identify the periodicity of the signal at each frame. If the identified periodicity is beyond the frequency limits of the filter (noise or multiple harmonics), the periodicity is set to zero.

[Figure: filter outputs in the bands 603-697 Hz, 643-742 Hz, 692-797 Hz, 737-847 Hz, 792-908 Hz, and 841-963 Hz. Find periodicity in each band at this frame by simple location of local maxima.]


STEP (3): CREATE HISTOGRAM

For each frame (e.g. 1 msec), a histogram is computed:

(A) Initialize the histogram, with one bin for each periodicity value (154 values, representing 50 – 1334 Hz), with all bins set to 0.

(B) Determine the filter output with greatest energy at this frame, Me.

(C) For each filter with energy greater than Me minus a 12 dB margin and periodicity > 0, the histogram is increased by 1.0 near this periodicity value p, within the range p-5 to p+5. This increase is repeated for all integer multiples of p.

(D) The maximum histogram value, Mh, and bin containing this maximum, b, are determined. All values are normalized by Mh. All values at bins greater than 2b are decreased slightly (by a factor of 0.95) to avoid F0-halving errors.

Frequency is determined from periodicity by Fs/p, where Fs is the sampling frequency, and p is a periodicity value.
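Steps (A) through (D) can be sketched for a single frame as follows. Several details are assumptions made for this example: bins are indexed directly by period in samples, the period limits default to 6 and 160 samples, the p-5 to p+5 range is exposed as a `spread` parameter, and the input is a list of (energy in dB, period) pairs with period 0 meaning no periodicity was found.

```python
def periodicity_histogram(filters, min_period=6, max_period=160,
                          energy_range_db=12.0, spread=5,
                          halving_factor=0.95):
    """filters: list of (energy_db, period_in_samples) pairs for one frame.
    Returns (hist, b): the normalized histogram (index = period in samples)
    and the bin b holding the maximum, or (hist, None) if nothing counted."""
    hist = [0.0] * (max_period + 1)                 # step (A)
    max_energy = max(e for e, _ in filters)         # step (B)
    for energy, period in filters:                  # step (C)
        if energy <= max_energy - energy_range_db or period <= 0:
            continue
        multiple = period
        while multiple <= max_period:
            lo = max(min_period, multiple - spread)
            hi = min(max_period, multiple + spread)
            for b in range(lo, hi + 1):
                hist[b] += 1.0
            multiple += period
    peak = max(hist)                                # step (D)
    if peak == 0.0:
        return hist, None
    b = hist.index(peak)
    hist = [h / peak for h in hist]
    for i in range(2 * b + 1, max_period + 1):
        hist[i] *= halving_factor
    return hist, b

# Bands reporting periods of 9, 9, 18 and 36 samples; one weak band ignored.
frame_filters = [(60.0, 9), (58.0, 9), (55.0, 18), (50.0, 36), (30.0, 20)]
hist, b = periodicity_histogram(frame_filters, max_period=60, spread=0)
```

With `spread=0` the counts at 9, 18, 27 and 36 samples are 2, 3, 2 and 4, so the maximum falls at 36 samples, the only period that every reported periodicity divides evenly.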



[Figure: example of histogram construction at one frame. Left: the 32 filter bands, from 82-118 Hz up to 1309-1484 Hz, on a frequency axis from 0 Hz to 1500 Hz, with the periodicity found in each band (889, 889, 889, 800, 727, 667, 667, 444, 400, 444, 242, 228, 235, and 222 Hz). Right: histogram count vs. periodicity in samples. 889 Hz = periodicity of 9 samples, which receives 3 counts (9x1 = 889 Hz) and also increments its integer multiples: 9x2 = 444 Hz, 9x3 = 296 Hz, 9x4 = 222 Hz, and 9x8 = 111 Hz. A value of 216 Hz is also marked.]

STEP (4): VITERBI SEARCH

A Viterbi search is performed on the sequence of histograms at each time frame. Transitions are constrained between frames t and t+1 to change by no more than 2 periodicity values (e.g. 2.5 Hz/msec when F0=100 Hz, 10 Hz/msec when F0=200 Hz). The result is the F0 contour with the largest global histogram value.




One issue is the possibility of pitch-halving errors, because lower F0 values can have an equally-large histogram count as the correct F0. This is avoided by (a) finding the first large peak, and (b) multiplying all T0 values above this peak by 0.95

Another issue is that this method will find an F0 value for all frames of speech, even frames that are unvoiced. (There is no threshold as in the cepstral method to determine whether a frame is voiced or unvoiced.)


Pitch Estimation: Band-Pass Method (new)

Two corpora were used in evaluation: MWM and LSR.

Average F0 for the MWM corpus was 118 Hz (range 50–250 Hz); average F0 for the LSR corpus was 250 Hz (range 96–402 Hz).

Evaluation was performed by (a) computing the average absolute difference between correct F0 and measured F0, over all frames at which an F0 value was obtained, (b) computing average percent error (absolute difference / correct F0), over all such frames.
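The two evaluation measures can be sketched as below. The function name and the convention that a frame with no measured F0 is represented by `None` are assumptions for this example.

```python
def f0_errors(reference, measured):
    """Return (average absolute difference in Hz, average percent error),
    computed over frames at which an F0 value was obtained."""
    pairs = [(r, m) for r, m in zip(reference, measured) if m is not None]
    abs_diffs = [abs(m - r) for r, m in pairs]
    avg_abs = sum(abs_diffs) / len(pairs)
    avg_pct = 100.0 * sum(d / r for d, (r, _) in zip(abs_diffs, pairs)) / len(pairs)
    return avg_abs, avg_pct

# Third frame has no measured F0 and is excluded from both averages.
avg_abs, avg_pct = f0_errors([100.0, 200.0, 120.0], [102.0, 190.0, None])
```

For this toy input the average absolute difference is 6 Hz and the average percent error is 3.5%. The percent measure weights a fixed Hz error more heavily at low F0, which matters when comparing adult and child speech.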

In addition, results on the LSR corpus were manually compared with Kay Elemetrics’ CSL F0 estimation when the difference between the results in either vowel of a word exceeded 30 Hz.

Corpus Name   Speakers                 Material                      Recording Conditions                      Correct F0 Measurement
MWM           1 adult male             450 sentences                 anechoic booth, high quality microphone   laryngograph-based, determined every 5 msec
LSR           35 children, ages 3-14   8 isolated 2-syllable words   Tascam cassette recorder                  manual estimation at vowel centers


Results:

Corpus   Avg. Absolute Difference   Percent Error   Standard Deviation
MWM      3.99 Hz                    3.86%           12.19 Hz
LSR      6.20 Hz                    2.58%           13.59 Hz

Method            Avg. Absolute Diff.   Standard Deviation
Harris & Nelson   9.88 Hz               30.08 Hz
SIFT              9.01 Hz               29.06 Hz
Modified SIFT     5.71 Hz               21.69 Hz
Band-Pass         3.99 Hz               12.19 Hz

For the comparison with CSL on the LSR corpus, out of 33 words with at least one vowel having an F0 difference greater than 30 Hz, CSL had 30 errors greater than 30 Hz, while the proposed method had 8 errors greater than 30 Hz.

Detecting Glottalization

What is glottalization? (Also called “creaky voice”)

Here, we define it as irregular or low-frequency (20-70 Hz) vibration of vocal folds during voicing. (The term “glottalization” is also used when describing certain articulations of stop consonants.)

Glottalization can occur quite often in some speakers, as a speaking style. It may occur frequently at end of sentence, when signaling phoneme boundary between two similar sounds (e.g. “E.E.”), or when signaling word boundary between potentially ambiguous words (e.g. “heavy oak” vs. “heavy yoke”). It may also occur more frequently when a speaker has been talking a lot.

Very little published work on detecting glottalization. However, glottalization makes F0 estimation and synthesis difficult, and it may be a relevant factor in diagnosing speech disorders.


Three techniques: PtP Amplitude, Feature Classification, Autocorrelation

Peak-to-Peak Amplitude (Cole, 1988)

Compute peak-to-peak amplitude (difference between max and min amplitude) of signal using variable-length analysis window.

Assumption: glottalization has F0 significantly smaller than surrounding voiced sounds. Therefore, F0-related amplitude changes can be detected using analysis window length determined from 1.3 times the median F0.
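The peak-to-peak measure can be sketched as a sliding max-minus-min. How the window length follows from "1.3 times the median F0" is not spelled out in the slides; the sketch assumes it means 1.3 median pitch periods, and the function names are likewise assumptions.

```python
def ptp_window_length(median_f0, fs, factor=1.3):
    """Window length in samples, assuming `factor` median pitch periods."""
    return int(round(factor * fs / median_f0))

def peak_to_peak(signal, window_len):
    """Max minus min of the signal in a window centered on each sample."""
    half = window_len // 2
    out = []
    for n in range(len(signal)):
        seg = signal[max(0, n - half): n + half + 1]
        out.append(max(seg) - min(seg))
    return out

window = ptp_window_length(100.0, 8000)          # 104 samples
ptp = peak_to_peak([0, 1, -1, 0, 0, 0, 0], 3)    # [1, 2, 2, 1, 0, 0, 0]
```

Under this reading, a window slightly longer than a normal pitch period always contains a glottal pulse during regular voicing, so the PtP value stays high; when pulses are spaced much further apart, the window sometimes falls between them and the PtP value dips.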

[Figure: panels showing wave, spectrogram, phonemes, long-term F0, and PtP amplitude]



Feature Classification (Hosom, 2000)

Use neural-network classifier with (a) standard MFCC or PLP features, or (b) standard features plus relative change in energy using analysis window of 2(long-term F0).

Assumption: standard classifier can identify changes in energy and source characteristics using standard features, or standard features augmented with feature similar to PtP feature.

Results (insertion and deletion errors, within 20 msec):

Corpus              Baseline error rate (%)   Baseline + Rel. Energy error rate (%)
TIMIT               13.23                     12.08
Stories             17.43                     17.78
Portland Cellular   16.54                     16.89



Autocorrelation (Ishi, 2004)

Estimate glottal-source waveform by inverse-filtering LPC coefficients. Compute autocorrelation of glottal-source waveform.

Assumption: long delay between impulses in glottalized speech yields non-zero correlations in between glottal pulses

[Figure: normal speech vs. glottalized speech (from Ishi, 2004)]



A decision tree was applied to various parameters determined from the first two autocorrelation peaks (e.g. relative peak amplitude, relative peak position)

Decision tree yielded error rate of 21.6%. However, definition of “creaky” included abnormal F0 patterns as well as low-F0 patterns.


Detecting Bursts

What are bursts?

Increase in energy that is characteristic of stop consonants, after the closure, when buildup of pressure has been released

Not much prior work on detecting bursts. However, detecting bursts is critical for measuring voice-onset-time (VOT), which is the time from burst to onset of voicing.

VOT is important for phoneme identity (e.g. distinguishing /p/ from /b/), and may be important in detection of Parkinson’s Disease (PD), where control over VOT may be reduced, leading to reduced intelligibility.


Detecting Bursts

Four basic methods: Change in Energy, HMM, SVM Classifier, and Candidate Selection.

1. Change in Energy (Liu, 95): bursts characterized by closure (silence) then burst (high energy), so compute change in energy over entire utterance; if the change is above a threshold, mark as a burst

2. HMM (Niyogi, 99): use a phoneme-level HMM to identify all phonemes in an utterance. Beginning of each plosive is identified as a burst.

3. Support-Vector Machine (SVM) Classifier (Niyogi, 99; Keshet, 01): two SVMs implemented, for linear and non-linear classification, using as features log energy of entire spectrum, log energy of 3 to 8 kHz, and a spectral flatness measure

4. Candidate Selection (Hosom, 00): select “candidate bursts” based on change in relative energy; classify candidates using ANN with cepstral features.
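The first of these methods, change in energy, can be sketched in a few lines. This is a simplified illustration of the general idea only, not the algorithm from the cited paper; the 5-msec frame length, the 20 dB threshold, and the function name are assumptions.

```python
import math

def detect_bursts(signal, fs, frame_ms=5.0, threshold_db=20.0):
    """Return frame start times (in seconds) at which the frame log energy
    rises by more than threshold_db relative to the previous frame."""
    n = max(1, int(fs * frame_ms / 1000.0))
    energies = []
    for start in range(0, len(signal) - n + 1, n):
        e = sum(v * v for v in signal[start:start + n]) / n
        # The small constant avoids log10(0) during silence.
        energies.append(10.0 * math.log10(e + 1e-10))
    return [i * n / fs for i in range(1, len(energies))
            if energies[i] - energies[i - 1] > threshold_db]

# 50 msec of silence (closure) followed by 50 msec of high energy.
fs = 8000
utterance = [0.0] * 400 + [0.5 if n % 2 == 0 else -0.5 for n in range(400)]
bursts = detect_bursts(utterance, fs)
```

The single detected onset is at 0.05 s, where the closure ends. A fixed threshold like this also fires on any abrupt vowel onset, which is one motivation for the classifier-based methods in the list.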


Detecting Bursts

4. Candidate Selection Method (in detail):

Generate Candidates: Measure relative change in energy at eight Bark-scale frequency bands. Perform equal-loudness weighting of energy bands, so that perceptually-relevant bands have greater weight. Transform energy values to scale 0 to 1, representing “probability of burst in this band”. Combine probabilities using Bayes’ Rule.

Select Candidates: Using a fixed threshold determined from development data (.075), select time points above threshold for classification

Classification: For each candidate, compute cepstral features at that time point and surrounding time points. Use Artificial Neural Network (ANN) to classify features as “burst” or “non-burst”.


Detecting Bursts

4. Candidate Selection Method (illustration):


Detecting Bursts

Results, relative to number of burst and non-burst phonemes (not frames). Threshold of 20 msec, evaluated on TIMIT corpus.

CANDIDATE
