Estimating Speech Parameters October, 2006 Oregon Health & Science University OGI School of Science & Engineering John-Paul Hosom. Estimating and Detecting Speech Parameters. Topics for this lecture: Computing Energy Linear Predictive Coding (LPC) (and Estimating Formants) - PowerPoint PPT Presentation
1
Estimating Speech Parameters
October, 2006
Oregon Health & Science University
OGI School of Science & Engineering
John-Paul Hosom
2
Estimating and Detecting Speech Parameters
Topics for this lecture:
1. Computing Energy
2. Linear Predictive Coding (LPC) (and Estimating Formants)
3. Estimating Pitch
4. Detecting Glottalization
5. Detecting Bursts
• Except for energy, these parameters cannot be computed with 100% accuracy. Therefore, they are not often used in automatic speech recognition systems.
• However, reliable estimation/detection of these parameters can be useful for voice transformation, speech synthesis, and speech analysis.
• Pitch estimation methods are quite numerous. Four basic methods and one new method will be presented here. Methods for detecting glottalization and bursts are less prevalent. A few methods for each will be presented here.
3
Energy
“Energy” or “Intensity”:
intensity is sound energy transmitted per second (power) through a unit area in a sound field. [Moore p. 9]
intensity is proportional to the square of the pressure variation [Moore p. 9]
normalized energy = $\frac{1}{N}\sum_{n=t}^{t+N-1} x_n^2$ = intensity
x_n = signal amplitude x at time sample n
N = number of time samples
4
Energy
“Energy” or “Intensity”:
human auditory system better suited to relative scales:
energy (bels) = $\log_{10}(I_1 / I_0)$
energy (decibels, dB) = $10 \log_{10}(I_1 / I_0)$
I0 is a reference intensity… if the signal becomes twice as powerful (I1/I0 = 2), then the energy level is 3 dB (3.0103 dB to be more precise)
Typical theoretical value for I0 is 20 µPa. (20 µPa is close to the average human absolute threshold for a 1000-Hz sinusoid.)
Typical practical value for I0 is 1.0
5
Energy
What is a good value of N? Depends on information of interest:
N=1 msec
N=5 msec
N=20 msec
N=80 msec
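The energy formulas above can be sketched in a few lines of Python. This is a minimal illustration, assuming an 8 kHz sampling rate and the practical reference I0 = 1.0; the function name `frame_energy_db` and the 200 Hz test tone are hypothetical and not part of the lecture.

```python
import math

def frame_energy_db(x, start, n_samples, i0=1.0):
    """Normalized energy of x[start:start+n_samples], in dB relative to i0."""
    frame = x[start:start + n_samples]
    energy = sum(v * v for v in frame) / len(frame)
    # Floor the energy so an all-zero frame does not cause log(0).
    return 10.0 * math.log10(max(energy, 1e-12) / i0)

fs = 8000                      # assumed sampling rate (Hz)
n = int(0.020 * fs)            # N = 20 msec -> 160 samples
tone = [math.sin(2 * math.pi * 200 * t / fs) for t in range(fs)]
loud = [math.sqrt(2.0) * v for v in tone]   # same tone with twice the power

tone_db = frame_energy_db(tone, 0, n)
gain_db = frame_energy_db(loud, 0, n) - tone_db   # doubling power adds about 3 dB
```

A shorter window (e.g. N = 1 msec) follows individual pitch periods, while a longer one (e.g. N = 80 msec) smooths across phonemes, which is why the choice of N depends on the information of interest.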
6
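LPC: Background: Autocorrelation
Autocorrelation: measure of periodicity in signal
$$\phi(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)$$
[Figure: waveform; axes: time, amplitude]
$$R_n(k) = \sum_{m=0}^{N-1-k} y_n(m)\, y_n(m+k), \qquad 0 \le k \le K$$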
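As a rough sketch of the short-time autocorrelation R_n(k) on this slide, the following Python computes the sum over one frame for lags 0 to K. The function name `short_time_autocorr` and the synthetic sine frame (period 40 samples) are illustrative assumptions.

```python
import math

def short_time_autocorr(y, max_lag):
    """R(k) = sum_{m=0}^{N-1-k} y(m) * y(m+k) for k = 0 .. max_lag."""
    n = len(y)
    return [sum(y[m] * y[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]

# Synthetic frame: a sine with a period of 40 samples, 400 samples long.
frame = [math.sin(2 * math.pi * t / 40) for t in range(400)]
r = short_time_autocorr(frame, 60)

# A periodic signal produces a peak at a lag equal to its period.
peak_lag = max(range(20, 61), key=lambda k: r[k])
```

Because fewer products are summed at larger lags, the peaks of R(k) shrink as k grows even for a perfectly periodic frame.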
7
LPC: Model
Linear Predictive Coding (LPC) provides
• low-dimension representation of speech signal at one frame
• representation of spectral envelope, not harmonics
• “analytically tractable” method
• some ability to identify formants
LPC models speech as approximate linear combination of previous p samples:
$$s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \dots + a_p s(n-p)$$
where a1, a2, … ap are constant for each frame of speech.
We can make the approximation exact by including a “difference” or “residual” term, e(n) = G u(n), which is the excitation of the signal if the LPC coefficients are a filter:
$$s(n) = \sum_{k=1}^{p} a_k s(n-k) + e(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n)$$
where u(n) is the excitation of the filter and G is a gain term.
8
LPC: Model
If we define the error over some range of values M1 to M2 as:
$$E_n = \sum_{m=M_1}^{M_2} e_n^2(m) = \sum_{m=M_1}^{M_2} \left[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right]^2$$
Then we can find ak by setting ∂En/∂ak = 0 for k = 1,2,…p, obtaining p equations and p unknowns. After some derivation, the p equations can be related to the autocorrelation coefficients Rn(i):
$$\sum_{k=1}^{p} a_k R_n(|i-k|) = R_n(i), \qquad 1 \le i \le p$$
and an exact solution to this system of linear equations can be obtained using an iterative process called “Durbin’s Solution”
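The recursion commonly known as the Levinson-Durbin algorithm solves these p autocorrelation equations order by order. The Python below is a minimal sketch of that standard recursion, using the sign convention s(n) ≈ Σ a_k s(n−k) from the model slide; the function name `levinson_durbin` and the example autocorrelation values are assumptions for illustration, and the slides do not show the lecture's own implementation.

```python
def levinson_durbin(r, order):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..order, for the predictor a_k.

    r: autocorrelation values R(0) .. R(order).
    Returns (a, err): a[0] is a_1, ..., and err is the final prediction error.
    """
    a = [0.0] * (order + 1)          # a[1..order]; a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:], err

# Autocorrelation of an ideal 2nd-order process with a_1 = 0.5, a_2 = -0.3
# (values follow from the two normal equations with R(0) = 1).
r1 = 0.5 / 1.3
r_example = [1.0, r1, 0.5 * r1 - 0.3]
coeffs, final_err = levinson_durbin(r_example, 2)   # coeffs is close to [0.5, -0.3]
```

Each pass of the loop reuses the order i−1 solution, so the whole system is solved in O(p²) operations rather than by general matrix inversion.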
9
LPC: Predictor Coefficients as Filter Coefficients
The error term e(n) can be written as
$$e(n) = G u(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)$$
Taking the z-transform of this equation (noting the time-shift property of the z-transform, $Z\{x(n-k)\} = z^{-k} X(z)$):
$$E(z) = S(z) - \sum_{k=1}^{p} a_k S(z) z^{-k} = S(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right] = S(z) A(z)$$
where A(z) is a transfer function specified by the LPC coefficients. We can write the original signal in terms of the error signal E(z) and the transfer function A(z), and approximate E(z) by a constant gain term (since the error should have a flat spectrum):
$$S(z) = \frac{E(z)}{A(z)} \approx \frac{G}{A(z)}$$
The LPC coefficients therefore define an all-pole (IIR) filter that models the spectral shape (spectral envelope) of the input speech (formants and spectral tilt due to glottal source).
10
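LPC: Spectral Representation
We can compute spectral envelope magnitude from LPC parameters by evaluating the transfer function S(z) for z = e^{jω}:
$$S(e^{j\omega}) = \frac{G}{A(e^{j\omega})} = \frac{G}{1 - \sum_{k=1}^{p} a_k e^{-j\omega k}}$$
because $e^{j\omega} = \cos(\omega) + j\sin(\omega)$, the log power spectrum is:
$$10\log_{10}\left|S(e^{j 2\pi n/N})\right|^2 = 10\log_{10}\frac{G^2}{\mathrm{Re}\{A\}^2 + \mathrm{Im}\{A\}^2}$$
$$\mathrm{Re}\{A\} = 1 - \sum_{k=1}^{p} a_k \cos\left(\frac{2\pi n k}{N}\right), \qquad \mathrm{Im}\{A\} = \sum_{k=1}^{p} a_k \sin\left(\frac{2\pi n k}{N}\right), \qquad 0 \le n \le N$$
Each formant (complex pole) in spectrum requires two LPC coefficients; each spectral slope factor (frequency=0 or Nyquist frequency) requires one LPC coefficient.
For 8 kHz speech, 4 formants → LPC order of 9 or 10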
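To make the spectral evaluation concrete, here is a small Python sketch that evaluates 10·log10 |G / A(e^{jω})|² at ω = 2πn/N. The function name `lpc_log_spectrum` and the two-coefficient example (one pole pair at radius 0.95 and 1000 Hz, with 8 kHz sampling) are assumptions chosen for illustration.

```python
import cmath
import math

def lpc_log_spectrum(a, gain, n_points):
    """Log power spectrum of G / A(z) at z = exp(j*2*pi*n/N), n = 0 .. N/2."""
    spectrum = []
    for n in range(n_points // 2 + 1):
        w = 2.0 * math.pi * n / n_points
        # A(e^{jw}) = 1 - sum_k a_k e^{-jwk}; enumerate() starts at 0, so use k+1.
        big_a = 1.0 - sum(ak * cmath.exp(-1j * w * (k + 1))
                          for k, ak in enumerate(a))
        spectrum.append(10.0 * math.log10(
            gain ** 2 / (big_a.real ** 2 + big_a.imag ** 2)))
    return spectrum

# One resonance: pole pair at radius 0.95 and angle 2*pi*1000/8000.
radius, theta = 0.95, 2 * math.pi * 1000 / 8000
a_example = [2 * radius * math.cos(theta), -radius ** 2]
envelope = lpc_log_spectrum(a_example, 1.0, 256)
peak_bin = envelope.index(max(envelope))   # bin spacing is 8000/256 Hz
```

With 256 points at 8 kHz each bin spans 31.25 Hz, so a resonance at 1000 Hz shows up as a peak at bin 32.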
11
LPC: Spectral Representation
12
LPC: Estimating Formants
The transfer function
$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$$
can be re-written as a product
$$A(z) = \prod_{k=1}^{p} \left(1 - z_k z^{-1}\right)$$
where zk are the roots of the predictor polynomial.
Roots for (resonant) poles that aren’t at 0 Hz or the Nyquist frequency will occur in pairs, symmetric around the real (x) axis. If we solve for the roots zk, we can determine the frequencies and bandwidths of the poles.
[Figure: complex z-plane with unit circle; axes real(z) from -1 to 1 and imaginary(z) from -j to j; a conjugate pole pair at radius r, mirrored across the real axis]
13
LPC: Estimating Formants
We can express these complex roots (or poles of the filter) in terms of angle and radius relative to the unit circle by converting from Cartesian coordinates in the complex plane to polar coordinates (r, θ).
Angle corresponds to frequency, and radius corresponds to bandwidth. So we can determine the pole (or resonant) frequencies and bandwidths (converting to Hz) as:
$$\text{frequency} = \frac{F_s}{2\pi}\,\theta = \frac{F_s}{2\pi}\,\operatorname{arctan2}(\mathrm{Im}\{z_k\}, \mathrm{Re}\{z_k\})$$
$$\text{bw} = -\log(r)\,\frac{F_s}{\pi} = -\log(\operatorname{abs}(z_k))\,\frac{F_s}{\pi} = -\log\left(\sqrt{\mathrm{Re}\{z_k\}^2 + \mathrm{Im}\{z_k\}^2}\right)\frac{F_s}{\pi}$$
Formants are typically the resonances with the smallest bandwidths.
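The conversion from a root to a frequency and bandwidth can be sketched as below. The function name `pole_to_formant` and the example pole (radius 0.95 at 1000 Hz, 8 kHz sampling) are hypothetical; root finding itself is left out, so the sketch assumes a root z_k is already available.

```python
import cmath
import math

def pole_to_formant(z, fs):
    """Convert one complex pole to (frequency in Hz, bandwidth in Hz)."""
    theta = math.atan2(z.imag, z.real)      # angle -> frequency
    r = abs(z)                              # radius -> bandwidth
    frequency = theta * fs / (2.0 * math.pi)
    bandwidth = -math.log(r) * fs / math.pi
    return frequency, bandwidth

fs = 8000
pole = cmath.rect(0.95, 2 * math.pi * 1000 / fs)
freq_hz, bw_hz = pole_to_formant(pole, fs)   # about 1000 Hz and about 131 Hz
```

Poles closer to the unit circle (r near 1) give narrower bandwidths, which is why the narrowest resonances are the usual formant candidates.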
14
Pitch Estimation: Autocorrelation Method
Autocorrelation of speech signals: (from Rabiner & Schafer, p. 143)
15
Pitch Estimation: Autocorrelation Method
Autocorrelation (AC) can be used to determine F0, by finding the local maximum in the AC signal that is (a) within range of expected F0 values and (b) above a threshold:
However, the local maximum does not always correspond with the correct T0 value (F0 = 1/T0)
30 samples = 8000/30 ≈ 267 Hz = F0max
100 samples = 8000/100 = 80 Hz = F0min
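A minimal Python sketch of this search, assuming 8 kHz speech and the 30 to 100 sample lag range from the slide. The function name `autocorr_f0`, the threshold of 0.3 on the normalized autocorrelation, and the synthetic two-harmonic frame are illustrative assumptions; for simplicity the sketch takes the largest value in the lag range rather than testing each local maximum.

```python
import math

def autocorr_f0(frame, fs, min_lag=30, max_lag=100, threshold=0.3):
    """Return F0 in Hz from the largest normalized AC value, or None."""
    r0 = sum(v * v for v in frame)
    if r0 == 0.0:
        return None                      # silent frame: no estimate
    best_lag, best_val = None, threshold
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[m] * frame[m + lag] for m in range(len(frame) - lag))
        if r / r0 > best_val:
            best_lag, best_val = lag, r / r0
    return fs / best_lag if best_lag is not None else None

fs = 8000
# Synthetic voiced frame: 125 Hz fundamental plus a weaker second harmonic.
frame = [math.sin(2 * math.pi * 125 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 250 * n / fs) for n in range(400)]
f0 = autocorr_f0(frame, fs)
```

Frames whose normalized peak stays below the threshold return `None`, which is one simple way to treat them as unvoiced.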
16
Pitch Estimation: Autocorrelation Method
Problems:
1. A high F0 (e.g. 180 Hz) will have two peaks within range, e.g. 180 Hz and 90 Hz. This may cause “pitch halving” error
2. Formants will influence the strength of peaks. For example, if F0 is 120 Hz, but the strongest energy in the waveform is due to the first formant at 240 Hz (e.g. the vowel /i:/), the highest local maximum in the AC may be at 240 Hz (“pitch doubling” error)
Want an F0 estimation method that is not sensitive to formants
[Figure: spectrum with frequency axis marked at 120, 240, 360, 480, and 600 Hz]
17
Pitch Estimation: SIFT Method
SIFT Method:
• LPC analysis (order 5) of low-pass-filtered waveform (800 Hz)
• inverse filter: obtain signal without formants
• autocorrelation: measure periodicity
• decision based on height of autocorrelation peak
[Figure: panels with time and frequency axes; labels: 4000 Hz, 800 Hz]
18
Pitch Estimation: SIFT Method
Problems with SIFT:
1. For some sounds (e.g. nasals, vowel-to-silence transitions) the signal is dominated by a single harmonic (close to sine wave). LPC analysis and inverse filtering can remove the formant information and the glottal-source information, leaving only white noise.
2. Still have the problem that a high F0 (e.g. 180 Hz) will have two AC peaks within range, e.g. 180 Hz and 90 Hz. This may cause “pitch halving” error
Rather than use autocorrelation, we can measure F0 from information in the spectrum, by identifying harmonics
19
Pitch Estimation: Harmonic Sieve Method
F0 estimation in the spectral domain often uses the “harmonic sieve” method, which relies on the fact that F0 harmonics must occur at multiples of the fundamental frequency. If we sum the power-spectrum energy values at multiples of a given frequency, then the frequency value that yields the largest energy sum should be F0.
test F0 = 100 Hz: Normalized sum of energies at 100, 200, …, 4000 Hz is maximum
test F0 = 110 Hz: Normalized sum of energies at 110, 220, …, 3960 Hz is relatively small
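The sieve idea can be sketched in Python as below, comparing the two test frequencies from the slide. This is an illustration under assumptions: the power at each harmonic is obtained by direct correlation with a complex exponential rather than an FFT, the score is normalized by the number of harmonics, and the names `power_at` and `sieve_score` and the synthetic 10-harmonic frame are hypothetical.

```python
import cmath
import math

def power_at(frame, fs, freq):
    """Power of the frame at one frequency (single-bin DFT)."""
    acc = sum(v * cmath.exp(-2j * math.pi * freq * n / fs)
              for n, v in enumerate(frame))
    return abs(acc) ** 2

def sieve_score(frame, fs, f0):
    """Sum of power at multiples of f0 up to fs/2, divided by their count."""
    n_harmonics = int((fs / 2) // f0)
    total = sum(power_at(frame, fs, f0 * h) for h in range(1, n_harmonics + 1))
    return total / n_harmonics

fs = 8000
# Synthetic frame (40 msec): ten equal-amplitude harmonics of 100 Hz.
frame = [sum(math.sin(2 * math.pi * 100 * h * n / fs) for h in range(1, 11))
         for n in range(320)]
score_100 = sieve_score(frame, fs, 100.0)   # harmonics at 100, 200, ..., 4000 Hz
score_110 = sieve_score(frame, fs, 110.0)   # harmonics at 110, 220, ..., 3960 Hz
```

A full estimator would evaluate the score over a grid of candidate F0 values and pick the largest.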
20
Pitch Estimation: Harmonic Sieve Method
Problems with harmonic sieve method:
1. In order to resolve harmonics, need high frequency resolution in power spectrum. This requires a large number of waveform samples at each frame (e.g. 256 samples (32 msec) or more). F0 may change quickly within 30 or 40 msec, and these quick changes in F0 cannot be reliably identified.
2. This method is susceptible to pitch-doubling errors, because (a) normalization by number of harmonics will reduce energy in low-F0 case to be approximately equal to energy of doubled F0, and (b) if normalization is not performed, there is a bias toward lower F0 values, which include more harmonics.
21
Pitch Estimation: Cepstral Method
cepstrum: treat spectrum as signal subject to frequency analysis…
1. Compute log power spectrum
2. Compute FFT of log power spectrum
Cepstral method: F0 information is encoded in higher cepstral coefficients
[Figure: waveform (amplitude vs. time), log power spectra (energy in dB vs. frequency), and cepstrum (amplitude vs. quefrency) with the annotation “this peak indicates F0”]
22
Pitch Estimation: Cepstral Method
F0 estimation in the cepstral domain:
(1) compute the cepstrum
(2) determine if there are values within normal F0 ranges that are above some pre-defined threshold
(3) if there are such values, find the maximum value
(4) F0 is computed from the inverse of the index of this maximum
[Figure: cepstrum with the search range marked from F0max to F0min]
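The four steps can be sketched in pure Python as follows. A naive DFT stands in for the FFT to keep the example self-contained; the names `dft` and `cepstral_f0`, the 30 to 100 sample quefrency range (8 kHz speech), the spectral floor of 1e-12, and the idealized impulse-train test frame are all assumptions made for illustration.

```python
import cmath
import math

def dft(values):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * m / n)
                for m, v in enumerate(values)) for k in range(n)]

def cepstral_f0(frame, fs, min_lag=30, max_lag=100, threshold=0.0):
    """Return F0 from the largest cepstral value in the lag range, or None."""
    # (1) cepstrum: transform of the log power spectrum.
    log_power = [math.log10(max(abs(c) ** 2, 1e-12)) for c in dft(frame)]
    cep = [abs(c) / len(frame) for c in dft(log_power)]
    # (2)-(3) largest value within the expected F0 range, if above threshold.
    lag = max(range(min_lag, max_lag + 1), key=lambda q: cep[q])
    if cep[lag] <= threshold:
        return None
    # (4) F0 is the inverse of the peak index (converted to Hz).
    return fs / lag

fs = 8000
# Idealized glottal excitation: one impulse every 64 samples (125 Hz).
frame = [1.0 if m % 64 == 0 else 0.0 for m in range(256)]
f0 = cepstral_f0(frame, fs)
```

Real speech gives a much lower peak than this impulse train, so the threshold in step (2) would have to be tuned on data.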
23
Pitch Estimation: Cepstral Method
Problems with the cepstral method:
1. Often there are only a few harmonics (e.g. nasals, /w/) that identify F0. In this case, the height of the cepstral peak will be very low, leading to peak-location (F0) estimation errors
2. Humans can identify pitch from a small number of harmonics, and some of the harmonics may be “missing”. Cepstral method is not robust in these cases
[Figure: three harmonic spectra on a frequency axis from 200 Hz to 2.6 kHz (ticks every 200 Hz): (1) perceived pitch = 200 Hz; (2) missing fundamental, perceived pitch = ??? Hz; (3) most harmonics missing, perceived pitch = ??? Hz]
24
Pitch Estimation: Dynamic Programming
Many errors in F0 estimation occur briefly (especially at the beginning and end of voiced sounds) and have a large difference from (a) the correct F0 value and (b) neighboring F0 values.
One method of obtaining a smooth F0 contour is to apply a Viterbi search to a number of F0 estimates at each frame. In this case, the transition probability is constrained so that large changes in F0 from one frame to the next are prohibited.
[Figure: AC(T0) values at frames t = 1 2 3 4 5; annotations: “transition from previous frame limited to range of neighboring T0 values” and “maximum value at t=13 is here”]
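A constrained Viterbi search of this kind can be sketched as below. This is a simplified illustration: `scores[t][i]` is a hypothetical per-frame evidence value (for example an autocorrelation value) for T0 candidate i, the path score is a plain sum, and the hard limit `max_jump` plays the role of the constrained transition probability.

```python
def viterbi_track(scores, max_jump=2):
    """Best path of candidate indices, moving at most max_jump per frame."""
    n = len(scores[0])
    total = list(scores[0])
    back = []
    for frame in scores[1:]:
        new_total, ptr = [], []
        for i in range(n):
            lo, hi = max(0, i - max_jump), min(n, i + max_jump + 1)
            # Best predecessor among the neighboring candidates only.
            j = max(range(lo, hi), key=lambda c: total[c])
            new_total.append(total[j] + frame[i])
            ptr.append(j)
        total = new_total
        back.append(ptr)
    # Trace back from the best final candidate.
    path = [max(range(n), key=lambda c: total[c])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Three frames, ten T0 candidates. Frame 2 has a spurious maximum at index 9.
frames = [[0.0] * 10 for _ in range(3)]
frames[0][4] = 1.0
frames[1][5], frames[1][9] = 0.8, 1.0
frames[2][4] = 1.0
path = viterbi_track(frames)        # the jump 4 -> 9 -> 4 is not allowed
greedy = [row.index(max(row)) for row in frames]
```

Picking the per-frame maximum (`greedy`) follows the spurious peak, while the constrained search stays on the smooth contour.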
25
Pitch Estimation: Band-Pass Method (new)
If the expected F0 range is known in advance, constraints (e.g. limiting F0max and F0min) can be applied to reduce these errors. For F0 estimation for both children and adults, the expected F0 range can be too large (e.g. 50 to 400 Hz) for such constraints to be effective.
The “band-pass” algorithm is based on an interpretation of Moore’s summarization of pitch identification in humans. In the band-pass method, information from 32 band-pass filters is combined at every frame, and then a Viterbi search provides an F0 contour estimate.
This method does not require constraints on the range of F0, as one set of parameters can be used for adult and children’s speech. It does not utilize autocorrelation, LPC, harmonic sieve, or cepstrum, and so it does not make assumptions about the correlation between pitch periods, nature of the glottal source, or existence of a large number of harmonics. It is robust to different formant frequencies.
26
Pitch Estimation: Band-Pass Method (new)
STEP (1): FILTERING
The speech signal is passed through 32 9-tap IIR filters, with (narrow) filter bandwidths determined from the Equivalent Rectangular Bandwidth (ERB) scale:
ERB(f) = 0.108f + 24.7
where f is the center frequency, in Hz. As a result, in most cases, no more than one harmonic occupies one filter’s frequency range.
The first filter is centered at 100 Hz, and each subsequent filter has a center frequency approximately one-half bandwidth higher than the previous filter’s center frequency.
These filter outputs are approximately sine functions, with frequency equal to the harmonic within (or closest to) the filter’s bandwidth. In cases where multiple harmonics are within one band, the outputs are no longer simple sine functions, and are discarded.
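The filter layout described above can be sketched as follows, assuming each band spans the center frequency plus or minus half its ERB bandwidth and each new center is exactly half of the previous filter's bandwidth higher. The names `erb` and `filter_bank` are hypothetical. Under these assumptions the first four bands agree with the lecture's filter table (82-118, 99-137, 117-156, 135-177 Hz) to within 1 Hz; higher bands drift from the table, so the original implementation evidently differs in detail, consistent with the slide's word “approximately”.

```python
def erb(f):
    """Equivalent Rectangular Bandwidth (Hz) at center frequency f (Hz)."""
    return 0.108 * f + 24.7

def filter_bank(first_center=100.0, n_filters=32):
    """Return (index, low edge, high edge, bandwidth) for each filter."""
    bands = []
    center = first_center
    for idx in range(1, n_filters + 1):
        bw = erb(center)
        bands.append((idx, center - bw / 2.0, center + bw / 2.0, bw))
        center += bw / 2.0          # next center: one-half bandwidth higher
    return bands

bands = filter_bank()
first_four = [(round(lo), round(hi)) for _, lo, hi, _ in bands[:4]]
```

Because neighboring bands overlap by roughly half, a given harmonic usually falls inside two adjacent filters.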
27
Pitch Estimation: Band-Pass Method (new)
STEP (1): FILTERING
Filters:
From (Hz)  To (Hz)  Bandwidth (Hz)  Idx
82         118      36              1
99         137      38              2
117        156      39              3
135        177      42              4
155        199      44              5
177        223      46              6
197        245      48              7
221        272      51              8
242        296      54              9
…          …        …
902        1031     129             26
957        1092     135             27
1024       1167     143             28
1085       1235     150             29
1159       1318     159             30
1226       1393     167             31
1309       1484     175             32
28
Pitch Estimation: Band-Pass Method (new)
STEP (2): FIND PERIODICITY
The period-to-period maxima of the (sine-wave) filter outputs are located in order to identify the periodicity of the signal at each frame. If the identified periodicity is beyond the frequency limits of the filter (noise or multiple harmonics), the periodicity is set to zero.
Find periodicity in each band at this frame by simple location of local maxima.
[Figure: filter outputs for the bands 603-697 Hz, 643-742 Hz, 692-797 Hz, 737-847 Hz, 792-908 Hz, and 841-963 Hz]
29
Pitch Estimation: Band-Pass Method (new)
STEP (3): CREATE HISTOGRAM
For each frame (e.g. 1 msec), a histogram is computed:
(A) Initialize the histogram, with one bin for each periodicity value (154 values, representing 50 – 1334 Hz), with all bins set to 0.
(B) Determine the filter output with greatest energy at this frame, Me
(C) For each filter with energy > Me − 12 dB and periodicity > 0, the histogram is increased by 1.0 near this periodicity value p, within the range p-5 to p+5. This increase is repeated for all integer multiples of p.
(D) The maximum histogram value, Mh, and bin containing this maximum, b, are determined. All values are normalized by Mh. All values at bins greater than 2b are decreased slightly (by a factor of 0.95) to avoid F0-halving errors.
Frequency is determined from periodicity by Fs/p, where Fs is the sampling frequency, and p is a periodicity value.
30
Pitch Estimation: Band-Pass Method (new)
[Figure: outputs of the 32 band-pass filters (82-118 Hz through 1309-1484 Hz) on a 0 to 1500 Hz axis, labeled with the periodicity found in each band (889, 800, 727, 667, 444, 400, 242, 228, 235, 222 Hz), and the resulting histogram (histogram count vs. periodicity in samples, ticks from 9 to 75) with a label at 216 Hz]
Example: 889 Hz = periodicity of 9 samples (3 counts). Counts are also added at its integer multiples: 9x1 = 889 Hz, 9x2 = 444 Hz, 9x3 = 296 Hz, 9x4 = 222 Hz, …, 9x8 = 111 Hz.
31
Pitch Estimation: Band-Pass Method (new)
STEP (4): VITERBI SEARCH
A Viterbi search is performed on the sequence of histograms at each time frame. Transitions are constrained between frames t and t+1 to change by no more than 2 periodicity values (e.g. 2.5 Hz/msec when F0=100 Hz, 10 Hz/msec when F0=200 Hz). The result is the F0 contour with the largest global histogram value.
32
Pitch Estimation: Band-Pass Method (new)
One issue is the possibility of pitch-halving errors, because lower F0 values can have an equally-large histogram count as the correct F0. This is avoided by (a) finding the first large peak, and (b) multiplying all T0 values above this peak by 0.95
Another issue is that this method will find an F0 value for all frames of speech, even frames that are unvoiced. (There is no threshold as in the cepstral method to determine whether a frame is voiced or unvoiced.)
33
Pitch Estimation: Band-Pass Method (new)
Two corpora were used in evaluation, the MWM and LSR corpora.
Average F0 for the MWM corpus was 118 Hz (range 50–250 Hz); average F0 for the LSR corpus was 250 Hz (range 96–402 Hz).
Evaluation was performed by (a) computing the average absolute difference between correct F0 and measured F0, over all frames at which an F0 value was obtained, (b) computing average percent error (absolute difference / correct F0), over all such frames.
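The two evaluation measures can be written out as a short Python sketch. The function name `f0_errors`, the use of `None` for frames without an F0 estimate, and the example values are assumptions for illustration, not data from the lecture.

```python
def f0_errors(reference, measured):
    """Average absolute difference (Hz) and average percent error.

    Frames where no F0 value was obtained (measured is None) are skipped.
    """
    pairs = [(r, m) for r, m in zip(reference, measured) if m is not None]
    abs_diffs = [abs(m - r) for r, m in pairs]
    mean_abs = sum(abs_diffs) / len(pairs)
    mean_pct = 100.0 * sum(d / r for d, (r, _) in zip(abs_diffs, pairs)) / len(pairs)
    return mean_abs, mean_pct

# Hypothetical frames: the third frame has no measured F0 and is ignored.
mean_abs, mean_pct = f0_errors([100.0, 200.0, 150.0], [102.0, 190.0, None])
```

Percent error weights each frame by its own correct F0, so the same absolute error counts for less on high-pitched (e.g. children's) speech.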
In addition, results on the LSR corpus were manually compared with Kay Elemetrics’ CSL F0 estimation when the difference between the results in either vowel of a word exceeded 30 Hz.
Corpus Name | Speakers | Material | Recording Conditions | Correct F0 Measurement
MWM | 1 adult male | 450 sentences | anechoic booth, high quality microphone | laryngograph-based, determined every 5 msec
LSR | 35 children, ages 3-14 | 8 isolated 2-syllable words | Tascam cassette recorder | manual estimation at vowel centers
34
Pitch Estimation: Band-Pass Method (new)
Results:
Corpus | Avg. Absolute Difference | Percent Error | Standard Deviation
MWM | 3.99 Hz | 3.86% | 12.19 Hz
LSR | 6.20 Hz | 2.58% | 13.59 Hz
Method | Avg. Absolute Diff. | Standard Deviation
Harris & Nelson | 9.88 Hz | 30.08 Hz
SIFT | 9.01 Hz | 29.06 Hz
Modified SIFT | 5.71 Hz | 21.69 Hz
Band-Pass | 3.99 Hz | 12.19 Hz
For the comparison with CSL on the LSR corpus, out of 33 words with at least one vowel having an F0 difference greater than 30 Hz, CSL had 30 errors greater than 30 Hz, while the proposed method had 8 errors greater than 30 Hz.
35
Detecting Glottalization
What is glottalization? (Also called “creaky voice”)
Here, we define it as irregular or low-frequency (20-70 Hz) vibration of vocal folds during voicing. (The term “glottalization” is also used when describing certain articulations of stop consonants.)
Glottalization can occur quite often in some speakers, as a speaking style. It may occur frequently at end of sentence, when signaling phoneme boundary between two similar sounds (e.g. “E.E.”), or when signaling word boundary between potentially ambiguous words (e.g. “heavy oak” vs. “heavy yoke”). It may also occur more frequently when a speaker has been talking a lot.
Very little published work on detecting glottalization. However, glottalization makes F0 estimation and synthesis difficult, and it may be a relevant factor in diagnosing speech disorders.
36
Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
Peak-to-Peak Amplitude (Cole, 1988)
Compute peak-to-peak amplitude (difference between max and min amplitude) of signal using variable-length analysis window.
Assumption: glottalization has F0 significantly smaller than surrounding voiced sounds. Therefore, F0-related amplitude changes can be detected using analysis window length determined from 1.3 times the median F0.
[Figure panels: wave, spectrogram, phonemes, long-term F0, PtP amplitude]
37
Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
Feature Classification (Hosom, 2000)
Use neural-network classifier with (a) standard MFCC or PLP features, or (b) standard features plus relative change in energy using analysis window of 2(long-term F0).
Assumption: standard classifier can identify changes in energy and source characteristics using standard features, or standard features augmented with feature similar to PtP feature.
Results (insertion and deletion errors, within 20 msec):
Corpus | Baseline error rate (%) | Baseline + Rel. Energy error rate (%)
TIMIT | 13.23 | 12.08
Stories | 17.43 | 17.78
Portland Cellular | 16.54 | 16.89
38
Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
Autocorrelation (Ishi, 2004)
Estimate glottal-source waveform by inverse filtering with the LPC coefficients. Compute autocorrelation of glottal-source waveform.
Assumption: long delay between impulses in glottalized speech yields non-zero correlations in between glottal pulses
[Figure: autocorrelation examples for normal speech and glottalized speech (from Ishi, 2004)]
39
Detecting Glottalization
Three techniques: PtP Amplitude, Feature Classification, Autocorrelation
A decision tree was applied to various parameters determined from the first two autocorrelation peaks (e.g. relative peak amplitude, relative peak position)
Decision tree yielded error rate of 21.6%. However, definition of “creaky” included abnormal F0 patterns as well as low-F0 patterns.
40
Detecting Bursts
What are bursts?
Increase in energy that is characteristic of stop consonants, after the closure, when buildup of pressure has been released
Not much prior work on detecting bursts. However, detecting bursts is critical for measuring voice-onset-time (VOT), which is the time from burst to onset of voicing.
VOT is important for phoneme identity (e.g. distinguishing /p/ from /b/), and may be important in detection of Parkinson’s Disease (PD), where control over VOT may be reduced, leading to reduced intelligibility.
41
Detecting Bursts
Four basic methods: Change in Energy, HMM, SVM Classifier, and Candidate Selection.
1. Change in Energy (Liu, 95): bursts characterized by closure (silence) then burst (high energy), so compute change in energy over entire utterance; if the change is above a threshold, mark as a burst
2. HMM (Niyogi, 99): use a phoneme-level HMM to identify all phonemes in an utterance. Beginning of each plosive is identified as a burst.
3. Support-Vector Machine (SVM) Classifier (Niyogi, 99; Keshet, 01): two SVMs implemented, for linear and non-linear classification, using as features log energy of entire spectrum, log energy of 3 to 8 kHz, and a spectral flatness measure
4. Candidate Selection (Hosom, 00): select “candidate bursts” based on change in relative energy; classify candidates using ANN with cepstral features.
42
Detecting Bursts
4. Candidate Selection Method (in detail):
Generate Candidates: Measure relative change in energy at eight Bark-scale frequency bands. Perform equal-loudness weighting of energy bands, so that perceptually-relevant bands have greater weight. Transform energy values to scale 0 to 1, representing “probability of burst in this band”. Combine probabilities using Bayes’ Rule.
Select Candidates: Using a fixed threshold determined from development data (.075), select time points above threshold for classification
Classification: For each candidate, compute cepstral features at that time point and surrounding time points. Use Artificial Neural Network (ANN) to classify features as “burst” or “non-burst”.
43
Detecting Bursts
4. Candidate Selection Method (illustration):
44
Detecting Bursts
Results, relative to number of burst and non-burst phonemes (not frames). Threshold of 20 msec, evaluated on TIMIT corpus.
CANDIDATE