Speech Signal Representations I

Seminar Speech Recognition 2002

F.R. Verhage

Decomposition of the speech signal (x[n]) as a source (e[n]) passed through a linear time-varying filter (h[n]).


Estimation of the filter, inspired by:

Speech production models
– Linear Predictive Coding (LPC)
– Cepstral analysis

Speech perception models (part II)
– Mel-frequency cepstrum
– Perceptual Linear Prediction (PLP)

Speech recognizers estimate the filter characteristics and ignore the source.


Short-Time Fourier Analysis

Spectrogram
– Representation of a signal highlighting several of its properties, based on short-time Fourier analysis
– Two-dimensional: time on the horizontal axis, frequency on the vertical axis
– Third ‘dimension’: gray or color level indicating energy



Spectrogram
– Narrow band
  Long windows (> 20 ms) → narrow bandwidth
  Lower time resolution, better frequency resolution
– Wide band
  Short windows (< 10 ms) → wide bandwidth
  Good time resolution, lower frequency resolution
– Pitch synchronous
  Requires knowledge of the local pitch period
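The trade-off between window length and frequency resolution can be made concrete with a small calculation. The sketch below uses the common textbook approximations for the null-to-null main-lobe width (2/T for a rectangular window, 4/T for a Hamming window, with T the window duration in seconds); the function name `mainlobe_width_hz` is made up for this illustration.

```python
def mainlobe_width_hz(window_ms, window="hamming"):
    """Approximate null-to-null main-lobe width in Hz.

    Textbook approximations: 2/T for a rectangular window and 4/T for a
    Hamming window, with T the window duration in seconds.
    """
    duration_s = window_ms / 1000.0
    factor = 2.0 if window == "rectangular" else 4.0
    return factor / duration_s


# Compare a long (narrow-band) and a short (wide-band) analysis window.
for window_ms in (40.0, 5.0):
    width = mainlobe_width_hz(window_ms)
    print(f"{window_ms:4.0f} ms Hamming window: main lobe about {width:.0f} Hz wide")
```

With harmonics spaced roughly 100 to 200 Hz apart, the long window is narrow enough to show individual harmonics, while the short window smears them into a smooth envelope.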



[Figure: spectrogram]



Window analysis
– Series of short segments (analysis frames)
– Short enough that the signal can be treated as stationary within a frame
– Frame length usually constant, 20-30 ms
– Overlaps possible
– Different types of window functions ($w_m[n]$):
  Rectangular (equal to no window function)
  Hamming
  Hanning

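$$X_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x_m[n]\, e^{-j\omega n} = \sum_{n=-\infty}^{\infty} w[m-n]\, x[n]\, e^{-j\omega n}$$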
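As a rough illustration of the short-time Fourier transform of one analysis frame, the sketch below windows a segment of a signal and evaluates its DFT with the standard library only. The test signal (a 500 Hz sinusoid sampled at 8 kHz), the 20 ms frame length and the helper names are assumptions made for the example; the window is positioned by a frame start index rather than by the shifted form w[m-n].

```python
import cmath
import math


def hamming(N):
    """Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def frame_spectrum(x, start, window):
    """DFT of the windowed analysis frame x[start : start + len(window)]."""
    N = len(window)
    frame = [window[n] * x[start + n] for n in range(N)]
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


# Toy input (assumed): 8 kHz sampling rate, 500 Hz sinusoid.
fs = 8000
x = [math.sin(2 * math.pi * 500 * n / fs) for n in range(400)]

N = 160  # 20 ms frame at 8 kHz
X = frame_spectrum(x, 0, hamming(N))
magnitudes = [abs(v) for v in X[: N // 2]]
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
peak_hz = peak_bin * fs / N
print(f"strongest component near {peak_hz:.0f} Hz")
```

Repeating this for successive frame start positions and stacking the magnitudes column by column gives the spectrogram.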



Window analysis
– Window size must be long enough (N: window length, M: pitch period, both in samples)
  Rectangular: N ≥ M
  Hamming, Hanning: N ≥ 2M
– Pitch period not known in advance → prepare for the lowest pitch (longest pitch period) →
  at least 20 ms for rectangular or 40 ms for Hamming/Hanning (50 Hz)
– But longer windows give a more averaged spectrum instead of distinct spectra →
  the rectangular window has better time resolution
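The 20 ms and 40 ms figures follow directly from the window-length rule applied to a 50 Hz lowest pitch. A minimal sketch of that arithmetic, assuming an 8 kHz sampling rate (the sampling rate and function names are not from the slides):

```python
import math


def min_window_samples(lowest_pitch_hz, fs, window="rectangular"):
    """Smallest window length N (samples) covering the longest pitch period.

    Rectangular windows need N >= M, Hamming/Hanning windows need N >= 2M,
    where M = fs / lowest_pitch_hz is the pitch period in samples.
    """
    pitch_period = fs / lowest_pitch_hz
    factor = 1 if window == "rectangular" else 2
    return math.ceil(factor * pitch_period)


def min_window_ms(lowest_pitch_hz, fs, window="rectangular"):
    """Same bound expressed in milliseconds."""
    return 1000.0 * min_window_samples(lowest_pitch_hz, fs, window) / fs


fs = 8000  # assumed sampling rate in Hz
for window in ("rectangular", "hamming"):
    print(window, min_window_samples(50, fs, window), "samples,",
          min_window_ms(50, fs, window), "ms")
```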

















Window analysis
– Frequency response is not completely zero outside the main lobe → spectral leakage
– Second lobe of a Hamming window is approx. 43 dB below the main lobe → less spectral leakage
– Hamming, Hanning and triangular windows offer less spectral leakage →
  rectangular windows are rarely used despite their better time resolution
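The difference in leakage can be checked numerically by evaluating each window's frequency response on a dense grid and measuring the highest lobe outside the main lobe. This is only a sketch: the window length of 64 samples, the grid size and the helper names are arbitrary choices for the illustration, and the exact dB values depend on them.

```python
import cmath
import math


def rectangular(N):
    return [1.0] * N


def hamming(N):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def peak_sidelobe_db(window, grid=2048):
    """Highest sidelobe relative to the main-lobe peak, in dB."""
    mags = []
    for i in range(grid):
        omega = math.pi * i / grid
        value = sum(w * cmath.exp(-1j * omega * n) for n, w in enumerate(window))
        mags.append(abs(value))
    # Walk down the main lobe until the magnitude stops decreasing.
    i = 0
    while i + 1 < grid and mags[i + 1] < mags[i]:
        i += 1
    # Everything from that first minimum onward counts as sidelobes.
    return 20 * math.log10(max(mags[i:]) / mags[0])


rect_db = peak_sidelobe_db(rectangular(64))
hamm_db = peak_sidelobe_db(hamming(64))
print(f"rectangular: {rect_db:.1f} dB, Hamming: {hamm_db:.1f} dB")
```

The rectangular window's sidelobes sit only about 13 dB below the main lobe, which is why it leaks so much more than the Hamming window.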











Short-time spectrum of male voiced speech
a) Time signal /ah/, local pitch 110 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window



Short-time spectrum of female voiced speech
a) Time signal /aa/, local pitch 200 Hz







Short-time spectrum of unvoiced speech
a) Time signal






Linear Predictive Coding

LPC a.k.a. auto-regressive (AR) modeling

An all-pole filter is a good approximation of speech, with p as the order of the LPC analysis:

$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$

Predicts the current sample as a linear combination of the past p samples:

$$\tilde{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k]$$



To estimate the predictor coefficients ($a_k$), use a short-term analysis technique.

Per segment, minimize the total squared prediction error:

$$E_m = \sum_n e_m^2[n] = \sum_n \left( x_m[n] - \tilde{x}_m[n] \right)^2 = \sum_n \left( x_m[n] - \sum_{k=1}^{p} a_k\, x_m[n-k] \right)^2$$

Take the derivative with respect to each $a_i$ and equate it to 0; the result can be expressed as a set of p linear equations, the Yule-Walker equations:

$$\sum_{k=1}^{p} a_k\, \phi_m[i,k] = \phi_m[i,0], \qquad i = 1, 2, \ldots, p$$

where $\phi_m[i,k] = \sum_n x_m[n-i]\, x_m[n-k]$.
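The step from the squared error to the linear equations can be written out explicitly. The derivation below is the standard one; the correlation symbol $\phi_m[i,k]$ is the usual textbook notation and is assumed here rather than taken from the slide.

```latex
\frac{\partial E_m}{\partial a_i}
  = -2 \sum_n \Big( x_m[n] - \sum_{k=1}^{p} a_k\, x_m[n-k] \Big)\, x_m[n-i] = 0,
  \qquad i = 1, \ldots, p

\Longrightarrow \quad
\sum_{k=1}^{p} a_k \underbrace{\sum_n x_m[n-i]\, x_m[n-k]}_{\phi_m[i,k]}
  = \underbrace{\sum_n x_m[n-i]\, x_m[n]}_{\phi_m[i,0]}
```

Each value of i gives one equation, so the p unknown coefficients are determined by p linear equations.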



Solution of the Yule-Walker equations:
– Any standard matrix inversion package
– Due to the special form of the matrix, efficient solutions exist:
  Covariance method
    using the Cholesky decomposition
  Autocorrelation method
    using windows; results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
  Lattice method
    equivalent to the Levinson-Durbin recursion; often used in fixed-point implementations because lack of precision doesn’t result in unstable filters
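For the autocorrelation method, the Toeplitz system can be solved order by order. The following is a minimal sketch of the Levinson-Durbin recursion in its usual textbook form, using the sign convention in which the predictor is a weighted sum of past samples with weights a_k; the function names and the small autocorrelation sequence are made up for the example.

```python
def autocorrelation(x, p):
    """Autocorrelation values R[0..p] of an (already windowed) frame x."""
    N = len(x)
    return [sum(x[n] * x[n - k] for n in range(k, N)) for k in range(p + 1)]


def levinson_durbin(R, p):
    """Solve sum_k a_k R[|i-k|] = R[i], i = 1..p, for the predictor a_1..a_p.

    Returns (a, reflection, E): predictor coefficients, reflection
    coefficients and the final prediction error energy.
    """
    a = [0.0] * (p + 1)  # a[0] is unused so indices match the equations
    reflection = []
    E = R[0]
    for i in range(1, p + 1):
        acc = R[i] - sum(a[j] * R[i - j] for j in range(1, i))
        k = acc / E
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E *= 1.0 - k * k
        reflection.append(k)
    return a[1:], reflection, E


# Toy autocorrelation sequence (assumed values, order p = 2).
R = [2.0, 1.0, 0.2]
a, reflection, E = levinson_durbin(R, 2)
print("predictor:", a, "reflection:", reflection, "error:", E)
```

The reflection coefficients fall out of the recursion as a by-product, which is the link to the lattice formulation.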







Spectral analysis via LPC
– All-pole (IIR) filter
– Peaks at the roots of the denominator
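To see how the roots of the denominator show up as spectral peaks, the sketch below evaluates the magnitude response of an all-pole filter on the unit circle. The two predictor coefficients are an assumed example, built from a single pole pair at radius 0.95 and angle π/4, and the function name is hypothetical.

```python
import cmath
import math


def lpc_spectrum_db(a, n_points=256, gain=1.0):
    """Magnitude of gain / A(e^{jw}) in dB at n_points frequencies in [0, pi)."""
    spectrum = []
    for i in range(n_points):
        w = math.pi * i / n_points
        A = 1.0 - sum(a[k] * cmath.exp(-1j * w * (k + 1)) for k in range(len(a)))
        spectrum.append(20 * math.log10(gain / abs(A)))
    return spectrum


# Assumed example: one complex pole pair at radius r and angle theta.
r, theta = 0.95, math.pi / 4
a = [2 * r * math.cos(theta), -r * r]

spectrum = lpc_spectrum_db(a)
peak_index = max(range(len(spectrum)), key=lambda i: spectrum[i])
peak_omega = math.pi * peak_index / len(spectrum)
print(f"spectral peak near {peak_omega:.3f} rad, pole angle {theta:.3f} rad")
```

The closer the pole lies to the unit circle, the sharper the peak and the closer it sits to the pole angle.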



Prediction error
– Should be (approximately) the excitation
– Unvoiced speech: expect white noise; OK
– Voiced speech: expect impulse train; not OK
  All-pole assumption not altogether valid
  Real speech not perfectly periodic
  Pitch-synchronous analysis gives better results
– LPC order
  Larger p gives lower prediction errors
  Too large a p results in fitting the individual harmonics → separation between filter and source will not be so good



Prediction error
– Inverse LPC filter gives residual signal
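Inverse filtering amounts to subtracting the prediction from each sample. The sketch below synthesizes a signal by driving an all-pole filter with an impulse train and then recovers that excitation as the residual; the coefficients, the impulse spacing and the helper names are assumptions chosen for the illustration, and real speech would not give such a clean result.

```python
def predict(x, n, a):
    """Linear prediction of x[n] from the previous len(a) samples."""
    return sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)


def synthesize(e, a):
    """All-pole synthesis: x[n] = e[n] + sum_k a_k x[n-k]."""
    x = []
    for n in range(len(e)):
        x.append(e[n] + predict(x, n, a))
    return x


def lpc_residual(x, a):
    """Inverse filter A(z): e[n] = x[n] - sum_k a_k x[n-k]."""
    return [x[n] - predict(x, n, a) for n in range(len(x))]


# Assumed toy setup: stable second-order predictor, impulse every 20 samples.
a = [0.6, -0.2]
excitation = [1.0 if n % 20 == 0 else 0.0 for n in range(100)]

x = synthesize(excitation, a)
residual = lpc_residual(x, a)
max_error = max(abs(r - e) for r, e in zip(residual, excitation))
print(f"largest difference between residual and excitation: {max_error:.2e}")
```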



Alternatives for the predictor coefficients
– Line Spectral Frequencies
  Local sensitivity
  Efficiency
– Reflection coefficients
  Guaranteed stable → useful when coefficients are interpolated over time
– Log-area ratios
  Flat spectral sensitivity
– Roots of the polynomial
  Represent resonance frequencies and bandwidths
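For the last representation, a complex root of the predictor polynomial maps to a resonance frequency through its angle and to a bandwidth through its radius. A small sketch of that mapping, using the usual approximation B = -(fs/π)·ln|z| for the bandwidth; the pole value and sampling rate are assumed for the example.

```python
import cmath
import math


def pole_to_resonance(pole, fs):
    """Map a complex pole to (resonance frequency, bandwidth), both in Hz."""
    freq_hz = abs(cmath.phase(pole)) * fs / (2 * math.pi)
    bandwidth_hz = -math.log(abs(pole)) * fs / math.pi
    return freq_hz, bandwidth_hz


# Assumed example: pole at radius 0.95 and angle pi/4, sampled at 8 kHz.
fs = 8000
pole = 0.95 * cmath.exp(1j * math.pi / 4)
freq_hz, bandwidth_hz = pole_to_resonance(pole, fs)
print(f"resonance near {freq_hz:.0f} Hz, bandwidth about {bandwidth_hz:.0f} Hz")
```

Poles closer to the unit circle give narrower bandwidths, matching the sharper peaks they produce in the LPC spectrum.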


Cepstral Processing

– A homomorphic transformation converts a convolution into a sum:

$$x[n] = e[n] * h[n]$$

$$\hat{x}[n] = \hat{e}[n] + \hat{h}[n]$$
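The additive property can be checked numerically. The sketch below uses the real cepstrum (inverse DFT of the log magnitude spectrum) as one concrete homomorphic transformation and compares the cepstrum of a convolution with the sum of the individual cepstra. The two short sequences, the DFT length and the helper names are assumptions for the illustration; the DFT length is chosen at least as long as the convolution so that circular and linear convolution agree.

```python
import cmath
import math


def dft(x, N):
    """N-point DFT of x (zero-padded if len(x) < N)."""
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(len(x)))
        for k in range(N)
    ]


def real_cepstrum(x, N):
    """Inverse DFT of the log magnitude spectrum."""
    log_mag = [math.log(abs(X)) for X in dft(x, N)]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
        for n in range(N)
    ]


def convolve(e, h):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(e) + len(h) - 1)
    for i, ei in enumerate(e):
        for j, hj in enumerate(h):
            y[i + j] += ei * hj
    return y


# Assumed toy sequences standing in for source e[n] and filter h[n].
e = [1.0, 0.5]
h = [1.0, -0.3, 0.2]
N = 16

x = convolve(e, h)
c_x = real_cepstrum(x, N)
c_sum = [ce + ch for ce, ch in zip(real_cepstrum(e, N), real_cepstrum(h, N))]
max_diff = max(abs(p - q) for p, q in zip(c_x, c_sum))
print(f"largest difference: {max_diff:.2e}")
```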




