Speech Signal Representations I. Seminar Speech Recognition 2002. F.R. Verhage.

Speech Signal Representations I


Page 1: Speech Signal Representations I

Speech Signal Representations I

Seminar Speech Recognition 2002

F.R. Verhage

Page 2: Speech Signal Representations I

Decomposition of the speech signal (x[n]) as a source (e[n]) passed through a linear time-varying filter (h[n]).
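A minimal sketch of this decomposition, assuming the filter stays fixed over one short segment so that the time-varying filter reduces to an ordinary convolution. The impulse-train source `e` and the three-tap impulse response `h` are hypothetical values chosen for illustration; they are not taken from the slides.

```python
def convolve(e, h):
    """Linear convolution x[n] = sum_k h[k] * e[n - k]."""
    x = [0.0] * (len(e) + len(h) - 1)
    for n, e_n in enumerate(e):
        for k, h_k in enumerate(h):
            x[n + k] += e_n * h_k
    return x

# Hypothetical source: impulse train with a period of 4 samples.
e = [1.0, 0.0, 0.0, 0.0] * 3
# Hypothetical filter: a short, decaying impulse response.
h = [1.0, 0.5, 0.25]

# Each impulse in e launches one copy of h in the output.
x = convolve(e, h)
```

Each pitch pulse in the source excites one copy of the filter response, which is the picture behind the source-filter model.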

Page 3: Speech Signal Representations I

Estimation of the filter, inspired by:

Speech production models
– Linear Predictive Coding (LPC)
– Cepstral analysis

Speech perception models (part II)
– Mel-frequency cepstrum
– Perceptual Linear Prediction (PLP)

Speech recognizers estimate filter characteristics and ignore the source

Page 4: Speech Signal Representations I

Short-Time Fourier Analysis

Spectrogram
– Representation of a signal highlighting several of its properties based on short-time Fourier analysis
– Two dimensional: time horizontal and frequency vertical
– Third ‘dimension’: gray or color level indicating energy

Page 5: Speech Signal Representations I

Short-Time Fourier Analysis

Spectrogram
– Narrow band
    Long windows (> 20 ms) → narrow bandwidth
    Lower time resolution, better frequency resolution
– Wide band
    Short windows (< 10 ms) → wide bandwidth
    Good time resolution, lower frequency resolution
– Pitch synchronous
    Requires knowledge of local pitch period
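The trade-off between window length and bandwidth can be made concrete with a rough rule of thumb: the null-to-null main-lobe width of a window of duration T is about 2/T Hz for a rectangular window and about 4/T Hz for a Hamming window. These factors are standard window properties rather than something stated on the slide, and the function name below is hypothetical.

```python
def main_lobe_width_hz(window_ms, window="hamming"):
    """Approximate null-to-null main-lobe width (Hz) of a window lasting window_ms."""
    # Rule-of-thumb factors: 2/T for rectangular, 4/T for Hamming.
    factor = {"rectangular": 2.0, "hamming": 4.0}[window]
    return factor / (window_ms / 1000.0)

narrow_band = main_lobe_width_hz(25.0)  # long window, narrow analysis bandwidth
wide_band = main_lobe_width_hz(5.0)     # short window, wide analysis bandwidth
```

Under these assumptions a 25 ms Hamming window has a main lobe of roughly 160 Hz, while a 5 ms window spreads over roughly 800 Hz.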

Page 6: Speech Signal Representations I

Short-Time Fourier Analysis

Spectrogram

Page 7: Speech Signal Representations I

Short-Time Fourier Analysis

Window analysis
– Series of short segments, analysis frames
– Short enough so that the signal is approximately stationary within a frame
– Frame length usually constant, 20-30 ms
– Overlaps possible
– Different types of window functions (w_m[n]):
    Rectangular (equal to no window function)
    Hamming
    Hanning

$$X_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x_m[n]\, e^{-j\omega n} = \sum_{n=-\infty}^{\infty} w[m-n]\, x[n]\, e^{-j\omega n}$$
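The short-time Fourier transform on this slide can be sketched for a single frame as follows. This is an illustrative sketch, not the author's implementation: it samples the frequency axis at the DFT frequencies 2πk/N, indexes the frame by its start sample instead of using the w[m-n] convention, and uses a hypothetical 1 kHz test tone at an assumed 8 kHz sampling rate.

```python
import cmath
import math

def hamming(N):
    """Symmetric Hamming window of N samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def short_time_spectrum(x, start, window):
    """DFT of the windowed frame x_m[n] = w[n] * x[start + n]."""
    N = len(window)
    frame = [window[n] * x[start + n] for n in range(N)]
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Hypothetical test signal: 1 kHz sinusoid sampled at 8 kHz.
fs = 8000
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]

# One 20 ms frame (160 samples), so DFT bins are 50 Hz apart.
X = short_time_spectrum(x, start=80, window=hamming(160))
magnitudes = [abs(v) for v in X[:81]]
peak_bin = magnitudes.index(max(magnitudes))
peak_hz = peak_bin * fs / 160
```

Repeating this for successive (possibly overlapping) frames and plotting the magnitudes over time gives a spectrogram.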

Page 8: Speech Signal Representations I

Short-Time Fourier Analysis

Window analysis
– Window size must be long enough (N: window length, M: pitch period, both in samples)
    Rectangular: N ≥ M
    Hamming, Hanning: N ≥ 2M
– Pitch period not known in advance → prepare for the lowest pitch (longest pitch period) → at least 20 ms for rectangular or 40 ms for Hamming/Hanning (50 Hz)
– But longer windows give a more averaged spectrum instead of distinct spectra → rectangular window has better time resolution
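The window-length rule on this slide can be worked through numerically. The sketch below assumes a sampling rate of 8 kHz, which the slides do not specify, and the function name is hypothetical.

```python
import math

def min_window_samples(lowest_pitch_hz, sample_rate_hz, window="rectangular"):
    """Smallest N with N >= M (rectangular) or N >= 2M (Hamming/Hanning)."""
    periods = {"rectangular": 1, "hamming": 2, "hanning": 2}[window]
    pitch_period = sample_rate_hz / lowest_pitch_hz  # M, in samples
    return math.ceil(periods * pitch_period)

fs = 8000  # assumed sampling rate in Hz
n_rect = min_window_samples(50, fs, "rectangular")
n_hamming = min_window_samples(50, fs, "hamming")

# Convert back to milliseconds to compare with the slide.
ms_rect = 1000.0 * n_rect / fs
ms_hamming = 1000.0 * n_hamming / fs
```

For a lowest pitch of 50 Hz the pitch period is 20 ms, so this reproduces the 20 ms and 40 ms figures quoted above.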

Page 9: Speech Signal Representations I

Short-Time Fourier Analysis

Page 10: Speech Signal Representations I

Short-Time Fourier Analysis

Page 11: Speech Signal Representations I

Short-Time Fourier Analysis

Page 12: Speech Signal Representations I

Short-Time Fourier Analysis

Page 13: Speech Signal Representations I

Short-Time Fourier Analysis

Page 14: Speech Signal Representations I

Short-Time Fourier Analysis

Page 15: Speech Signal Representations I

Short-Time Fourier Analysis

Page 16: Speech Signal Representations I

Short-Time Fourier Analysis

Window analysis
– Frequency response not completely zero outside main lobe → spectral leakage
– Second lobe of a Hamming window is approx. 43 dB below main lobe → less spectral leakage
– Hamming, Hanning, triangular windows offer less spectral leakage → rectangular windows are rarely used despite their better time resolution
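The leakage comparison can be checked numerically by evaluating each window's frequency response on a dense grid and measuring the highest side lobe relative to the main-lobe peak. This is a rough sketch: the 64-sample window length and the grid density are arbitrary choices, and the helper names are hypothetical.

```python
import cmath
import math

def hamming(N):
    """Symmetric Hamming window of N samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def peak_sidelobe_db(window, grid=1024):
    """Highest side-lobe level in dB relative to the main-lobe peak at omega = 0."""
    mags = []
    for i in range(grid + 1):
        omega = math.pi * i / grid
        mags.append(abs(sum(w * cmath.exp(-1j * omega * n)
                            for n, w in enumerate(window))))
    # Walk down the main lobe until the response stops decreasing (first null).
    i = 0
    while i + 1 < len(mags) and mags[i + 1] < mags[i]:
        i += 1
    # Everything past the first null belongs to the side lobes.
    return 20 * math.log10(max(mags[i:]) / mags[0])

rect_db = peak_sidelobe_db([1.0] * 64)
hamming_db = peak_sidelobe_db(hamming(64))
```

With these settings the rectangular window's highest side lobe sits only about 13 dB below the main lobe, while the Hamming window's side lobes are on the order of 40 dB down, in line with the figure quoted on the slide.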

Page 17: Speech Signal Representations I

Short-Time Fourier Analysis

Page 18: Speech Signal Representations I

Short-Time Fourier Analysis

Page 19: Speech Signal Representations I

Short-Time Fourier Analysis

Page 20: Speech Signal Representations I

Short-Time Fourier Analysis

Page 21: Speech Signal Representations I

Short-Time Fourier Analysis

Short-time spectrum of male voice speech
a) Time signal /ah/, local pitch 110 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Page 22: Speech Signal Representations I

Short-Time Fourier Analysis

Short-time spectrum of female voice speech
a) Time signal /aa/, local pitch 200 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Page 23: Speech Signal Representations I

Short-Time Fourier Analysis

Short-time spectrum of unvoiced speech
a) Time signal
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Page 24: Speech Signal Representations I

Linear Predictive Coding

LPC a.k.a. auto-regressive (AR) modeling

All-pole filter is a good approximation of speech, with p as the order of the LPC analysis:

$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$

Predicts the current sample as a linear combination of the past p samples:

$$\tilde{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k]$$

Page 25: Speech Signal Representations I

Linear Predictive Coding

To estimate the predictor coefficients (a_k), use a short-term analysis technique

Per segment, minimize the total squared prediction error:

$$E_m = \sum_n e_m^2[n] = \sum_n \left(x_m[n] - \tilde{x}_m[n]\right)^2 = \sum_n \left(x_m[n] - \sum_{k=1}^{p} a_k\, x_m[n-k]\right)^2$$

Take the derivative with respect to each a_i, equate it to 0; expressed as a set of p linear equations, the Yule-Walker equations:

$$\sum_{k=1}^{p} a_k\, \phi_m[i,k] = \phi_m[i,0], \qquad i = 1, \ldots, p$$

where $\phi_m[i,k] = \sum_n x_m[n-i]\, x_m[n-k]$ are the correlation coefficients of the segment.
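As an illustrative sketch, the p linear equations can be built and solved directly for a small order. The code assumes the sums run over n = p..N-1 inside the segment (the covariance-method convention) and uses plain Gaussian elimination rather than any specialised solver. The test segment, the impulse response of a known second-order all-pole filter, is hypothetical.

```python
def correlation_matrix(x, p):
    """phi[i][k] = sum_{n=p}^{N-1} x[n-i] * x[n-k] for i, k = 0..p."""
    N = len(x)
    return [[sum(x[n - i] * x[n - k] for n in range(p, N)) for k in range(p + 1)]
            for i in range(p + 1)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    a = [0.0] * n
    for r in reversed(range(n)):
        a[r] = (M[r][n] - sum(M[r][c] * a[c] for c in range(r + 1, n))) / M[r][r]
    return a

def lpc_covariance(x, p):
    """Solve sum_k a_k phi[i][k] = phi[i][0] for i = 1..p."""
    phi = correlation_matrix(x, p)
    A = [[phi[i][k] for k in range(1, p + 1)] for i in range(1, p + 1)]
    b = [phi[i][0] for i in range(1, p + 1)]
    return solve(A, b)

# Hypothetical segment: impulse response of 1 / (1 - 1.3 z^-1 + 0.8 z^-2).
true_a = [1.3, -0.8]
x = [1.0, 1.3]
for n in range(2, 60):
    x.append(true_a[0] * x[n - 1] + true_a[1] * x[n - 2])

a = lpc_covariance(x, p=2)
```

Because the segment is exactly predictable from its two previous samples over the summation range, the recovered coefficients should match the ones used to generate it up to rounding.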

Page 26: Speech Signal Representations I

Linear Predictive Coding

Solution of the Yule-Walker equations:
– Any standard matrix inversion package
– Due to the special form of the matrix, efficient solutions:
    Covariance method
      using the Cholesky decomposition
    Autocorrelation method
      using windows, results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
    Lattice method
      equivalent to Levinson-Durbin recursion
      often used in fixed-point implementations because lack of precision doesn’t result in unstable filters
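A minimal sketch of the Durbin recursion for the autocorrelation method, assuming the autocorrelation values R[0..p] of a windowed segment are already available. It uses the sign convention of the predictor above (predicted sample = sum of a_k x[n-k]); other texts flip the sign of the reflection coefficients. The example values in `R` are hypothetical.

```python
def levinson_durbin(R, p):
    """Solve sum_k a_k R[|i-k|] = R[i], i = 1..p, for a Toeplitz system.

    Returns (predictor coefficients a_1..a_p, reflection coefficients, final error).
    """
    a = [0.0] * (p + 1)  # a[0] is unused so that indices match a_k
    error = R[0]
    reflection = []
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / error
        # Update the order-(i-1) predictor to order i.
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
        reflection.append(k)
    return a[1:], reflection, error

# Hypothetical autocorrelation values R[0], R[1], R[2].
R = [1.0, 0.5, 0.2]
a, reflection, error = levinson_durbin(R, p=2)
```

The recursion produces the reflection coefficients as a by-product, which are the quantities the lattice formulation works with; their magnitudes staying below 1 goes hand in hand with a stable all-pole filter.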

Page 27: Speech Signal Representations I

Linear Predictive Coding

Page 28: Speech Signal Representations I

Linear Predictive Coding

Page 29: Speech Signal Representations I

Linear Predictive Coding

Spectral analysis via LPC
– All-pole (IIR) filter
– Spectral peaks near the roots of the denominator A(z)
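To illustrate, the LPC spectrum can be evaluated by sampling 1/|A(z)| on the unit circle. The sketch below uses a hypothetical second-order predictor whose pole pair is placed at 1 kHz with radius 0.95, at an assumed 8 kHz sampling rate.

```python
import cmath
import math

def lpc_magnitude(a, omega):
    """|H(e^{j omega})| = 1 / |1 - sum_k a_k e^{-j omega k}|."""
    A = 1.0 - sum(a_k * cmath.exp(-1j * omega * (k + 1)) for k, a_k in enumerate(a))
    return 1.0 / abs(A)

fs = 8000.0  # assumed sampling rate in Hz
# Hypothetical pole pair at radius r and angle theta (1 kHz).
r, theta = 0.95, 2 * math.pi * 1000.0 / fs
a = [2 * r * math.cos(theta), -r * r]

# Sample the spectrum every 5 Hz from 0 Hz to fs/2.
freqs = [float(f) for f in range(0, 4001, 5)]
spectrum = [lpc_magnitude(a, 2 * math.pi * f / fs) for f in freqs]
peak_hz = freqs[spectrum.index(max(spectrum))]
```

The spectral peak lands close to the pole angle, and it gets sharper as the pole radius approaches 1.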

Page 30: Speech Signal Representations I

Linear Predictive Coding

Prediction error
– Should be (approximately) the excitation
– Unvoiced speech, expect white noise; OK
– Voiced speech, expect impulse train; not OK
    All-pole assumption not altogether valid
    Real speech not perfectly periodic
    Pitch synchronous analysis gives better results
– LPC order
    Larger p gives lower prediction errors
    Too large a p results in fitting the individual harmonics → separation between filter and source will not be so good

Page 31: Speech Signal Representations I

Linear Predictive Coding

Prediction error
– Inverse LPC filter gives residual signal
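A small sketch of inverse filtering, assuming the predictor coefficients are already known and that samples before the start of the segment are zero. The synthetic signal below is generated from a hypothetical impulse-train excitation, so the residual should reproduce that excitation; real speech will not separate this cleanly.

```python
def lpc_residual(x, a):
    """e[n] = x[n] - sum_k a_k x[n-k], with x[n] = 0 before the segment starts."""
    p = len(a)
    residual = []
    for n in range(len(x)):
        predicted = sum(a[k] * x[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        residual.append(x[n] - predicted)
    return residual

# Hypothetical predictor and excitation (impulse every 20 samples).
a = [1.3, -0.8]
excitation = [1.0 if n % 20 == 0 else 0.0 for n in range(60)]

# Synthesize x[n] by running the excitation through the all-pole filter 1/A(z).
x = []
for n in range(60):
    past = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
    x.append(excitation[n] + past)

# Inverse filtering with A(z) recovers the excitation.
residual = lpc_residual(x, a)
```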

Page 32: Speech Signal Representations I

Linear Predictive Coding

Alternatives for the predictor coefficients
– Line Spectral Frequencies
    Local sensitivity
    Efficiency
– Reflection Coefficients
    Guaranteed stable → useful for coefficients interpolated over time
– Log-area ratios
    Flat spectral sensitivity
– Roots of the polynomial
    Represent resonance frequencies and bandwidths
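For the last item, a sketch of how one pole pair maps to a resonance frequency and bandwidth, restricted to a second-order predictor so the roots follow from the quadratic formula. The mapping used (frequency from the pole angle, bandwidth from the log of the pole radius) is the usual textbook relation and is not given on the slide; the sampling rate and coefficients are hypothetical.

```python
import cmath
import math

def resonance_from_second_order(a1, a2, fs):
    """(frequency_hz, bandwidth_hz) of the pole pair of 1 / (1 - a1 z^-1 - a2 z^-2)."""
    # Poles are the roots of z^2 - a1 z - a2 = 0.
    z = (a1 + cmath.sqrt(a1 * a1 + 4.0 * a2)) / 2.0
    frequency = abs(cmath.phase(z)) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(z)) * fs / math.pi
    return frequency, bandwidth

fs = 8000.0  # assumed sampling rate in Hz
# Hypothetical predictor with poles at radius 0.95 and angle pi/4.
r, theta = 0.95, math.pi / 4
frequency, bandwidth = resonance_from_second_order(2 * r * math.cos(theta), -r * r, fs)
```

For higher orders the same mapping applies to each complex root of A(z), but finding the roots then needs a general polynomial root finder.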

Page 33: Speech Signal Representations I

Cepstral Processing

– A homomorphic transformation converts a convolution into a sum:

$$x[n] = e[n] * h[n]$$

$$\hat{x}[n] = \hat{e}[n] + \hat{h}[n]$$
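One concrete instance of such a transformation is the real cepstrum, the inverse DFT of the log-magnitude spectrum. It discards phase, so it is only one possible choice and may differ from the variant intended on the slide. The sketch below checks the convolution-to-sum property on two short hypothetical sequences, using a DFT long enough that circular and linear convolution coincide.

```python
import cmath
import math

def dft(x, N):
    """N-point DFT of x (implicitly zero-padded to N samples)."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(len(x)))
            for k in range(N)]

def real_cepstrum(x, N):
    """Inverse DFT of log|X[k]|; the log-magnitude is even, so the result is real."""
    log_mag = [math.log(abs(X_k)) for X_k in dft(x, N)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]

def convolve(e, h):
    """Linear convolution x[n] = sum_k h[k] * e[n - k]."""
    x = [0.0] * (len(e) + len(h) - 1)
    for n, e_n in enumerate(e):
        for k, h_k in enumerate(h):
            x[n + k] += e_n * h_k
    return x

N = 16
e = [1.0, 0.5]        # hypothetical source
h = [1.0, -0.3, 0.2]  # hypothetical filter
x = convolve(e, h)

c_x = real_cepstrum(x, N)
c_sum = [ce + ch for ce, ch in zip(real_cepstrum(e, N), real_cepstrum(h, N))]
max_diff = max(abs(u - v) for u, v in zip(c_x, c_sum))
```

The cepstrum of the convolved signal matches the sum of the individual cepstra up to rounding, which is what lets source and filter be separated additively.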

Page 34: Speech Signal Representations I

Cepstral Processing

Page 35: Speech Signal Representations I

Cepstral Processing